Hello everyone, welcome to our talk. Let me introduce myself first. My name is Diego Comas. I'm the head of security at Sourcegraph. I've been protecting cloud-native environments and Kubernetes environments at many companies. I love all the security that can be enabled in cloud-native projects, so I'd love to talk about all of this here. Hi everyone, I'm Christophe. I work at Datadog, mostly focusing on cloud security and open source topics. Take it away. Thank you. So today we would like to talk about common pitfalls in managed Kubernetes environments and some traps you can get yourself into, because there are many things that can happen there. Just to start, how many of you are working with Amazon EKS? Okay, quite a few. How many of you with Google GKE and Azure? Oh, quite a few. How many of you are multi-cloud? Well, that's great. Thank you. We'll have something for all of you. At the end, we're going to show you that we're releasing a tool that is going to help you mitigate and handle some of the pitfalls and problems that we're going to show you here. So stay tuned, because it's definitely going to help you. Cool. So we're going to be following the journey of Kate today. Kate is a cloud-native software engineer, and she's going to build a cat classifier app. It's a very simple app: you give it a picture of a cat and it gives you back what kind of cat it is. Kate has an AWS account with an EKS cluster already deployed, and she's going to deploy her application on that cluster. So she authenticates to AWS, she runs the right command to get her kubeconfig, and she's a bit confused because she doesn't have any kind of access. She gets access denied, even though she's a full admin on the AWS account. And that's the first thing that we generally run into. The first trick is that when you're an admin on AWS, you don't necessarily have any kind of permissions on your Kubernetes cluster.
And this is because permissions are managed inside of the cluster, not through AWS IAM. This also adds a complexity: whoever creates the AWS EKS cluster has invisible and irremovable rights to it. So if it was maybe Kate's friend Jenny who created the cluster two years ago, she will still be a full admin of it, but you don't see that anywhere and you cannot remove it. This is the default config map that you have in an EKS cluster, the aws-auth config map. It's a config map that basically maps AWS identities to Kubernetes groups. And in this case, this is the default one that says: okay, I want my nodes to have system:nodes permissions. But as we see here, Kate didn't create the cluster. She's not in there. So she doesn't have any kind of permissions. This has some interesting implications for incident response as well, not only security, because when something starts to break, you don't want to find out that you don't have access at 3 AM in the morning when you get paged. It also means that it's actually very hard to answer who is admin of my EKS cluster, because you would have to go back in time, find out who created your cluster, and know that they are still administrators. So that's an interesting challenge. So now we want to give Kate admin access to our cluster. Even though she's an admin of our AWS account, we need to explicitly map her identity. So we say: okay, if you are an account admin in AWS, you should also be in system:masters in Kubernetes. And now she has the right access and she can start to work. That's for EKS. In the other clouds: in Azure, you can do both. You can manage access to the cluster through Azure RBAC; there are roles specific to the Kubernetes API that are going to give you different levels of access in there. You can also give access with a cluster role binding.
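To make the mapping concrete, here is a sketch of what the aws-auth config map might look like after adding Kate's identity; the account ID and role names are illustrative, not from the talk:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: aws-auth
  namespace: kube-system
data:
  mapRoles: |
    # Default entry created with the cluster: worker nodes join as system:nodes
    - rolearn: arn:aws:iam::111122223333:role/eks-node-role
      username: system:node:{{EC2PrivateDNSName}}
      groups:
        - system:bootstrappers
        - system:nodes
    # Added for Kate: AWS account admins also become cluster admins
    - rolearn: arn:aws:iam::111122223333:role/account-admin
      username: account-admin
      groups:
        - system:masters
```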
So you just create a cluster role binding and you define: this Azure AD group should have this level of access in there. In GCP, you can do both as well. You have GCP roles: Kubernetes Engine Cluster Viewer, Cluster Admin, et cetera. But the main issue with those is that you can only assign them at the project level. Some people want to give access at the level of a specific cluster, and for that, you would also have to use a role binding or a cluster role binding where you reference Google groups or Google users. Now she has an application running in EKS. So Diego, what's next? Yeah. So now Kate is able to interact with the cluster and she wants to deploy an application. As we said before, the cat classifier is going to receive an input where the user sends the cat image. Then, for example, with this type of request, the cat classifier application will fetch more information externally to validate the type of breed. But this is the happy path. Sometimes, when there's a malicious attacker, they can try to trick the application into fetching information it's not supposed to and reaching somewhere it shouldn't. That's one way of describing server-side request forgery, which is one of the elements in the latest version of the OWASP Top 10. And this is very problematic, because if the application doesn't have enough controls and isn't properly sanitizing its input, a malicious attacker can point it at the instance metadata service to fetch information that they shouldn't have access to. The attacker can request this URL, and then the cat classifier application will return the information that's available from within the pod. In this case, that will be all the credentials of the node. So now we have a server-side request forgery against the instance metadata service. And you need to remember that this is one of the most common causes of public cloud breaches: misconfigurations, and applications not handling these things properly.
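To illustrate the class of bug being described, here is a minimal, hypothetical sketch of an SSRF guard in Python. It only checks literal IP addresses; a real defense would also have to resolve hostnames and re-check the resolved IP (DNS rebinding), so treat this as illustrative only:

```python
import ipaddress
from urllib.parse import urlparse

# Ranges where cloud metadata services and internal endpoints live.
BLOCKED_NETS = [
    ipaddress.ip_network("169.254.0.0/16"),  # link-local, incl. 169.254.169.254 (IMDS)
    ipaddress.ip_network("127.0.0.0/8"),     # loopback
    ipaddress.ip_network("10.0.0.0/8"),      # private ranges
    ipaddress.ip_network("172.16.0.0/12"),
    ipaddress.ip_network("192.168.0.0/16"),
]

def is_url_allowed(url: str) -> bool:
    """Reject URLs whose host is a literal IP inside a blocked range."""
    host = urlparse(url).hostname
    if host is None:
        return False
    try:
        addr = ipaddress.ip_address(host)
    except ValueError:
        # Not a literal IP: a real check must resolve the name and
        # verify the connected address, which this sketch skips.
        return True
    return not any(addr in net for net in BLOCKED_NETS)
```

With a guard like this, the demo request to `http://169.254.169.254/latest/meta-data/` would be rejected before the application ever fetches it.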
So it's important that you're aware, because it's very appealing for attackers. They can steal the worker node's cloud credentials, they can authenticate to the cluster as a worker node, and they can also impersonate any pod that's running on that node. This is a diagram of how the attacker, via pod A, is trying to request the credentials to be able to use the node's instance role. Once the information is retrieved, the attacker is authenticated to the cluster in the system:nodes group, and then they're able to create a new token for another pod running on the same node. They will then have in their hands a service account token for pod B, so they can authenticate as pod B and access everything that pod B has permission to access. So that's something you need to be very aware of. We're going to have a quick demo. I really wanted this to be a recorded demo, but someone told me that you have to do a live demo for your first KubeCon. So let's do that. Here we have an application that's running inside of EKS, which you can use to test any website. You give it google.com and it's going to tell you: okay, this works, I can fetch google.com. If you pass it the URL of the instance metadata service, we see that we can access information. Again, this is the metadata service of the node. If you request the latest metadata, you see that we have information about the node, including security credentials, which are the AWS credentials attached to the instance role. And this is the role attached to the worker node that runs my pod. So I can just do that, and boom, I get back the AWS credentials of the node. So what can I do with that? I open my shell. I'm just going to put that into a JSON file. Then I'm going to export that into my shell: access key ID, secret key, session token. And what do I get here? Well, I get to know that I'm the attacker on my own laptop.
I'm authenticated as the node against AWS, which means that, by design, I'm also authenticated as the node against the cluster. So what can I do with that? First thing, I can get the pods. I can see the pods that are running in the default namespace: only one application. I can get all the rest of the pods. Zooming is a little bit hard here, but basically I have a few pods that I can try to impersonate. Why? Because I'm the node, and as the node, I'm able by design to create service account tokens for the pods that run on me. That's because the kubelet is responsible for creating the service account tokens and injecting them into the pods. So I can just say: these are interesting microservices, I'm going to try and impersonate the inventory service. So I'm going to create a token for the inventory service. This is just saying: I'm doing that for a pod, this is the name of the pod, and this is the UID of the pod, which you can find from the pod definition as well. We do that, we get back a JSON web token, and if we decode it, we see that the subject is indeed the inventory service account. So what can we do with that? We compromised the service account; what can we do with it? Well, we can see here that it's an admin service account, which means that with a very simple vulnerability, you are able to pivot from being unauthenticated against a web application to being a full cluster admin. And privileged service accounts are pretty common. Generally, you have a bunch of daemon sets that run as privileged for various reasons. Not only that, but you also have some level of access against AWS. As we'll see, it's somewhat limited, but you can still do things like: show me all the instances in the account, including the network configuration, what is exposed to the internet, the user data that they have, which sometimes contains passwords.
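The decoding step from the demo — inspecting the subject claim of the returned JSON web token — can be sketched in Python. The service account name mirrors the demo; the token below is fabricated for illustration, whereas a real token from `kubectl create token` would be signed by the cluster:

```python
import base64
import json

def decode_jwt_payload(token: str) -> dict:
    """Decode the payload of a JWT without verifying the signature,
    just to inspect its claims (as done in the demo)."""
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))

# Fabricated token with the same shape as the one from the demo.
claims = {"sub": "system:serviceaccount:default:inventory-service"}
payload = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode().rstrip("=")
fake_token = "eyJhbGciOiJSUzI1NiJ9." + payload + ".signature"
print(decode_jwt_payload(fake_token)["sub"])
```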
You can also do various things like look at the network configuration. So, all sorts of interesting things that you probably don't want an attacker to do. Now, back to the slides. What did we do? Because it was a bit fast. First, we compromised the node credentials from an application-level vulnerability, which means that we were authenticated both against AWS and against Kubernetes as a node, by design. Which means that, also by design, we were able to create a service account token for any pod running on the node. And we found one with cluster admin rights. Now, a bit on the impact and how easy or not it is to exploit this. In AWS, the worker nodes have an IAM role, which is limited, but still has some privileges. Three interesting things. You can pull all the container images in the account. That's by design: you're supposed to be able to pull stuff. But this is across all the regions in your account, so you probably don't want an attacker to do that either. You can describe, list, and nuke all the network interfaces in the account as well. That could be funny. And again, you can list all the clusters, the EC2 instances, the VPCs. So it's not terrible, but it's not great either. And by default, unless you have version 2 of the instance metadata service enforced, this is going to work. So by default, it's going to be exploitable in AWS. In Azure, by default, your nodes don't have permissions. Your worker nodes have a managed identity assigned, but by default, they don't have anything attached to this role; permissions get added incrementally. For instance, when you enable the integration with Azure Container Registry, the permission is added that says they should be able to pull from there. So the impact is less important. And also, sorry, Diego.
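On the AWS side, the mitigation just mentioned — enforcing IMDSv2 — can be applied to an existing instance with the AWS CLI; the instance ID here is illustrative:

```shell
# Require IMDSv2 (session tokens) on an instance, so the single-request
# IMDSv1 GET used in the SSRF demo no longer returns credentials.
# A hop limit of 1 also keeps containers on a bridged pod network from
# reaching IMDS at all, since their traffic takes an extra network hop.
aws ec2 modify-instance-metadata-options \
    --instance-id i-0123456789abcdef0 \
    --http-tokens required \
    --http-put-response-hop-limit 1
```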
Also, when you try to talk to the instance metadata service in Azure, it requires that you add a specific header, Metadata, to show that you are intending to call it and not something else. So here I have an application that's vulnerable to SSRF as well, and if we try to call the Azure metadata service, we see that we get a bad request response. For GCP, it's a bit different. When you create a VM, and this applies to worker nodes as well, it uses the default service account with a limited scope. But this scope does include devstorage.read_only, which means that you can read all the GCS buckets and the BigQuery tables. And when you think about it, that's where your data is, right? So it's not great. GCP also requires you to add a specific header to show that you are intending to call the metadata service. Now, these are interesting countermeasures. These headers are useful to counter exploitation of server-side request forgery, but they don't solve the issue that you don't want someone inside the pod to be able to steal your node credentials. Typically, if you exploit a command injection, a SQL injection or something like that, you are still going to be able to call the metadata service, by design. Here, I have an application that's vulnerable to command injection, and if I can run my own command, obviously, I can just add the header that they want. So, the point being: these headers that you have to add are very useful to prevent exploitation of server-side request forgery, where someone can make your application request any URL they want, but they don't solve the overall problem that a pod can steal all the credentials of the node. And attackers love credentials, right? The screenshot on the top is from TeamTNT. TeamTNT is an attacker group that loves cloud environments. Here, you can see they are basically scanning for open kubelets.
And when they find one, they run this command, which is essentially running a new pod that basically just does a wget against the instance metadata service to steal node credentials. This is another report, by Mandiant, where they show a threat actor exploiting known vulnerabilities in applications exposed to the internet to do just the same. So it's very popular because it's very easy: you get back AWS credentials that you can take to your laptop and go from there. So, how do we mitigate pods stealing their node's credentials and pivoting just like we did in the demo? A few things. In GCP, it's very simple: you just have to enable GKE workload identity. It's on by default for Autopilot, and it's going to be on by default soon for all new GKE clusters. For Azure and AWS, it's a bit more complex. We'll see why in a bit, but basically, you have to explicitly block the instance metadata service; you have to block network access to it. So, Diego, now we have our application in the cloud. How do we make it actually use the cloud? Thank you. So, yeah, we want our application to be scalable, but also to leverage cloud services, like the scalability of S3 buckets or GCS or other services. So our application needs to authenticate to interact with them. So, let's say: how do I authenticate my application, the cat classifier, against the cloud provider API? I'm a little bit lazy, but I want a clever mind to tell me something to be more productive. So I ask ChatGPT: how do I give my Kubernetes pod access to upload things to S3 buckets? And ChatGPT says: well, just create an IAM user and then you will have write access. Okay.
So, the default process, then: someone like Kate will create a Kubernetes secret with the specific data, which will be static credentials to access S3 with the specific permissions of the IAM user. But we know that static, long-lived credentials usually end up being leaked if they're not taken care of properly. We have seen many cases of companies committing secrets to their GitHub repos, or exposing them in their CI/CD pipelines, so this is not the ideal way today. There are easier and more secure ways to do it: interacting with the cloud provider natively is much better. We have an example: the npm security breach, where there was a static token, someone was able to get it and access other repos, and then use cloud credentials to get more information from there. So, remember: for authenticating against cloud services, it's better to use a platform-managed identity, so that your workload identifies natively with your cloud service instead of using static credentials. In AWS you can use IAM Roles for Service Accounts, in GKE you can use workload identity, and in Azure, the older pod identity mechanism is deprecated in favor of the new Azure AD workload identity, currently in preview. So this is our recommendation: use the modern way, easier and more secure, for your workloads to authenticate. And here we can see how you can include this information in the service account that your pod is going to use. We define an annotation that specifies the role that you want your application to use; in this case, you can see the cat-classifier role. Then, in the pod spec, you add the service account name that you want to use. This is automatically injected into the file system of the pod, so you can see that internally the right service account is defined, with the right cat-classifier role name.
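A sketch of the IRSA setup just described — the annotation on the service account and the reference in the pod spec. The account ID, namespace, and image are illustrative, not from the talk:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: cat-classifier
  namespace: default
  annotations:
    # The IAM role this workload should be able to assume
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/cat-classifier
---
apiVersion: v1
kind: Pod
metadata:
  name: cat-classifier
  namespace: default
spec:
  serviceAccountName: cat-classifier   # ties the pod to the annotated identity
  containers:
    - name: app
      image: registry.example.com/cat-classifier:latest
```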
Then, even if you want to validate it by exec'ing into the pod, you can see that when you interact with AWS to authenticate, you get the specific role, cat-classifier, for your application. Needless to say, when you do that, as a developer, as someone who runs applications, you don't need to care about how this works. Now, let's still have a look at how it works. Basically, it's just a matter of exchanging your Kubernetes credentials for some AWS credentials. And the way it works is with the AWS STS AssumeRoleWithWebIdentity API, which is basically a way to say: here is a JSON web token, give me credentials for this role. So here we say: okay, this is the JSON web token that I have, it's signed by my Kubernetes cluster, and this is the role that I want. And if the role's trust policy allows the right pod, it's going to give me back AWS credentials. So it's really a matter of exchanging Kubernetes credentials and getting back cloud credentials. So, back to ChatGPT. I think we can say: yeah, but it's an AI, it doesn't know about IAM roles for service accounts. The funny thing is that if you ask it, okay, how can I make this more secure, it's going to tell you: well, just use IAM roles for service accounts. Which I think begs the question: why didn't you tell me from the start? And then it tells you: I don't make assumptions about your security requirements, which I think is a little bit easy. And it raises the point that we are using AI, and it's great, but we need to understand that it was trained on a massive data set that contains insecure things, and it tends to reproduce the insecure patterns that it sees. So it's a great tool, I use it every day, but we just need to be aware of that and deal with it. In the other clouds: in Azure, it's very similar.
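To make the exchange concrete, this is roughly the call IRSA performs under the hood. The two environment variables are the ones the EKS webhook injects into the pod; the session name is arbitrary:

```shell
# Exchange the cluster-signed JSON web token for temporary AWS credentials.
# AWS_ROLE_ARN and AWS_WEB_IDENTITY_TOKEN_FILE are injected into the pod
# by the EKS pod identity webhook; the AWS SDKs do this call automatically.
aws sts assume-role-with-web-identity \
    --role-arn "$AWS_ROLE_ARN" \
    --role-session-name cat-classifier \
    --web-identity-token "$(cat "$AWS_WEB_IDENTITY_TOKEN_FILE")"
```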
So, you just annotate your service account with the identity you want, and you get a JSON web token in your pod that you can exchange similarly. So, very similar; I won't expand on that. In Google Cloud, it's very different. When you turn on workload identity, it creates a DaemonSet in your cluster, one pod per node, which runs something called the GKE metadata server, which, as far as I know, is not open source; it's pretty much a black box. And these pods are basically going to hijack all the requests from the pods going to the metadata service. As we can see here, this magically creates some iptables rules on the node that hijack traffic to the metadata service and send it to this pod. And the GKE metadata server pod binds to a link-local IP here, which means that when you are inside a pod and you call the metadata service, you're going to hit these pods instead. Now, to sum up: how do we prevent our pods from stealing our node credentials? This builds on our previous slide. In GKE, enabling workload identity is enough. Why? Because even if your pods hit the metadata service, they're automatically going to get back credentials for their own identity, not the node credentials. In AWS and Azure, it's very different. What you get, as we saw, is an extra JSON web token on your file system, but you can still call the metadata service. So you have to explicitly block it, with a network policy or with iptables rules, something like this. Again: in GKE, workload identity is enough; in AWS and Azure, you have to explicitly block it. One exception is that pods using host networking are, by design, not constrained by this, so they will be able to call it, which kind of makes sense. Now, Diego, we want to call an external service called CatGPT. So how do we do that? Yeah, thank you.
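As a quick aside, the explicit block for AWS and Azure described above could be expressed as a Kubernetes NetworkPolicy like the following; the namespace is illustrative, it requires a CNI that enforces network policies, and, as noted in the talk, hostNetwork pods are not affected by it:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-imds
  namespace: default
spec:
  podSelector: {}        # applies to every pod in the namespace
  policyTypes:
    - Egress
  egress:
    - to:
        - ipBlock:
            cidr: 0.0.0.0/0
            except:
              - 169.254.169.254/32   # the instance metadata service
```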
So we continue with our application. Apart from leveraging the cloud services that we authenticate to with workload identity or other means, we also want to leverage third-party services to do more things with our application. Popular options for this nowadays are things like the External Secrets Operator or the Kubernetes Secrets Store CSI Driver, which let you leverage the cloud provider's secrets manager. That's very popular because it helps you with operations and updates: the operator pattern creates a DaemonSet or other pods that help you automate a lot of things. But at the same time, we need to be careful, because it can be very risky depending on how it's set up. A lot of problems in the cloud, as you might know, are misconfigurations. So if, because it's more convenient and faster for some teams, you just put "*" / "*" in the permissions, it can be very risky. A misconfigured policy on the operator's role can look like this: you're going to allow that operator to list all the secrets and get all the secret values from any resource. So if someone compromises that operator pod through the techniques that we showed before, they're going to have access to all the secrets available. So if a malicious attacker compromises the secrets operator, they're going to have access to all the secrets that you have in that account. So we're going to go through an example. Let's push it. We continue from the previous demo: we have the role of the node, and we can validate that we are in the system:nodes group within the cluster. Now we can look around and say: let's see what pods are available. Is there anything interesting within the cluster? And yeah, this external-secrets pod is very appealing for an attacker. Let's see if we can get something from here.
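For contrast with the wildcard policy just described, a least-privilege version of the operator's IAM policy might look like this; the region, account ID, and secret prefix are illustrative:

```json
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Sid": "ScopedSecretsAccess",
      "Effect": "Allow",
      "Action": "secretsmanager:GetSecretValue",
      "Resource": "arn:aws:secretsmanager:us-east-1:111122223333:secret:cat-classifier/*"
    }
  ]
}
```

Scoping the Action to `GetSecretValue` and the Resource to a single application prefix means a compromised operator pod can only read that application's secrets, not everything in the account.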
So now, as we did before, let's get the service account to see if there's information there. It's not available directly, but we can get it from the pod information if we output it as YAML. There we can see the specific IAM role that's assigned to that pod. So now let's create a new token that's going to carry the same role that external-secrets has. We create the new token and we store this information, and we confirm that here we have the specific service account for external-secrets. And now, thanks to that token, we can authenticate against AWS with the specific role that external-secrets has and get credentials to be able to use it. So now we export the access key ID, secret access key, and session token, and we're going to be able to leverage that to do anything that external-secrets can do in AWS. We do it in the same way as before. Now that we are authenticated against AWS, we have confirmation that our role is the external-secrets role. So, happy days. Let's get the secrets. Let's see what we can find. We see that in AWS Secrets Manager we're going to have a lot of juicy stuff, and there are common things like GitHub API keys. I think the next step for an attacker will be to say: okay, I want that GitHub secret, give me its specific value. And here, in SecretString, you can see the value. Now the attacker can go to other places and try to pull the repo contents, and probably there will be other credentials there as well. So, recapping the demo: initially, we compromised the node through a vulnerability in the application, and we got the credentials for the node via the instance metadata service. We enumerated the pods, because within the node you can see what's running. And then we created a new token for that juicy pod that presumably has a lot of permissions.
We exchanged that service account token for the permissions to interact with AWS, and we got the credentials. And you get the secret. And you get the secret. And you get the secret. So everyone is happy now, except the people whose secrets just got breached. Anyway: operators are bridging the gap between the cluster and the cloud. Some are very powerful, very useful, very convenient. But you also need to be aware that managing cloud infrastructure through operators can be risky, because for some of these operators you can define a policy that grants administrator access, and then they can do a lot of things in your account. So you need to be careful with the policies, the roles, and everything you assign to your operators. It makes your life easier in some areas, but it can be riskier in others. What can you do to minimize the impact of compromised operators? Try to assign minimal cloud privileges to them, with things like tag-based access control, where you can restrict access based on where the operators are running. You can use something like iamlive, which records the interactions of your workloads with the cloud provider API and then helps you define the minimal policy that's needed for that workload. And you also need to know how that access can be pivoted to other places. Cool. So, wrapping up; we still have a few minutes, so not too fast. If you have to listen for one minute, it's going to be now. Key takeaways. Don't hard-code cloud credentials in your workloads in Kubernetes. I think in 2023, there's no good reason to do that, because there are a lot of mechanisms that work well. Use something that's simpler and more secure; you don't have to think "I need one IAM user for production, one for dev": it's going to be managed by the platform, and you only have temporary credentials that you don't need to worry about leaking. Restrict the access that pods have to the metadata service.
Again, for AWS and Azure, you need to block it explicitly with a network policy. For GCP, just enable workload identity. And finally, when you assign permissions to workloads in the cloud, you need to be mindful of what you are doing, because it creates a pivot point. Now, that's a rare picture of myself trying to figure out: okay, I have a bunch of pods, they have service accounts, they are tagged like this, and there are a bunch of AWS roles with trust policies. So which workload can assume which role in AWS? And there are other challenges like this: you end up in a cluster and wonder, do I have any hard-coded secrets here? Where do you look? How do you make sure you don't have false positives? And how do you test whether you properly blocked access from the pods to the instance metadata service? So, just this morning, we are releasing a new tool that's called the Managed Kubernetes Auditing Toolkit. The idea is that there are a lot of tools that allow you to audit your cloud, and lots that allow you to audit Kubernetes, but I think there's a gap in the middle when you want to find cloud-specific issues in Kubernetes clusters. That's what we are trying to fill here. There are three main features, which we're going to see in a live demo, because why not, let's push it. The name is MKAT, which stands for Managed Kubernetes Auditing Toolkit. It's a single binary, you can install it easily, and it has a few features; for now, it only supports EKS. The first thing we can do is look at the role relationships, and you'll see what this does. We say: show me the role relationships that I have in my cluster, export them in Graphviz format, and render them as a PNG. When I run that, it looks at the roles in my account and the pods, and it tells me: okay, this is what you have.
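For reference, the three features can be invoked roughly like this; the subcommand names are taken from the project's README at release time and may evolve:

```shell
# MKAT is a single binary; see the project README for installation options.
mkat eks find-role-relationships   # map pods to the AWS IAM roles they can assume
mkat eks find-secrets              # AWS credentials in ConfigMaps, Secrets, pod definitions
mkat eks test-imds-access          # run a pod and check whether IMDS is reachable
```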
You have a bunch of pods here, running in this namespace, and they can assume these specific roles in AWS. It's not going to tell you exactly what you need to do, but at least it gives you a visualization of what's going on and whether you need to worry about something. For instance, here I have a bunch of pods that can assume this specific role; maybe I want to check that it's not too privileged, things like that. The second thing it's able to do is find secrets. This looks simple, but it's actually pretty hard, because you can have a lot of false positives. We try to minimize that: we show the secrets that you have in config maps, in secrets, and in pod definitions, but we show only AWS access keys, and we make sure that you have an access key and a secret key in the same resource. So if you only have something that looks like a secret key, you won't be alerted. And finally, you can also test whether you properly blocked access to the instance metadata service. It's going to run a pod, use curl inside it, and tell you: hey, be careful, IMDS is not blocked, and any pod can get credentials for the node role, which is this one. So that's the goal of this tool. A few more things. I think it's really important, when we release something, to be able to show how it compares to other tools, because we don't want to just put out some tools and have the community struggle to understand how they compare to existing ones. So we try to do this work for you; it's in the README of the tool, by the way. Go ahead and try it out. It's a single binary, you can install it with Homebrew, and there are pre-compiled binaries for Linux, Mac, and Windows as well. And there is a roadmap. The first thing we're going to try to do is visualize pods that are externally exposed through a load balancer.
The second is to add support for GKE, which by design is going to be different, and to add detection for more cloud-specific secrets. For instance, if you have a GCP service account key in your AWS secret, sorry, in your AWS cluster, I think it makes sense to flag that as well. So, thanks a lot. This is the link to the session, where you can see the slides. And please go ahead and leave some feedback. I'm interested to hear whether you liked the talk, whether it was too security-focused or too cloud-focused, whether you would have liked more multi-cloud content or more focus on AWS; we're probably going to release some blog posts on that. So it's really super helpful. Thank you.