Hello, everyone, and welcome to Hacking and Defending Kubernetes Clusters. We'll do it live. You've got myself, James Cleveley-Prance. I'm a security engineer at ControlPlane, where I lead our penetration testing services and purple teaming, deliver training workshops, and facilitate CTF events and scenarios, such as the one we run at KubeCon events like this one.

Hi, I'm Fabian. I'm a security architect, also with ControlPlane, and I'm passionate about all things automation and security, especially supply chain security over the last year. I've spoken at 6.com and at some community days and meetups in and around Berlin, where I'm also located.

So, what are we going to talk about today? We'll do a high-level overview of what threat modelling is and why it might be important to you, and then we'll go through six exploit scenarios at pace. In each scenario we'll take a brief look at the impact of the attack, map it to the controls we would use to prevent the techniques we demonstrate, and have a short discussion about those mitigation strategies. At the end, we'll summarise what we've learnt in the session.

All right. As James mentioned, let's first define the term threat model, or the exercise of threat modelling. So, what is this? In a threat modelling session, you define and address the risks of your system. Why would you want to do this? You want to find security issues before someone else does, and hopefully before the system goes into production, so you want to do this as early as possible. If you have a new system, for example, run a threat modelling session as soon as an architecture is available; even an architecture draft can be used as input here. If you have a system that is to be extended, that is usually captured by means of a user story or something similar.
So, you can use that piece of work to think about new risks being introduced into the system and how to mitigate them. Who should join these threat modelling sessions? This is a group exercise, so you should really bring everyone to the table: your developers, your SREs who know how to run this stuff in production, your security folks, and everyone with domain knowledge or special knowledge about the system. Everyone can bring insights to the table, and together you find more risks and more appropriate mitigations. This is just a quick overview; if you want to learn more about how to do threat modelling, Rowan gave a great talk at the previous KubeCon, so check out his talk for an in-depth discussion of the topic.

There are a great many resources out there to assist you in your threat modelling sessions. Today we will be using the Microsoft threat matrix for Kubernetes. It presents a lot of tactics, or entry points, that an attacker could use to gain access to your system, but it also presents the mitigations to take to prevent this from ever happening to your cluster. All six of our scenarios are based on the threat matrix, and we will reference the same mitigations it does. Of course, there are more specialised resources out there, for example a containers matrix, and the CNCF Financial User Group has worked on a set of attack trees, which are really helpful for getting into the mindset of an attacker, because it usually takes more than one vulnerability to really take over a system or access a database. You need to follow a whole path of attacks, and these attack trees visualise that very nicely.

So, scenario one will be about discovery and initial access, the same steps an attacker would take. But before we dive into the scenario, let me set the stage briefly. We will run all the scenarios on a kubeadm-based cluster, but all the steps will work on any type of Kubernetes distribution.
So, you can run it on a cloud-provider-managed Kubernetes such as GKE, for example, or on your developer laptop with MicroK8s. I really encourage you to take the same steps yourself; it's very powerful to see how an attack plays out, because you can watch your system and see which parts are affected. We've done this before, so that's fine.

So, first, nothing up my sleeve, right? If we look at our cluster, there's nothing deployed with Helm. If we look at the namespaces and the pods, we can see, as I said, that this is a kubeadm-based cluster, and I've deployed the Weave Net CNI as the networking component. Now I, as an administrator, am tasked with deploying JupyterHub. For those who don't know, JupyterHub is an application... sorry, this one's a fixed size; the next ones will be better. JupyterHub is an application for running Python programs in an interactive manner, so it's very popular with scientists, in data science, NLP, and so on. I, as an admin, just follow the JupyterHub docs, deploy the Helm chart, and have it running in no time.

So, now we step into the shoes of the attacker. Attackers continuously monitor all the routable IP addresses of the internet, and they do something called port scanning. You can do this on your own with a tool like Nmap, and here we scan ports 30000 and up, the NodePort range, and we can see that every external participant in the network can discover the open port, 31903. If we check, this is exactly the port advertised, or used, by the public proxy of the JupyterHub deployment. Using the same information, the IP and the port, we can go to a web browser and browse there, and because it's a freshly deployed application, the default credentials of course work. JupyterHub is a very powerful application, and it makes it super easy for both devs and attackers to launch a terminal directly from the browser and get execution in the pod.
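For context, a NodePort Service of the kind that exposed the proxy here might look something like this; the names are illustrative rather than the exact chart output, but the discovered port matches the demo:

```yaml
# Hypothetical sketch of a NodePort Service like the one exposing the
# JupyterHub proxy; every node in the cluster listens on the nodePort
# externally, which is what the port scan discovered.
apiVersion: v1
kind: Service
metadata:
  name: proxy-public          # illustrative name
  namespace: jupyterhub
spec:
  type: NodePort              # opens a port on every cluster node
  selector:
    app: jupyterhub-proxy     # illustrative selector
  ports:
    - port: 80
      targetPort: 8000
      nodePort: 31903         # the port Nmap found in the demo
```

Anything in the 30000–32767 NodePort range answering on a public node IP is discoverable by exactly the scan shown.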
So, getting initial access to a system is quite easy, as you can see, with just a few misconfigurations. Let's recap briefly. The admin deployed JupyterHub via the Helm chart; they enabled the NodePort service (let's face it, setting up a load balancer is kind of a pain in the butt); they of course didn't change the default user and password; and they were asked to deploy a really powerful application with lots of permissions.

So, mitigations for this. You should restrict traffic via network segmentation. Kubernetes uses a lot of ports, not just NodePorts, to manage the control plane, the backend of Kubernetes, and you should really lock those down and only make them accessible to the components and users who really need access. Secondly, as with every application, you should use strong authentication and delete default passwords and users; this is not specific to Kubernetes. Even better, use two-factor authentication. If you use a load balancer, for example, you can do this at a single point and even set up things like IP allow lists to make your system even harder against attacks. Now James will talk about persistence.

So, once our attacker has that initial access, maybe they found a vulnerability or something was exposed that shouldn't be, they're going to want to look around, see what they can do, and enumerate their environment. So for... I hope I can spell... there we go. How's that size at the back? I know I have no brightness control; I don't know if any of the tech guys can dim the lights at the front, maybe. Thank you. Thank you very much. Hopefully that's a bit better. For this scenario, we're getting our initial access through a leaked secret. So something that, by the sound of it, you're familiar with, but in this example it's leaked through a Git repository. We can see a Git commit here. This could be in a public GitHub repo, or maybe even a file in an S3 bucket; I'll leave the rest to your imagination.
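A leaked config file of the kind described might look something like the following; this is purely illustrative (the endpoint and the truncated token value are made up), but it shows why the key name alone is a useful hint:

```yaml
# Hypothetical example of a config file committed to a repo; the key
# name "token" is the clue an attacker picks up on.
environment: staging
api_endpoint: https://203.0.113.10:6443    # illustrative address
token: eyJhbGciOiJSUzI1NiIsImtpZCI6...      # truncated bearer token
```

A JWT-shaped value next to an endpoint on port 6443 strongly suggests a Kubernetes API server credential.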
But from the context, the attacker can see that the key in the config file is called token, which gives us a little bit of a hint. And through previous enumeration, the attacker knows that this environment has Kubernetes clusters running. So the attacker makes an educated guess, and here we set up the kubectl binary to use the token against the API server of the Kubernetes cluster we previously enumerated in the environment. You can see the token being passed via the CLI arg down here; we're using the IP address of the API server we previously enumerated, and ignoring the CA certificate for now. We'll set this up as an alias just to make the next commands a little easier, and issue a version command just to check basic connectivity. With the server version returned to us, we have an initial indication that at least our basic connectivity is working.

Then the attacker will likely enumerate what permissions they have. I'll have to shrink this down a little to make it fit; hopefully you can still read the gist of it at the back. Here we can see the permissions associated with this token, which are largely default. These top lines are default permissions available to tokens for querying what they can do, just as we're doing now, and these non-resource URLs are default as well, which is just discovery-based. The interesting ones are the pod, deployment, and namespace permissions, which are definitely not default and have been added to this token. The attacker is then going to use these credentials to start enumerating the cluster and trying to make the most of those privileges. Here the attacker runs a benign command, id, in a container, checking the output to confirm the command ran successfully. After that, the attacker may look to create persistence at the Kubernetes level.
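Instead of passing the token on every invocation, an attacker might drop it into a throwaway kubeconfig. A minimal sketch, with placeholder server address and token, and skipping TLS verification just as in the demo:

```yaml
# Hypothetical kubeconfig built around the stolen token; the server
# address and token value are placeholders.
apiVersion: v1
kind: Config
clusters:
  - name: target
    cluster:
      server: https://203.0.113.10:6443   # enumerated API server
      insecure-skip-tls-verify: true      # ignore the CA cert for now
users:
  - name: stolen
    user:
      token: <leaked-token>               # value from the Git commit
contexts:
  - name: target
    context:
      cluster: target
      user: stolen
current-context: target
```

With `KUBECONFIG` pointed at this file, `kubectl auth can-i --list` performs exactly the permission enumeration shown.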
So here the attacker uses a slightly different resource type, a Deployment, which affords them a level of persistence, you could say, as the workload is more likely to be recreated if for any reason it is killed off. You can see here the creation of the Deployment using our benign id command. We check that the workload is creating and that the command has run successfully, and then demonstrate that if, for whatever reason, the pod is killed, it is recreated, giving us a slightly better chance of persisting in that cluster long term.

So what have we witnessed here? Our attacker found a credential externally, not in our infrastructure but in a public-facing part of it, and has managed to execute commands and gain a foothold in our infrastructure, which may allow them to do further enumeration and maybe exploit other things in our infra. How do we mitigate this? The first option is restricting access to the API server at the network level. At the very least, we can implement an IP allow list restricting access to the API server; even better would be putting it in a private network that only we have access to. Further to that, we can restrict what attackers are able to run in our clusters by gating images through a policy framework such as OPA or Kyverno, or, if we're using private networking, the cluster might simply not have external access at all. In addition, we want to stop people running workloads with additional privileges and potentially escalating their privileges to something undesirable, so we can again use one of those policy frameworks, or something like Pod Security Standards, to restrict this. Talking of privilege escalation, we're looking at that in a little more detail in this next scenario.
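As an aside, the attacker's persistence Deployment from this scenario might be sketched roughly like this; image and names are illustrative:

```yaml
# Hypothetical sketch of the persistence Deployment: if the pod is
# deleted, the Deployment controller recreates it automatically.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: sysutil               # innocuous-looking name
spec:
  replicas: 1
  selector:
    matchLabels:
      app: sysutil
  template:
    metadata:
      labels:
        app: sysutil
    spec:
      containers:
        - name: shell
          image: busybox      # illustrative image
          # the benign id command from the demo, then stay alive
          command: ["sh", "-c", "id; sleep infinity"]
```

This is why creating a Deployment rather than a bare Pod is attractive to an attacker: killing the pod just triggers a replacement.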
So in this demo the attacker has landed in a privileged container: either the attacker has managed to create a privileged container and exec into it, or they landed in one through a vulnerability in a workload. Here they're enumerating, and the attacker has found that there are unmasked devices in the container, so they enumerate through those devices to see what they are. The attacker finds that one of the devices looks like the primary file system of a VM, so they mount that file system into the container and verify the contents of the device. Here we can see it looks like a root file system, and there's a good chance it will be the host OS. At this point the attacker may look to jump directly onto the host, which could be achieved in a number of ways, maybe a malicious cron job or backdooring .bashrc, but in this instance we're going to drop an SSH key under our control into the authorized_keys file of the root user. From here the attacker can SSH into the VM host and completely bypass containerisation.

So what have we seen here? The attacker started from that initial foothold and escalated privileges onto the VM, which may allow them further escalation or access to further resources for abuse. What can we do to prevent these types of attacks? Really, there are only a few limited use cases for privileged containers; they should absolutely not be used in day-to-day workloads. So auditing our clusters to make sure we're not using these kinds of privileges, as far as possible, is really important. The recently introduced Pod Security Standards are an excellent way to prevent these kinds of breakouts, because even the baseline level prevents a lot of mischief, and as it's built directly into Kubernetes it's quite easy to use. I'll hand back to Fabian for the next scenario. Thank you, James. So, the next scenario will be about authorised, or in this case unauthorised, access.
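The Pod Security Standards enforcement just mentioned is applied with namespace labels via the built-in Pod Security admission controller (available in recent Kubernetes versions); a sketch with an illustrative namespace name:

```yaml
# Enforce the "baseline" Pod Security Standard on a namespace; baseline
# rejects privileged containers, host namespaces, and hostPath volumes,
# blocking the kind of device-mount breakout shown above.
apiVersion: v1
kind: Namespace
metadata:
  name: workloads                                  # illustrative namespace
  labels:
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/warn: restricted    # warn on anything short of restricted
```

Because enforcement happens at admission time, a pod spec requesting `privileged: true` in this namespace is simply rejected.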
It's not a feature; that's definitely a bug. So in this scenario we're deploying just a plain old pod into the Kubernetes cluster. As you can see here, there's no magic in the spec; it's just a container image getting deployed. We're using an image with kubectl in it to interact with the Kubernetes API, but you can exchange this for any other tool that can make a network request, like curl, or roll your own. Once we have the pod scheduled and deployed, we can exec into it, and we can see that, without anything additional in our pod spec, there's a secret available in the pod at /var/run/secrets/kubernetes.io/serviceaccount: the service account token. Each pod in Kubernetes gets a service account token to talk to the Kubernetes API, and as we did in a previous scenario, we can now use the kubectl auth can-i command to list all the permissions we have with that token. It gives us quite a few permissions to query additional information about the environment, so an attacker could use this, for example, to query all the API resources we have in the cluster. On the bottom right-hand side we can see, for example, that the CRDs for Tekton are available in the cluster, which would give an attacker the idea to look for known vulnerabilities in Tekton and leverage those to continue their journey through our cluster. It's the same for the API versions; we can query those with the token as well. We are not allowed to query pods or secrets, not even in the same namespace, so it's somewhat limited, but it still has some value to an attacker. So what can we do about this?
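One concrete answer is to opt out of the token mount directly in the pod spec; a minimal sketch, with illustrative pod and image names:

```yaml
# A pod that opts out of the automatic service-account token mount.
# automountServiceAccountToken defaults to true, so it must be set
# explicitly for pods that never talk to the Kubernetes API.
apiVersion: v1
kind: Pod
metadata:
  name: no-token-demo          # illustrative name
spec:
  automountServiceAccountToken: false   # nothing mounted under /var/run/secrets/...
  containers:
    - name: app
      image: busybox           # illustrative image
      command: ["sleep", "infinity"]
```

The same field can also be set on the ServiceAccount object itself to change the default for every pod using that account.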
There is an additional field in the pod spec we can set, called automountServiceAccountToken, which is true by default, which is why we get the service account token. We should definitely set this to false for every pod that does not need to talk to the Kubernetes API. As we can see, when we exec into the container again, there is no longer a service account token for us to use, and therefore we no longer have access to that additional information.

Now the other way around: we have an application that actually needs to talk to the Kubernetes API. Instead of using that default service account, or extending it, we should really create our own service account. To do that, we create a ServiceAccount resource for our machine user, we create a Role that is allowed to get and list secrets in the default namespace, and we attach those two resources together with a RoleBinding. In the pod spec we can now specify the service account name, and this service account will be made available inside our pod. Again, we exec into the pod, and with kubectl, for example, we can see that we can now query the secret, but we can also list all the other secrets in the namespace, which may or may not have been intended; we see here, for example, a credential available in the same namespace which we can now read out. So let's recap briefly. Every pod gets this default service account, and we should really disable that via the config field shown. And we should again apply the principle of least privilege: even when we define custom roles for our services to use, make sure to really lock them down as much as possible. You can not only limit them to the namespace, but also to specific resources and verbs, so pay attention to all of those things and grant only the required permissions.

The next scenario is about a poisoned image in the container registry, and for that scenario, you'll like the GIFs a lot more; they're nicer to look at. In demo five we can now use the Docker credentials we discovered in the previous scenario to attack the container registry of our Kubernetes cluster. We can see here, for example, that we have a deployment in our Kubernetes cluster called awesome-cats, which is hosting some cat pictures; seems like the perfect target for me. We configure the Docker credentials we found in the previous step and use a tool such as crane, which is very powerful for doing unusual operations on container registries. In a standard scenario you can at least list the versions of an image, and we can also fetch the digest. We can see here that there are four versions of this container image available in the registry, and latest is always a nice target for an attacker, as we'll see in a second. So we grab the digest of that latest image, which is like the fingerprint of the image, and we use another Dockerfile to extend the image we have discovered: we start from that image, then we inject whatever code we want, hidden in a script, and we extend the entrypoint to execute our script before the previous entrypoint is called again, so we don't raise any suspicion by breaking their service. After building and pushing that image, we refetch the digest and can see that it has really changed: this was 8.8 before, now it is 3.0, and there is a different image behind that latest tag, which we just overwrote. What you might expect is that our code is now running on Kubernetes, but that's not the case; no command has been executed. This is because Kubernetes does not watch the tags in the registry; it only downloads the image if necessary. So we as an attacker need to force the pod to get redeployed. You can usually find plenty of vulnerabilities in web servers that let you just crash them, so the pod dies, gets rescheduled, and the image gets downloaded again. I will simulate this by just scaling down, but you can easily find malicious payloads to crash a lot of web servers. Once the rollout is successful and the pod
is rescheduled, we can query the service again. The service works fine, no one is checking or double-checking, but we can see that our code is now executed and we can do whatever we want in the system.

All right, so what have we witnessed? Leaked container registry credentials. The lesson learned here is that we really need to protect the cluster's external dependencies as well. Usually you have container registries, but also config repositories if you're doing GitOps, deployment pipelines, and secret managers, and all of these have power over our cluster; they can be abused by an attacker to gain some sort of control over what is happening inside it. The specific mitigation for this attack is to pin by hash. Any kind of tag that can be overwritten by an attacker is a nice target, but if an attacker adds code to your container image, the hash will change. So if you pin by hash, the attacker now has two problems: they need to attack the image, but they also need to attack the deployment file and change the reference in there to get their malicious code deployed, which makes it even harder for them. Going one step further, we can also use container signing, so that we only deploy images that are signed by us. If we do that, the attacker also needs to gain access to the private key and sign the image before it can even get scheduled in Kubernetes. And lastly, we should really use short-lived secrets. We hear this a lot, but if the credential for the Docker registry is long-lived, the attacker controls this attack vector for a very, very long time. If we make those credentials short-lived, the attacker is really on a timer to get the attack out, and maybe we can lock them out before they get to the next step. All right, scenario six is about evasion, and James will speak about that. So, once we've gone, or the attacker has gone, to all this effort to get access to our clusters and maintain their existence there, they may want to avoid being detected by the administrators, so we'll
have a look at that in this scenario. At the Kubernetes level, one thing we're given is a resource called Events, which works like any other resource but is a kind of meta-resource implemented by Kubernetes to give us information about what's happening in the control plane and in the cluster. Normally this holds a whole load of information about what's happening. For example, we're looking here at a fairly vanilla cluster, but even then there's a lot of information: information about kube-proxy, some control plane components (here, as I said, we're in a kubeadm-based cluster), some Calico stuff (that's the CNI we're using), a whole lot of information. And if we're a malicious actor, there will be some information in here about the workloads we've been creating and what we've been doing to the cluster, which is potentially undesirable. So what we can do is just delete the information, and as shown, it's all gone; it's as easy as that. Now, this is a little bit suspicious. I don't know about you, but if I came to my cluster tomorrow and saw no events whatsoever, I'd think something very fishy had happened. A more skilled attacker may go through and selectively delete events, but at the Kubernetes level this would be a great way to start avoiding detection by administrative and security teams.

One of the issues here is that this attack is a little distribution-dependent. If you're running with a cloud provider, it's more than likely that your events are shipped off to the cloud provider's logging solution, so deleting event logs in the cluster won't delete them in the logging solution, and the attack isn't quite as useful in that scenario. But it's definitely worth checking, because it depends on the setup: if you're with a provider that manages the control plane for you, they'll have made a decision about how they handle logs, so which provider or distribution you use determines what they're doing there and how effective this would be. Again, looking at the level of privilege an attacker needs to do this: there are very few scenarios in which a user needs to delete events. That should really be restricted to something like cluster admin; I'm really struggling to think of any legitimate use cases for it. So, yes, reviewing these permissions to make sure they're not unnecessarily distributed is key.

Looking back at all these scenarios, what have we understood today? Hopefully some of these scenarios have helped you understand why some of the security best practices exist and what the impact of some of these exploits is. Hopefully you'll be able to take some of the resources we've linked to today away from this session, and they'll help you in some of your threat modelling exercises going forward. And hopefully we've communicated some of the importance of threat modelling, to help you contextualise and prioritise some of the risks you or your org may encounter. We realise this was quite a rapid run through a lot of topics, so if you have any questions about threat modelling or any of the exploits or scenarios we've run through, please feel free to ask a question, or drop by at the end of the session and ask us more; we're more than happy to chat about any of it with you. Thank you. We're running a couple of threat modelling sessions for those who are interested, 20-minute sessions as a company, and you're more than welcome if you've got any organisational problems, although I think there are only a couple left this afternoon. So yes, come and chat afterwards. Also, if you're interested, we have a booth in hall 5 and jobs available. Thank you.