Welcome to my talk today. This is Tweezering Kubernetes Resources: Operating on Operators. So what are we going to talk about today? Primarily Kubernetes operator security. I'm going to briefly go over a little bit about Kubernetes operators and talk about security. I'm also going to talk about what an operator introduces into Kubernetes, how that can be abused by an attacker, and how I can review that from a security perspective to see whether there's any risk associated with it. And then lastly, how can I detect operator abuse?

So I'm Kevin Ward. I'm from ControlPlane. I'm a senior security engineer. I've got over a decade of experience working in the security domain doing all sorts of things, from threat modelling to security architecture. I also do pen testing where I can. My mantra is that I like to harden by day, because primarily I'm working on securing stuff, and then by night I like to sit in the dark and hack stuff and see how I can exploit it.

So essentially, a Kubernetes operator is an extension of the Kubernetes API with some operational knowledge. That operational knowledge is provided by custom resource definitions, which extend the capabilities of Kubernetes beyond its initial primitives. There's a controller that reconciles those resources for you, utilising all the magic that Kubernetes has. A common use case for operators is managing stateful applications. So you might have a database that you wish to update. You may want to drain the data out of it. You may want to rehydrate it. This is usually a very good pattern for keeping it in sync with your cluster.

There are some operator tools that exist out there, primarily through the Operator Framework, which offers several things that you can do. The Operator SDK provides a scaffold which automatically generates code for you. You can bootstrap a project really nice and fast. So it's super useful for developers.
The Operator Lifecycle Manager (OLM) essentially installs, updates and manages your operator for your cluster. It does so via an operator bundle, which consists of a cluster service version, which contains pretty much everything an operator needs, bundled up with the custom resource definitions. Interestingly enough, OLM itself consists of two operators. So there's an OLM operator, and there's also a catalogue operator. Last on the list is OperatorHub, which is essentially just a community hub where you can submit and share your operators.

So what does an operator introduce? Well, we've talked about custom resource definitions and what they do, but they can basically be anything that you would want them to be to provide the operational knowledge for your cluster. An operator also introduces a custom controller, which essentially is just a container that normally gets deployed in a dedicated namespace with a dedicated service account, but this isn't mandatory. So you can leverage existing resources if you wish to, which is a bit scary, and we'll go through that a little bit later. Alongside all of that, depending on what you would like to do with your operator and how it's going to work, it could also introduce further Kubernetes resources and reconcile those, but also extend out of your cluster and actually start configuring cloud resources, which is quite scary stuff. Lastly on the list, it usually introduces some logging and metrics as well, just to give you some observability.

So what can go wrong? Well, from a key threats perspective: service account permissions. Operators are usually trying to do administrative tasks, and to do so, as you probably know, they have highly permissive service accounts. If the operator gets compromised, an attacker can leverage those permissions to do quite a lot of stuff, getting close to cluster admin.
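To make the service account threat concrete, here is a minimal sketch of how you might flag overly permissive RBAC rules in a role that an operator's service account is bound to. The rule fields (`apiGroups`, `resources`, `verbs`) follow the Kubernetes RBAC rule shape; the function names, the example role and the choice of which verbs and resources count as sensitive are assumptions made for illustration, not something taken from a specific tool.

```python
# Hypothetical sketch: flag overly permissive rules in a Role or ClusterRole.
# The findings and their wording are illustrative assumptions only.

# Assumption: these verbs are treated as high risk for this example.
SENSITIVE_VERBS = {"escalate", "bind", "impersonate"}


def flag_rule(rule):
    """Return a list of findings for a single RBAC rule dict."""
    findings = []
    api_groups = rule.get("apiGroups", [])
    resources = rule.get("resources", [])
    verbs = rule.get("verbs", [])

    if "*" in api_groups and "*" in resources and "*" in verbs:
        # Every API group, every resource, every verb.
        findings.append("full wildcard rule")
    else:
        if "*" in verbs:
            findings.append("wildcard verbs")
        if "*" in resources:
            findings.append("wildcard resources")
    for verb in sorted(SENSITIVE_VERBS & set(verbs)):
        findings.append(f"sensitive verb: {verb}")
    if "secrets" in resources:
        findings.append("access to secrets")
    if "pods/exec" in resources:
        findings.append("can exec into pods")
    return findings


def flag_role(role):
    """Collect findings for every rule in a role manifest (as a dict)."""
    findings = []
    for rule in role.get("rules", []):
        findings.extend(flag_rule(rule))
    return findings


# Hypothetical ClusterRole, written as a dict rather than YAML so the
# example needs nothing beyond the standard library.
example_role = {
    "kind": "ClusterRole",
    "metadata": {"name": "example-operator"},
    "rules": [
        {"apiGroups": ["*"], "resources": ["*"], "verbs": ["*"]},
        {"apiGroups": [""], "resources": ["secrets"], "verbs": ["get", "list"]},
        {
            "apiGroups": ["rbac.authorization.k8s.io"],
            "resources": ["clusterroles"],
            "verbs": ["bind"],
        },
    ],
}

if __name__ == "__main__":
    for finding in flag_role(example_role):
        print(finding)
```

A check like this only tells you what the role allows on paper; whether it is actually dangerous depends on what the role is bound to and at what scope.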
On the deployment side, the controller itself can be a privileged container as well, just so it can access and reach the resources that it needs to. It is also just as susceptible as any other container or image to vulnerabilities in the dependencies that reside within it. But I think the biggest thing from a workload perspective in Kubernetes is the scope of an operator. It could be bound to a specific namespace where you want it to reconcile your resources, or it can go completely cluster-wide, or multi-cluster if you really want it to, which again extends those permissions. But even further still, it can be externally bound, so it can start reconciling those kinds of resources too. Any of those permission sets, if the operator gets compromised, can really take an attacker to the next level.

So one of the things that I worked on was to try and understand the attack path through. So I decided to do a mini ATT&CK-like matrix for operators. This is now live on our repo. If you want to have a look at it, I've provided all the verbiage assigned to it. I think significantly, there is very little initial access for an operator. It's not normal for an operator to be public-facing. So normally, to actually get into an operator, you would try to compromise the image in the repository, or you would compromise some credentials, whether that be some cloud credentials that you've managed to get, or the kubeconfig file that might reside on a developer's laptop.

Alongside some of those threats, I've also included stuff around OLM. So OLM has a mechanism to automatically install an operator as part of its updates. Now, you don't have to validate that, so it could be anything. So if you're not in control of how that gets installed, you could just install something malicious without validation.
You could also manipulate the catalogue to point to an operator image repository that it's not supposed to. So there are little things that you need to bear in mind when you're looking at it. The items listed in bold are essentially the cloud resources, so stuff that is externally scoped. So if you're doing a risk analysis and you want to look through this for an attack path, and your operator isn't externally bound or accessing those resources, you can basically remove those from the threat matrix to get a better view of what your risks are.

So what does a common path look like? Well, an adversary could steal some credentials and attain cluster access. From there, they could enumerate the pods. Maybe they notice that there's an operator in there. They could potentially exec into it if they're able to. And then maybe from there they want to start enumerating what the service account can do: maybe in the namespace it's already in, or, if it's in a dedicated one, maybe in the kube-system namespace. From there, they could leverage the cluster role bindings, if it has those, to deploy another malicious container, in this instance into kube-system. And then, bit by bit, they can start taking over other resources with a privileged container. Granted, there are a lot of ifs and buts, and other controls that you can apply within there.

A more stealthy attack path that you could take is actually replacing some of the code for an operator within the image, inside a registry. So that could be an internal registry, for the disgruntled employee. It could be a supply chain type of attack. Now, the functionality of the operator will maybe remain the same. But what they could do then is install a malicious sidecar. That sidecar will then be deployed alongside other containers and intercept a load of the requests that come through, to gather potentially sensitive information.
They can then potentially exfiltrate that data to an adversary-controlled cloud account.

There are several CVEs that you can find around operators if you look for them. I think significantly, they're not always just about the operator, though. There are things deployed alongside the operator that affect it. You can see that with, for instance, the Capsule proxy. Essentially the vulnerability is within there, because the proxy is running as cluster admin. It then gave you access further on to the operator and so forth. But yeah, it just gives you an idea that operators themselves are also vulnerable. They're not immune to this.

So how bad is it? Well, I decided that I would download and have a look at every single operator that exists on OperatorHub and assess it for what I would consider key threats. So what are the service account permissions? What security contexts have they got applied to the deployment? I also looked at potentially sensitive cluster role bindings, cluster admin, and then whether they're deploying in a separate namespace.

Now this is for security contexts. Red is bad. Red is basically saying that that specific security context has not been applied to an operator. Yeah, it's pretty damning. And I'll go through the overall statistics of how many have not been set. The blue lines are obviously that they are set, but the purple lines are actually the opposite of what you might want. So where you would want allow privilege escalation set to false, it has been set to true. That is quite scary, but at the same time, sometimes operators are doing some quite permissive things. So for instance, an operator might be accessing and setting up a container storage interface.

So next up, cluster role permissions. Yeah, there were a lot that had cluster role permissions.
You can see that there are a few that had, I would say, cluster admin or cluster admin equivalent. That is, access to every single resource and every single Kubernetes API. Alongside that, surprisingly, about 10% across the board had bind, escalate or impersonate.

So the breakdown of this essentially is that 90% had a dedicated namespace. Really cool. That's excellent. The only issue is that 84% had cluster roles. So it almost defeats the point of having that dedicated namespace somewhat. 64% had access to secrets, and 70% were able to exec into pods. 58% didn't use a security context, and only 10% of the ones that were listed on there were dropping capabilities. For what it's worth, I had a quick look at the ones that only had roles defined for them as well. 95% of them had access to secrets, and 72% of them could exec into pods.

So how do we start securing an operator? Well, the CNCF working group for operators has produced a really lovely paper. If you haven't read it, I would highly recommend it. It goes through pretty much everything you would need to know about operators. It goes through the breakdown of what an operator is. It talks through control loops, it talks through controllers and custom controllers and what they mean. It also talks about common patterns and the operator frameworks, and obviously security as well to some degree. Google Cloud has also got an excellent blog post about the best practices that you would follow for operators. That's things like using a single controller per application, using declarative APIs where you can, and it also gives some advice about how to set up logging and monitoring for those applications.

But essentially, what does the CNCF say for security? Well, the first thing, and it's quite a big drive in there, is about transparency and documentation.
When you're developing an operator, it's essential that you define exactly what the thing is doing and what you're trying to achieve. That way you can start breaking down what permission sets you need, whether it needs to be privileged, how the thing needs to run, whether it needs to be externally scoped and so forth. That's a very important thing for an operator developer to do.

Beyond that, as I just stated, start defining that scope. They talk about cluster-wide, external, and namespace. So cluster-wide, you have access to all the resources that you need; external, cloud resources; and then namespace. They basically say restrict RBAC permissions as much as possible, and certainly only use cluster roles if absolutely necessary. Now, we've seen from OperatorHub, which admittedly doesn't contain all the operators on the planet, that this is not what happens: quite a lot are being given cluster role permissions. The other thing to bear in mind is that if your operator does have access to external resources, there will be some cloud IAM associated with it, and you're going to need to review that as well. They state that you should leverage SELinux, AppArmor and seccomp profiles, which I think is quite well known across the industry, and, as always, scan for vulnerabilities and consider supply chain security. I'm not going to go anywhere near that. There are probably about a dozen talks this year about supply chain security.

From a prevention strategy perspective, I decided to start playing with and trying to build an operator. One of the things that was quite surprising to me, based on the research I've just shown, was that as of version 1.18.1, the Operator SDK gives you two security contexts out of the box as part of the manifest. If you see a developer who's removed those, you have to question why somewhat. It may be bound to the way that the thing works, but I think you should really question that. My advice is just be explicit.
Be very careful and be very explicit about the verbs you want to use, the resources you want to access, and the APIs you want to access. Try not to do star, star, star permissions. Try not to do core API, star, star. That is just being quite lazy. The only issue is that an operator might actually require that as part of its work, but I do question that somewhat, especially when you get to the RBAC API, because that's where, if you apply a star, you're starting to grant the escalate and bind privileges.

Last bit of advice is to start working with the developers to scope out your operator. Start restricting the operator to the namespaces that it's supposed to have: not only the namespace it's supposed to be deployed to, but also what it can watch as well. They're slightly different. Review the cluster role permissions and make sure they're not overly permissive, do the same for cloud IAM, and make sure that a role, rather than a cluster role, is defined if you've only got a namespaced operator.

So wouldn't it be nice if we could do this in an automated fashion? Well, ControlPlane have made a static analyser for operator manifests. This is based on a lot of the risk and threat modelling that I've been doing. At the moment it's primarily focused on security contexts, cluster role permissions, and initial namespaces. When I say initial namespaces, to be clear, that's default and kube-system, and whether the operator has just been deployed into there.

It's demo time. Right. So let's bring that up. So that is very zoomed. Can everyone read that? Okay. So this is BadRobot, and within BadRobot you'll find all the rulesets and the rationale behind them. So straight off the bat you'll start seeing things like whether it's been deployed into or is using default namespaces, whether it's got a security context set, and whether the security contexts are actually configured the other way round.
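The kind of checks just described, initial namespaces and security contexts, can be sketched roughly as follows. This is a minimal illustration and not BadRobot's actual implementation or ruleset: the function name, the finding messages and the example manifest are all hypothetical, and the manifest is written as a Python dict so the example needs nothing beyond the standard library.

```python
# Hypothetical sketch of static checks on an operator Deployment manifest.
# Not BadRobot's real rules; names and messages are assumptions.

INITIAL_NAMESPACES = {"default", "kube-system"}


def check_deployment(manifest):
    """Return findings about namespace and container security contexts."""
    findings = []

    # Assumption: a manifest with no namespace is treated as landing in "default".
    namespace = manifest.get("metadata", {}).get("namespace", "default")
    if namespace in INITIAL_NAMESPACES:
        findings.append(f"deployed into initial namespace: {namespace}")

    pod_spec = manifest.get("spec", {}).get("template", {}).get("spec", {})
    for container in pod_spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext")
        if not ctx:
            findings.append(f"{name}: no securityContext set")
            continue
        # Settings configured "the other way round" from what you would want.
        if ctx.get("privileged") is True:
            findings.append(f"{name}: privileged is true")
        if ctx.get("allowPrivilegeEscalation") is True:
            findings.append(f"{name}: allowPrivilegeEscalation is true")
        if ctx.get("runAsNonRoot") is False:
            findings.append(f"{name}: runAsNonRoot is false")
    return findings


# Hypothetical operator Deployment for illustration.
example_deployment = {
    "kind": "Deployment",
    "metadata": {"name": "example-operator", "namespace": "kube-system"},
    "spec": {
        "template": {
            "spec": {
                "containers": [
                    {
                        "name": "manager",
                        "image": "example/operator:latest",
                        "securityContext": {"allowPrivilegeEscalation": True},
                    },
                    {"name": "proxy", "image": "example/proxy:latest"},
                ]
            }
        }
    },
}

if __name__ == "__main__":
    for finding in check_deployment(example_deployment):
        print(finding)
```

In a pipeline you would decide which findings fail the build and which are only warnings; that policy decision is separate from the checks themselves.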
And these next ones are probably the primary three things I would recommend potentially blocking a build on. So runs as cluster admin: nah. You should be using a dedicated cluster role and defining all the permissions for it. Cluster role with full permissions is basically the same. Well, it kind of isn't. But you do need to make sure that you block that as well and ask the developer, whoever is building this, to redo it. And I believe the same for the core API resources, just because they're so sensitive.

Beyond that, I look for things like whether it has access to cluster roles and cluster role bindings in their entirety, secrets, whether it can exec into pods. What else have we got in here? Removing events. Custom resources is a really interesting one. You would imagine that it probably needs custom resources, but you should be declaring the custom resources that it should have access to, not every single custom resource across the entire cluster. I think that's a bit scary for me.

So let's just do a quick demo, if I can bring my terminal up. It's pretty simple. You just do badrobot scan. I've taken three manifests that are based on manifests from OperatorHub. I have anonymised who they are, because I don't want to cause problems with people and start having fingers pointed at me. But if we look at, for instance, demo one, wow, that is quite big. Maybe we need some colour as well.

Right. So starting from the top, we've got some scoring here, but it basically says that no security context has been set. We've got impersonate defined within the cluster role. We have full access to all the admission controllers and secrets, and the score designates what I believe the severity to be. So some things around events and modifying logs are kind of low.

Let's try the next one. What have we got here? So in this one, it's actually found that it's been deployed into kube-system. We've also got some stuff up here.
Yep, kube-system, no security context set. And yep, we've got access to secrets, and to cluster roles and cluster role bindings.

So for the final one. I don't want to do that. What have we got? Bear in mind, these are based on actual OperatorHub ones. I just want to point that out. We have now got no security context set, it's running with cluster admin, I think somewhere. Yeah, it's in the default namespace, and it's got a star cluster role. So pretty scary.

So what about other threats? Because BadRobot can only go so far, right? So what it's not doing is looking at malicious operator code, i.e. code that's been modified by an adversary who is hiding something within it. It's not looking at the repository in which an operator lives, whether it's public and whether it's been taken over. It's not looking at whether binaries have been included that make the image non-minimal. So things like a shell: I'd be questioning whether there should be a shell in it at all. But there are some that have one. What else could you think about? Yeah, whether you pull an operator in internally, and then maybe it's modified locally, maybe to make it more permissive, and then just deployed. You need some checks to consider around that. So you have some initial trust because it's come from a reputable source, but then someone in between has reconfigured it.

So what I'm getting to is that we need an operator pipeline. I wouldn't say that this is the de facto one, because there are so many different things you can now do in a pipeline, including SBOMs and so forth. But the idea is that you provide developers with some RBAC guidance. You should review the operator code itself, just to make sure it's doing what it's supposed to be doing. Maybe then look at using something like BadRobot to scan the manifest, and make sure you do some vulnerability scanning.
There are loads of tools that can do that against your container. And then finally, really validate that container in a test environment to make sure it's doing what you expect it to do.

So what have we got for a detection strategy? Now, the way I view it is that an operator is essentially an automated run book. So as part of that, you know exactly what processes and procedures it's going to run. They're going to generate some events. And so you can have a set of logs and events to look for, and when you get deviations away from what that operator is supposed to do, that's when you might want to investigate and flag. Now that's not perfect, because the operator itself might be updated. So you need to make sure that when the operator is updated, the functionality hasn't deviated so much that you get a load of false positives. It also could just be doing bad stuff anyway, so you might not detect it. So for instance, deploying resources that are potentially vulnerable, or deploying misconfigured resources as well, maybe making them publicly accessible or something along those lines.

So for the common attack path that I talked about earlier, I considered what a cloud provider would give you out of the box. Would it be able to detect those different things that are occurring? Now the Kubernetes API event logs are your friend at this point. And essentially, when I issued an exec, you can see that it found that I was doing that event nicely down there. But when I pulled down kubectl and put it in the bin directory, it didn't detect it. I just want to add as well, by the way, it's running as root. And this is an actual operator that I pulled off OperatorHub and deployed. It's not detecting me doing a can-i against the kube-system namespace. So that's not detected. And as you can see, it found some quite permissive settings across the board.
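The run-book idea can be sketched as a simple baseline comparison over Kubernetes audit log events. This is a rough illustration under stated assumptions, not a detection product: the field names (`verb`, `user.username`, `objectRef.resource`, `objectRef.subresource`) follow the Kubernetes audit event format, while the baseline set, the service account name and the function names are hypothetical. It only looks at actions performed with the operator's own service account identity.

```python
import json

# Hypothetical baseline: the (verb, resource) pairs that this operator's
# "run book" is expected to produce. In practice you would derive this from
# the operator's documented behaviour and revisit it on every update.
EXPECTED_ACTIONS = {
    ("get", "deployments"),
    ("list", "deployments"),
    ("watch", "deployments"),
    ("update", "deployments"),
}


def resource_of(event):
    """Build 'resource' or 'resource/subresource' from an audit event."""
    ref = event.get("objectRef", {})
    resource = ref.get("resource", "")
    subresource = ref.get("subresource")
    return f"{resource}/{subresource}" if subresource else resource


def find_deviations(audit_lines, service_account):
    """Return (verb, resource) pairs by the service account outside the baseline."""
    deviations = []
    for line in audit_lines:
        event = json.loads(line)
        if event.get("user", {}).get("username") != service_account:
            continue
        action = (event.get("verb", ""), resource_of(event))
        if action not in EXPECTED_ACTIONS:
            deviations.append(action)
    return deviations


# Hypothetical service account, in the system:serviceaccount:<ns>:<name> form.
OPERATOR_SA = "system:serviceaccount:operators:example-operator"

# Hand-written sample events for illustration; not real log output.
sample_audit_lines = [
    json.dumps({"verb": "list", "user": {"username": OPERATOR_SA},
                "objectRef": {"resource": "deployments"}}),
    json.dumps({"verb": "create", "user": {"username": OPERATOR_SA},
                "objectRef": {"resource": "selfsubjectaccessreviews"}}),
    json.dumps({"verb": "create", "user": {"username": OPERATOR_SA},
                "objectRef": {"resource": "pods", "namespace": "kube-system"}}),
    json.dumps({"verb": "create", "user": {"username": "someone-else"},
                "objectRef": {"resource": "pods", "subresource": "exec"}}),
]

if __name__ == "__main__":
    for verb, resource in find_deviations(sample_audit_lines, OPERATOR_SA):
        print(f"unexpected action by operator service account: {verb} {resource}")
```

Whether events such as access reviews appear at all depends on the audit policy in place, which is one reason gaps like the ones described here can exist.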
The cloud provider's logging will, however, detect me downloading and trying to deploy another malicious container, and it will also know when I've actually tried to pivot to it. So there are a few gaps. So we probably want to think about having some enhancements to our logs.

Now unfortunately, there is only some stuff that the clouds provide. I have not seen too much from AWS. GuardDuty does fairly limited stuff. GCP have now enabled something called Container Threat Detection. That will alert when a binary has been added or executed, when a malicious script has been executed, or when there's been a reverse shell. Recently, Microsoft have released Defender for Containers. I'll be honest, I haven't had a chance to play with it quite yet. But it boasts detection of abnormal Kubernetes service account operations and of commands within a container running with high privileges, and it also detects suspicious file downloads. Beyond all of that, we obviously know that there are third-party solutions that can help us along the way, such as Sysdig, Aqua Security, and obviously Twistlock, or formerly Twistlock.

So what does the future hold for operators? When you build an operator and upload it to OperatorHub, the operator goes through essentially a scorecard, and this shouldn't be mixed up with the OpenSSF Scorecard. It's primarily just acceptance testing, to make sure the operator works. There is a section where you can write some custom tests. And what I would like to see is some notional security tests run alongside that before you can make entry into OperatorHub, just so that by default some really bad settings are not set.

The operator white paper talks about work being done around dynamic access for operators. And that is essentially privilege escalation, sorry, elevating privileges to perform sensitive operations. So you can imagine they want to maybe set up the custom resource that they need to.
When they're performing the reconciliation and watching, it might lower those permission sets to just do that function. When you have that type of control, you can then start applying a policy engine, where you can see what permission sets it's trying to use and whether it's trying to access other resources that it's not supposed to.

But one of the things I quite like the sound of is a very old security technique, which is anomaly-based detection. The operators are basically just doing this run book that I've talked about. It would be really nice if we could build and train a detection engine to determine whether an operator is doing the things that it's supposed to do. The caveat is that you need to know all the functionality of that operator and any deviations from it. And that goes for the updates to it as well.

Basically, in conclusion, operators can do just as much damage as any other Kubernetes resource. When you're building an operator, you should really define the core functionality of that operator. Make sure that the operator is only doing the things that it's supposed to do, and restrict it down to that. You should review the operator for scope and permissions, just to make sure it's bound to what it's supposed to have. I personally think you should block deployments for default operator permissions, as I've indicated. Make sure you apply your Linux security modules where possible, profile your operator with logs and metrics, and alert on any deviations.

Thank you. I realise I'm way within time. If anyone's got any questions they want to ask, or if anyone wants to talk to me or ask anything, please do. I believe so.

My question is, do you think that the security issues within operators that are introduced when building your operator are due to operator frameworks, which basically people use as templating to build their own operator?
For example, last year, I built a Crossplane provider, which works similarly to an operator. We built that provider based on the existing providers. It was very difficult to understand what was going on, where things were supposed to be, how it works, so we just made it work based on existing templates. Do you think that's one of the main issues, and do you think that other people are working on that?

It's a very good question. My feeling is that the Operator Framework is really good, because it actually does provide some level of guidance for you. Does it do everything out of the box? I mean, when you bootstrap from the Operator SDK, there is no service account there and there's no guidance for that. Potentially there should maybe be some more, to help people along the way with what to do. Maybe that's the next thing, but I don't want to force it. Essentially, what the Operator Framework is doing is excellent, and they're doing it to a level that enables people to just download it and start building operators. If you start having a bit more of an opinion over those operators, you can start restricting people beyond what they're trying to achieve. It's a hard balance between how much you actually restrict someone and them feeling that they can't achieve what they've got to do because they've got to fulfil certain criteria. So it's a hard balance, but I think maybe more guidance around service accounts is needed. That's the best way I can describe it, and how operators should be built. There's a lot of scaffolding that the Operator SDK gives you, so it's good in that sense. It gives you what you need, but from a security perspective, you really need to think about the design of what it's going to do, and maybe even apply some threat modelling to understand the what-ifs. What if I start applying these permissions, what can basically go wrong? That's usually the best way to start weeding out the details.
Yeah, great talk. I've got a couple of questions, actually. One of them: you mentioned dropping permissions, and I think that's a really interesting concept. Do you know of any ways that that can be done today? Because I don't think that's something simple. You either need another operator with more permissions to manage a set of operators, and then you've got an NGINX-style worker process thing. I see you smiling at it.

Yeah, it really is excellent. It's the kind of thing I think we need to start moving towards, and maybe even beyond operators, but the white paper is quite vague about it, and I tried to do a load of research. I was like, well, what is this research that's going on, and I couldn't find anything. It's obviously something going on behind the scenes, potentially with the Operator Framework, or something that they're looking at. But yeah, it is going to be challenging to do, because you need essentially an entire system to start managing all of that. And then you probably want to run some policy against that as well, just to validate and make sure it's doing what it's supposed to do. So it's really tricky, but I'll be honest with you, this was just based on the white paper saying that work is being done. And I mean, they can't publish that without having some kind of ratification, I would assume. But yeah, it definitely is for me the next stage of what we can do to start tightening these things down.

And then my second one is about the analysis that you are doing there, actually looking at the far-reaching permissions that some have and so on. I think that's super valuable. I think when it comes to people looking at something like OperatorHub or anywhere else, the visibility of that kind of information is lacking, and I'm not specifically pointing at OperatorHub here.
I mean, universally, it's not particularly clear to an end user what permissions you're granting. Do you know of anything, either work ongoing or tooling, and I guess BadRobot is one bit of tooling for it, that makes this a bit more prominent, highlighting the risks to people as they go to install things, or before they do?

Yeah. So one of the things that we were having a conundrum over was whether we want to assess the permission sets of something that's deployed, because that's what is actual, as against something that's in a manifest, which is what could be. The issue that we have is that we have some clients who actually can't even deploy without having some level of security assessment done. So that's why we went to the manifest side of it. There is more stuff being done in this field. I think SIG Security are going to put out some more guidelines around RBAC permissions, and specifically the ones that are really dangerous to use. But yeah, for me, it's quite similar to anything in Kubernetes. Only that operators are hugely permissive and they can do so much damage.

Yeah, exactly. Thank you.

All right. Any more?

Hey, so at the end of the day, an operator is really just another controller. So all the concerns you have here, aren't they equally applicable to the core Kubernetes controllers themselves? Do you have any insight into how locked down those are or aren't?

I can't say that I've investigated too heavily into the existing controllers. I know that they're locked down a lot more than these controllers are. With an operator, you're free to configure the controller to do whatever you want, whereas I would say a Kubernetes controller is bound to Kubernetes primitives and what it needs to reconcile and work on.
You are right in saying that it's just another controller, but it's another controller that you put in, that you set the configuration for, and you own the image that goes into it and things like that. Especially from a cloud provider perspective, they can just automatically take care of your control plane and all the components within it. Whilst with this, you're deploying it into a namespace within your cluster, and that is then usually accessible. And you can basically do some naughty stuff with it.

Okay, cool. Thanks.

No worries.

Thanks for that talk. You mentioned a lot about the permissions that operators have, and the example you pointed out was specifically reading and writing secrets in a lot of namespaces. And I agree that sounds very scary in the beginning. But if you think about an operator, an operator that provides Postgres as a managed application will probably generate credentials for the database for you so it's usable, store them in a secret, and also need to be able to read it. So it's kind of a regular use case, right? And you kind of have to give the operator permissions in a lot of namespaces, because you don't know where it's going to be used, right? If you have a multi-tenant platform, you can't just list the three namespaces ahead of time; you will always be adding stuff to it. So I'm wondering, have you maybe considered that this is potentially an area where we need to rethink the RBAC system in Kubernetes itself a little bit, so it has a concept of identity? In the Linux file system, I can write to many directories, but just because I can write a file in a directory, that doesn't mean I can read everyone else's files; I can just read the stuff that I have created. So I'm wondering, is this maybe a point where we need to extend the RBAC system a little bit to be more granular? Because right now there aren't a lot of options.
You can have RBAC on a particular object by name, but a secret will probably have some auto-generated name. So you can't predict that. So there aren't really a lot of options for developers to lock it down as much as we would really like.

Yeah, I definitely would think so. There's always more stuff that we can do to help, but there's always the balance: do we make it too granular, so that it becomes really unwieldy for a developer or an engineer to actually configure? So it's a fine balance, basically. I take your point about secrets, and that's why the score is not that high, because I understand that there are use cases for that, but there are risks associated, especially if you're going cluster-wide. One of the known patterns that you can use is to have essentially a core operator manager that manages multiple controllers across your cluster, each focused on the specific things it's going to run. And they can maybe run against specific namespaces as well. I haven't gone into how I would then handle, I guess, toxic combinations: seeing that this has access to that, and that has access to something else, and with those two together, can I then do something that's a little bit beyond what I've gone through? I've mainly focused on the operator that is controlling the first thing that's deployed and building out from there. So there are different patterns that you can use to potentially break down an operator's permissions and what it's focused on. But I don't know, from an RBAC perspective, how we could go a little bit further with that. Potentially you want to focus more on roles and start having multiple roles that are scoped, rather than having a cluster role that can just do all these different things.

Any more questions? I think that's it. Thank you everyone.