I'm Andy from Control Plane: cloud native security engineering and open source security. We do SecOps for Kubernetes, working in regulated industries and financial services, and we espouse continuous infrastructure and security principles. I hold various other positions; I'm very proud to be involved in the Cloud Native Computing Foundation's TAG Security, the technical advisory group. We audit and threat model projects that are coming through the CNCF in order to establish a baseline security posture and make recommendations. I also work for a charity called OpenUK, along with one of my colleagues here, Dawn Foster, advocating to the UK government for the use of open source and also for the financial incentives to actually pay maintainers. We've seen projects come out of the US, such as the OpenSSF's Alpha-Omega project, funnelling tens of millions of dollars into open source maintenance. We would like to see that here, and I'm very keen to talk on either of those subjects, or indeed the one for which I'm here, afterwards.

I've written a few things recently. SANS SEC584 is a security training course, I wrote the book Hacking Kubernetes for O'Reilly recently, and we also run capture-the-flag simulations at the official KubeCon conferences. If you find yourself in Detroit or indeed in Barcelona next year, please do come along and play. This is the book. It is available to download for free at the URL at the top; that is the PDF version, which requires less re-keying, of course, than the dead tree book. What I will talk about today is very much inspired by and based upon the book. There are two versions of this talk. One is a live demo marathon, which is mildly stressful to deliver and has gone spectacularly wrong a number of times. This is the other version, which deals with the concepts and the diagrams and tries to tie things together. There are many works on which I rely and for which I'm grateful to my forebears and betters, and I've also linked to some of the demos that I've done in there as well.

So what will we talk about today? The shared responsibility model. Kubernetes' alleged and promised benefits. Threat modelling: why is it good, and why does it give us a baseline on which to quantify and build our security controls? That question is answered with another question. The supply chain is so hot right now; what's the actual problem? We'll detail some of the ways that it can be devastating. Runtime hardening: if we're running untrusted code, can we mitigate or control for that? And finally, cluster infrastructure and topology: where does that sit in the context of our organisation and our cloud accounts or our colo, and the individuals and colleagues that we have working in and on them?

So the most important part of a successful cloud native migration is the people: responsibilities, patterns and practices, and skills. Let's dive in by looking at some of the prerequisites. Cloud native is complex. It's an explosion of abstractions, and new developers or engineers joining this landscape have a lot of quick upskilling to do. Preventing last-minute security holds on deployments is important to maintain the trust of other engineering teams. So, of course, the biggest problem is a socio-technical problem: it's the people. We probably know this already; it underpins everything else. A way to increase collaboration is by bringing security to the front of engineering decisions. This is a real benefit of threat modelling that we'll dig into later.
Part of the shared responsibility model is an acceptance that an organisation's trusted workloads are running on a third party's potentially untrusted computers. This is a cloud migration: renting compute infrastructure from a third party. Under this model, Kubernetes is almost always managed as a service or a platform; there's not a lot of benefit in running an unmanaged cluster on cloud infrastructure at this point. Cloud providers have proven themselves more than capable of running secure workloads on our behalf, and maintaining the infrastructure then becomes their responsibility. Cloud providers are also shipping trusted platform modules that allow the signing of the firmware used to boot the hardware in the stack we're running on: each component's firmware hash is compared with an internal register of signatures, and it is only permitted to boot if the values match. This is generally a higher level of assurance than we would be able to get from our own data centres.

So, within our organisation, we want to threat model everything from an application's perspective. A developer only owns the application itself; the supporting tools are provided and managed by other teams. The ultimate model of Kubernetes service integration is to have individual teams behave as managed service provider analogues, similar to the cloud model. Why is this so good? Because it segregates responsibility and retains uptime. Many of these components become critical to the availability of an application, and so it sets clear interfaces between the teams; their responsibilities are clear. This means tools and policies are provided to us by operations and security respectively. We depend upon documentation, runbooks, knowledge bases, training, teams and hierarchies, and finally the individuals that compose them, all of which are of course potential points of compromise for our organisation.

So, when we're defining everything as code, this requires a different model in terms of those shared responsibilities and in terms of how that code is stored. Organisational roles converge upon a common communication mechanism, which is Git. Git allows us to harness automation, and GitOps is a pull-based mechanism for applying configuration changes to software or hardware, often on commit or on pull request merge; there's a small sketch of what that looks like at the end of this section. Where traditionally organisations have silos of expertise in specific areas of technology, cloud-native organisations strive to evolve the DevOps model of cross-functional teams, configuration and infrastructure as code, and managed services. And of course, a successful security journey requires collaboration on decisions and outreach to the teams affected when enforcing new controls or policies. Understanding any negative impact of certain security controls may influence the final choice. Automated verification, rotation, and pipeline automation reduce the opportunity for human error in the security of systems, as we well know, and containers make automation easier and more repeatable. These processes occur in all environments, with tighter controls in automated pipelines as they approach production.

So, that is the level set. Let's move into why Kubernetes theoretically has these benefits. I love this diagram. It's from a research paper that's actually linked in the book, but not here. This is the lineage, the tree by which we find ourselves at the point of Kubernetes and containers. The technologies we use come from the 1950s onwards, and we can trace Kubernetes back through Borg, Docker, LXC, Linux, Unix, and Multics. Okay.
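To make that pull-based model concrete, here is a minimal sketch using Flux CD; Flux is my assumption here (the talk doesn't name a tool), and the repository URL, branch, and path are placeholders.

```yaml
# A minimal pull-based GitOps sketch with Flux CD (assumed to be installed
# in the cluster). The repository URL, branch, and path are placeholders.
apiVersion: source.toolkit.fluxcd.io/v1
kind: GitRepository
metadata:
  name: platform-config
  namespace: flux-system
spec:
  interval: 1m                        # poll Git for new commits every minute
  url: https://github.com/example-org/platform-config
  ref:
    branch: main
---
apiVersion: kustomize.toolkit.fluxcd.io/v1
kind: Kustomization
metadata:
  name: production-apps
  namespace: flux-system
spec:
  interval: 10m
  sourceRef:
    kind: GitRepository
    name: platform-config
  path: ./clusters/production          # apply the manifests for this cluster
  prune: true                          # remove resources deleted from Git
```

The cluster then reconciles itself against whatever is merged to main: changes flow through pull requests and review rather than kubectl from someone's laptop.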
If correctly configured, containers can be more secure than VMs, dramatically increasing the fidelity of the security policy it is possible to apply to an individual process. VMs more generally work on a system-wide configuration to filter for malicious or unexpected activity, and in legacy systems, system-wide controls are generally too loose. Containers offer a new opportunity to apply filters at a more granular level, on a per-process basis. And to recap, a container is just a wrapper for a process. It can be used to apply this detailed per-process runtime configuration: a trusted list of system calls, file system accesses, binaries and libraries, capabilities afforded to a process, and more. Containers are just a microcosm of Linux, run multiple times and packed into a larger system.

So the nature of the security context configuration available in Kubernetes gives us this opportunity for very granular security configuration, applying policy just around the process. This highly targeted application of policy serves to reduce the blast radius, the impact of a breach or vulnerability or remote code execution in a web-facing application, because the set of actions available to an attacker is locked down and the container is limited to only those behaviours required to fulfil its purpose. This stops, for example, an extract-transform-load style data workload from being able to port scan other containers, or stops a web application from spawning a shell.

This, then, is a representation of a Google Kubernetes Engine cluster, showing isolation at different layers of the Kubernetes stack. It illustrates the principle of defence in depth. At the centre are the containers; at this point image management and runtime security are paramount, and these are elements of the developer and DevOps sphere of influence. Moving outwards, layer by layer, we transition into operational responsibility in DevSecOps until we reach the outermost level, the project: Google's outer isolation boundary, which is partly like a VPC in AWS, or like separate accounts.

So, DevSecOps: automating operational, integrated security assurance. That's tricky to say. In a sense, this is nothing new; it's just sensible testing. Cloud native offers a slew of new tooling opportunities and paradigms that are informed by the lessons learned in DevOps over the past ten years. Continuous, secure delivery principles. Okay.

So, Kubernetes is supposed to be wonderful. It has a great deal of complexity, but it can be more secure if correctly configured, well tested, and well managed. How do we know which controls to apply, where, and when? Well, in any threat model for any system, we first identify our adversary. In this case, it is Dread Pirate Captain Hashjack: 8-bit monstrosity, scourge of the internet high seas, and Control Plane's archetypal adversary in these situations. He occupies a nation-state level of criminality. That might be analogous to a Fancy Bear or APT 1342: organisations that operate in the shadowy netherworld between state-sanctioned cyber criminality and private online gangsterism, and, as we see on the next slide, they may be involved in anything up to physical intimidation. These adversaries, and our own personal threat models, modulate where we choose to apply controls and what level of defence we go to. And yes, he is also the protagonist of the book, Hacking Kubernetes.

So, this is the adversary matrix that Control Plane uses on engagements with our customers.
The aim here is to identify the capabilities and motivations of the potential attackers of our systems and then define our controls appropriately. So, as you can see, at the top, a vandal or a script kiddie; these may even be drive-by attacks. Minimum viable cloud native security should be the avoidance of shipping known CVEs to production. If we can be remotely exploited with Metasploit, we have not done our job, and in fact we should be embarrassed. Moving down slightly, the motivated individual becomes slightly more difficult, because there may be other mechanisms that they choose to employ; modifying open source supply chains is an interesting version of this. But still, we don't expect a great deal of technical complexity with these attacks.

The insider is where things start to get more difficult, because defending against them relies upon a very carefully modulated and configured set of internal security controls. People should have enough access to do their job, and to be proactive and move beyond their immediate responsibilities, without breaching security criteria. But the knowledge of inside information means that this is a more difficult thing to lock down. Organised crime syndicates: this is where our Captain Hashjack level of adversary sits. They probably have reasonably significant resources; they may have access to purchase zero days on the black market, or they may be generating their own, for example. However, as with all things, there is a limited amount of time available to expend upon a given target. By hardening our systems beyond a minimum bar, we help to dissuade or defeat this kind of attack.

The cloud service insider is an especially difficult problem. When we look at the Capital One hack from a few years ago, that was somebody using privileged information: the account identifiers for Capital One's cloud accounts. They took that information, found a web application firewall bypass, and, combined with a misconfigured IAM policy, were able to perform server-side request forgery, which subsequently allowed them to dump customer information and embarrass Capital One. The origin of that attack was not only knowledge of the vulnerability, but also that vital account information. Finally, foreign intelligence services: there is again a level of personal safety at which one questions at what point one would hand over the keys to one's own kingdom.

With threat modelling, we have identified our adversaries. Next, we want to look at the data flows and classifications through the system; these are imperative for understanding and quantifying our controls. This is where the repeatable, declarative nature of Kubernetes configuration comes into its own, as the barrier to entry for this work is dramatically lowered by reuse. Knowing which controls to implement is then a product of the threat modelling exercise, and here we're mapping the effectiveness of the security controls against the ease of their implementation. This gives us an order of precedence and a suggested implementation ordering. We can see the ease of implementation for pod-level network policy: it's worth doing because it's not difficult to do and it blocks pivoting and lateral movement, whereas something like setting up a service mesh is slightly more difficult. This chart is actually not entirely indicative for service mesh versus pod-level network policy; I would say it's probably slightly the other way around, in that a service mesh is very difficult to maintain and network policy sits slightly lower on the effort scale. There's a small sketch of a pod-level network policy just below.
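To make the pod-level network policy point concrete, here is a minimal sketch; the namespace, labels, and port are illustrative placeholders rather than anything from the talk, and the policy only takes effect if the cluster's CNI plugin enforces NetworkPolicy.

```yaml
# Default-deny ingress for every pod in the (hypothetical) team-a namespace.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress
  namespace: team-a
spec:
  podSelector: {}            # an empty selector matches all pods in the namespace
  policyTypes:
    - Ingress
---
# Then explicitly allow only the front end to reach the web application.
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-frontend-to-webapp
  namespace: team-a
spec:
  podSelector:
    matchLabels:
      app: webapp
  ingress:
    - from:
        - podSelector:
            matchLabels:
              app: frontend
      ports:
        - protocol: TCP
          port: 8080
```

Anything not explicitly allowed is dropped, which is exactly the pivot-blocking and lateral-movement-blocking behaviour described above.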
That chart is one view on how this will land for an organisation with a specific skill set. The supposition here, which I should have made clearer on the slide, is that network security engineering is not high on this organisation's, or this one developer's, backlog. Then we look at something like code signing: the security benefit is much lower and the ease of implementation is a little bit higher there. We can then extract from this a quantifiable ordering of the controls that we would want to implement. This helps us to balance the business value of spending money to build controls, and to test them so that they continue to work while the system is under maintenance or development, against the ultimate cost of the ongoing control. If developers are constrained by overly restrictive controls, they will not be retained at your organisation. Ultimately, there's no point having the world's most secure system if it's not only unusable and unmaintainable, but features are also being shipped more quickly by our competitors.

From the threat model, the next stage is attack trees. These were built by Control Plane for the CNCF Financial Services User Group three or so years ago. We use these to enumerate the possible failure conditions, and we apply controls over the branches of the trees so that we can see the efficacy of a single control in multiple different locations. Again, balancing controls and maintainability with the developer experience, the operator or runtime experience, and the SOC's experience of having those controls alert is fundamental to doing a good job in this way. There is an O'Reilly course on threat modelling Kubernetes, again written by Control Plane; a lot of this information is in the public domain and open source. We also deliver it privately if you would like it for your organisation.

Okay, on to the supply chain. So, the four Cs of cloud native security. We start with the code. The code changes frequently; it is the soft underbelly of a Kubernetes cluster. It is meant to move quickly and to have bugs, because we're delivering features and we're looking to be competitive in the market. We cannot have slow, waterfall-style releases; we're looking to operate in an agile, for some definition of the term, manner. So we shift left and apply as much security testing as we can, but we are also cognisant of the fact that at runtime the code sits on the container's file system. A container image is a tarball that contains a file system, with standard Linux discretionary access controls (the owner of each file and its read/write permissions, et cetera), along with the application and its dependencies. It should be shorn of anything extraneous because, as we can see from things like leaving bash within a container, remote code execution that is able to invoke bash can spawn a reverse shell quite easily via bash's virtual /dev/tcp endpoints. So we should remove these kinds of things. Also, this is where defence in depth comes in: we should always run intrusion detection, because nothing is infallible. The point being, the configuration of the code inside that container image is all important, but we expect it to have a degree of vulnerability because we're not assuring it over a long period of time; we're moving quickly and delivering business value. This is the nature of modern software development. Then we have the container.
The container is the runtime configuration layer, where we can apply kernel-enforced controls like seccomp, AppArmor, system call filtering, and Linux capabilities. This is where we can apply that very fine and granular per-process policy in order to secure the runtime of our applications. This then runs in a cluster. A container is just, as I say, a microcosm of Linux; a cluster orchestrates many of those small instances of Linux across many machines, and attaches storage, service discovery, and inter-pod networking. These, again, are potential points of escalation. Usefully, because we have this declarative orchestrator, we can reuse the same controls across multiple clusters; across different clouds is the dream, of course. Then finally, the cloud or colo. Obviously, remote code execution into a container that holds AWS keys, or account keys for your cloud account, is potentially an escalation from remote code execution to cloud account compromise. So, threat modelling: carefully delineating which process can do what. One of my passions at the moment is workload identity, which essentially removes those static credentials. Long-lived static credentials are the antithesis of hardened deployments; we would like things separated, and certainly not escalatable from remote code execution to cloud account takeover.

So, within the supply chain, we have a trusted producer and third-party dependencies; these might come from Maven Central, npm, or PyPI. If Captain Hashjack can insert his code into one of those dependencies, then potentially he can fire off a reverse shell and essentially gain some form of persistence or command and control within our organisation. So, how do we attack a supply chain? Well, we can get onto the developer's machine and commit as them. We can infect or attack the source repository, which is probably easier if that is hosted inside our organisation. The build infrastructure itself: this is a SolarWinds-type attack, where we modify the code as it's being compiled or as it's being built into a container. We can infect a trusted supplier, which is a dependency of a dependency; obviously that transitive chain continues as deep as the dependency tree goes. We can infect a trusted package that runs as root within the container build itself. All of these count as things that we are consuming from a producer, and actually, if we publish open source libraries, we de facto become a producer ourselves. This is a wide-ranging and relatively difficult problem. Finally, we can infect the runtime environment, at which point much of the rest hardly matters, because we have compromised the infrastructure itself.

What can we do in this space? There are lots of things. Reproducible builds allow us to build the same package in multiple places. If the build infrastructure is compromised in one of them then, with a deterministic and reproducible build, the compiled artifacts will differ in hash, and so we can detect the compromise. Debian, Arch Linux, and various other distributions do this, and the reproducible-builds.org project enumerates them. This gives us some certainty that the build infrastructure we're running on hasn't been compromised. SolarWinds' reaction to the Russian compromise was to deploy cloud-native dual-pipeline build infrastructure in the same way.

Then we've got artifact and build signing. There is some nuance here, because actually it is the contents of the artifact that we care about more than the signature. But the signature will at least guarantee the freshness of the artifact, that it has come from the person who controlled the key at the time it was signed, and that it is the same artifact that was sent from the other side.
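As one concrete shape for signature verification at admission time, here is a minimal sketch of a Kyverno ClusterPolicy that only admits images verified against a cosign public key. Kyverno and cosign are my assumptions here (the talk doesn't prescribe tools, though Kyverno comes up again later), and the registry prefix and key are placeholders.

```yaml
apiVersion: kyverno.io/v1
kind: ClusterPolicy
metadata:
  name: require-signed-images
spec:
  validationFailureAction: Enforce     # reject pods whose images fail verification
  rules:
    - name: verify-image-signature
      match:
        any:
          - resources:
              kinds:
                - Pod
      verifyImages:
        - imageReferences:
            - "registry.example.com/*"   # placeholder registry prefix
          attestors:
            - entries:
                - keys:
                    publicKeys: |
                      -----BEGIN PUBLIC KEY-----
                      (cosign public key goes here)
                      -----END PUBLIC KEY-----
```

With something like this in place, unsigned or tampered images are rejected at admission rather than discovered at runtime.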
Signing is possible with lots of different tools, and there are some fancy new ones in this space.

Software bills of materials (SBOMs). These entered US policy with the Biden executive order at the beginning of the year. An SBOM details the composition of a piece of software: it is the shopping list, the ingredients required to rebuild that piece of software. Again, it's useful when it goes to the right level of depth. These things are nascent; they've probably been in development for three or four years or longer, but they're nascent in terms of adoption, and there are still some intrinsic problems to do with the depth to which these dependency trees are walked. Also, they're currently mostly shipped with open source software. If they're generated at build time, are they then valid? Do they match what you would infer by running the same bill-of-materials discovery tooling on the software once it's distributed? What about vendor software that you can't reverse engineer, or something that's been tree-shaken, or something that's been packed in some way that obscures it? There's also the question of whether these should be available for software-as-a-service products: should our cloud provider provide a bill of materials for the APIs that we consume? The answer to all of these things should be yes, but there is so much nuance and complexity, especially around validating the contents of these things, that we are, as I say, at the beginning of this journey.

Contributor analysis: the Linux Foundation are doing a lot of work here. It is very difficult to unmask a contributor in an oppressive regime when doing so might put them, a journalist for example, at risk of reprisals. It is also necessary to have strong identification guarantees to prevent collusion, because two anonymous accounts could merge a pull request. This is almost intractable; I don't know where we're going to get to with this. The OpenSSF, the Open Source Security Foundation, another Linux Foundation subgroup, attempted to fix this with a developer identity working group that has since folded, I wouldn't call it in chaos, but without conclusion. So there is some difficulty here. Without some kind of KYC, know your customer, or know your contributor in this case, form of identification, it becomes difficult.

And then finally, Semgrep. This is only useful where we have access to the source code. We're at an open source convention today, so it's useful when ingesting third-party code into our organisations to mitigate third-party code risk. We can run some of these modern static analysis tools that operate on abstract syntax trees; they can find code patterns in the AST across multiple languages, and Semgrep is the most interesting tool in the space. There's a tiny example of a Semgrep rule at the end of this section.

So there are plenty of things that we can do. The complexity of these is huge; this is where I spend a lot of my time developing things for customers as well. Without a threat model, it becomes a question of throwing the kitchen sink at something, and so, again, delineating exactly what's needed is all-important.

This is the nature of a reverse shell, just briefly. The ultimate goal of an attacker is to gain command and control from within the estate. I guess from a zero trust perspective we don't like to think about perimeters anymore, but literally we do firewall things, and the things inside those firewalls should then mutually authenticate at all times. But nevertheless, a reverse shell calling back out is not what we want to happen.
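As promised above, here is a tiny sketch of a Semgrep rule; the rule id, pattern, and message are made up for illustration, but the structure (rules, pattern, message, languages, severity) is the real rule format.

```yaml
# Flag shell-spawning calls in third-party Python code before ingesting it.
# Save as e.g. rules/no-os-system.yaml and run: semgrep --config rules/ <path>
rules:
  - id: third-party-no-os-system
    pattern: os.system(...)
    message: "os.system() call found; review before ingesting this dependency"
    languages: [python]
    severity: WARNING
```

Rules like this operate on the abstract syntax tree rather than raw text, so the same approach works across the languages Semgrep supports.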
Okay, I will whizz through some of these in the interest of time. This part is a little bit more book-heavy, so this is my view on container namespaces and how they look in a pod. What is a container escape? If you watch the demo version of this talk that I did at KubeCon, I demo the Dirty Pipe container breakout: writing into root-owned memory and firing a reverse shell from it, which essentially spawns a shell on the host, owned by root, from inside a container, which is mildly terrifying. There have been lots of container escapes. Ultimately, this is a good thing: we want software that keeps on moving and is developed rapidly, and part of the price that we pay for that is vulnerability. There's no way that we would want to move so slowly that we assure and pen test everything. Everything is a balance, of course. Part of my work with OpenUK is ensuring that security professionals are applied to some of the open source projects that are critical to us all and to the safety of the internet. This is a view of how a reverse shell from a dependency actually connects back to the attacker; again, the book goes into a lot of detail on these things and I will not bore you with it now.

What do we do with malicious code in a container? Pod security context and capabilities are imperative, because they are the security controls exposed to us by the Linux kernel, abstracted all the way up from the kernel through the container into Kubernetes. They are low-level security tools that we should absolutely use as our first line of defence; there's a small sketch of a hardened pod security context at the end of this section. Image scanning: minimum viable security is preventing the shipping of those CVEs to production. Intrusion detection, because nothing is entirely secure, and we can also do some interesting build-time behavioural analysis. There are a couple of projects at the moment that we're working on; one is to do with smart contracts and distributed compute and data for very large data sets on IPFS and Filecoin. Identifying a malicious or anomalous build is very noisy; again, happy to talk about that afterwards, there's a lot going on there.

Container penetration testing: this is my personal view on how to penetration test a container workload. Again, it is just Linux, but the book goes through this step by step. We also have this opportunity for next-generation sandboxing: running virtual machines around containers, or a combination of containers and virtual machines, using things like Firecracker and gVisor. These projects do insane things. gVisor has re-implemented the Linux kernel in Go to avoid the memory safety issues inherent in C. Firecracker takes what QEMU does and slices it right down; it boots a virtual machine where, I think, Ctrl-Alt-Delete is about the only thing supported by the keyboard driver. They boot very quickly. They move untrusted application code one degree back from the Linux system call interface, which is written in C and is vulnerable to everything that C is vulnerable to: memory corruption and safety issues and all that bad stuff. gVisor is written in Go and Firecracker in Rust; these are memory-safe but still relatively low-level languages. They come at a cost of some complexity and debuggability, which is the ultimate balance.

Okay, a little bit of Kubernetes. Again, what is a kubelet? The point of this slide is more that it's covered in the book.
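As mentioned above, here is a minimal sketch of the kind of hardened pod security context the book walks through; the names and image are placeholders, and exactly which settings a given workload tolerates will vary.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: hardened-webapp                          # illustrative name
spec:
  securityContext:
    runAsNonRoot: true
    runAsUser: 10001
    seccompProfile:
      type: RuntimeDefault                       # apply the runtime's default syscall filter
  containers:
    - name: webapp
      image: registry.example.com/webapp:1.0.0   # placeholder image
      securityContext:
        allowPrivilegeEscalation: false
        readOnlyRootFilesystem: true
        capabilities:
          drop:
            - ALL                                # drop every capability the app doesn't need
```

None of this is exotic: it is the kernel's own machinery (users, capabilities, seccomp) surfaced through the Kubernetes API, which is exactly why it makes a good first line of defence.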
Hard multi-tenancy, though, is difficult, and actually segregating workloads with different data classifications in the same cluster is verging on impossible, because nodes are not namespaced. The concept of a node is not namespaced in Kubernetes: we can have a node pool, but we can't have per-namespace nodes without abstractions and other code on top.

Layers of security testing; I'm just whizzing through because I'm almost out of time. Again, this is expounded upon in the book, but the point is that all of these things must be tested automatically. We also care about constraining tenants with namespace policy. We should define our namespaces based upon the security controls that they offer, rather than the abstraction that is nice for an operator to interact with, and there are a few different ways to slice this particular pie. Open Policy Agent and Kyverno are very useful.

We're almost done here, so what are the risks of using Kubernetes? They are manifest; here is that list for posterity. Again, I sort of appeal to my own authority in a fallacious way: it's all in the book, and there's plenty more in there. It's always worth keeping Kubernetes updated because, even though they've extended the lifetime window for each release, things do fall out of maintenance relatively quickly. And Control Plane do this for a living; it is a passion. We've got 50 people across Europe and Australasia. If we can be of any help, please do come and talk to me or approach us via the contact form. Thank you for your attention.