All right. Good morning. Hello, everybody. My name is Andrew Martin, and I want to talk to you about supply chain security in containerized systems. I also have Maya Kaczorowski from Google to thank, because we have been working on this together; we would have presented it together, but she couldn't be in Berlin today. I'm a founder at Control Plane, which does continuous infrastructure security and DevSecOps-flavoured work for containers and Kubernetes, and I've done a little bit of everything: development, database administration, pen testing, architecting. I want to talk about supply chain security and how we can lock down things that we may not even know about. So what is a supply chain? It's anything that we depend upon. In a military context, for example, every piece of hardware and software has to be tested, and the people who built it vetted, to protect against nation-state attacks. Pharmaceutical companies likewise need to know the provenance of every part of their supply chain, because people are ingesting these things. And of course, kittens need their next treats. The supply chain for these kittens isn't just the hand that feeds them; it's also the distributors of the frisky bits, the warehouse they were in before home delivery, the manufacturers, the farmers that raised the chickens, and the food those chickens were fed. Supply chains can be really long and really difficult to track. And when there are no guarantees that the upstream supply chain is competent, tracking where things came from becomes even more important, especially if we're feeding the emperor's cats. So how does this relate to software? This is grossly oversimplified, of course, but ultimately the software supply chain is any code that ends up running in production, with modern development and deployment processes, and CI and CD pipelines, in the middle.
Software supply chains can be exploited in several ways: bugs in the libraries that applications depend upon, as with Equifax and Struts; deliberate vulnerabilities snuck into popular dependencies, either in the source code or by compromising the hosting service or the infrastructure provider; or a compromised download, a man-in-the-middle attack, or typosquatting, which is especially popular now. Typosquatting is when, for example, a library whose name is hyphenated is re-uploaded without the hyphen, so there are two parallel versions. People happily resolve that dependency, it's got exactly the same code, and once it's got 100,000 downloads a day, the attacker adds some code which, for example, captures the environment and posts it off to a remote endpoint. That means if you're on a build server and you're deploying to production, your build keys are being posted remotely. So when we talk about supply chain security, we're talking about protecting against all of these. The Equifax hack was the exploitation of a vulnerable Struts library, of course, compounded by perimeter-focused network policy that allowed database access and subsequent exfiltration. We have also discovered recently that they had a TLS-intercepting middlebox that had an expired TLS certificate and failed open by default. So they were doing deep packet inspection on all their database traffic, but the hardware failed open because of an expired certificate. Marvelous. Detecting the vulnerable Struts library would have mitigated that in some ways: the CVEs were published and well known, and simply scanning dependencies would have caught them. And then a malicious but legitimately signed version of CCleaner, which is a Windows cleanup tool, was delivered to users via the official download servers sometime in 2017. The signing keys could have been compromised at multiple points in the pipeline.
A similar attack happened with Kingslayer, where a signed but malicious binary was distributed. A compromise of the weakest link is game over for users, and point solutions are not enough. TPMs, Git signing, binary signing, network encryption, and reproducible builds are all examples of point solutions: signing needs PKI and roles; reproducible builds assume trusted inputs; with binary signing we still have key management problems. And an untrusted artifact deployed despite best efforts in the pipeline is, again, potential game over. So compliance is still a problem. How can we make sure that all actions were actually performed by the right party, on the right artifacts, and that they produced the right results? As we know, VMs don't offer the same portability as containers; they tend more towards manual adjustment of in-place monolithic apps with configuration management. Containers make things a lot easier because of the dream of immutability: we're essentially running in production the same artifacts as we're building on our local machines, with some exceptions, of course. We're probably running a different kernel version, the networking is probably slightly different, and our configuration is also slightly different. But those things aside, we have more homogeneity than we have ever had. Containers are meant to be immutable and frequently redeployed, but what isn't solved? Dependencies; process execution at runtime; mounting paths in from the host, which is breaking out of the mount namespace; certificate bundles; and the configuration of the file systems that we bundle inside the container images. So these are the theoretical stages of the pipeline. Security needs to be baked in, as we know; the shift-left mentality suggests that we bake it in as early as possible. So what can we do at each stage of the pipeline?
Start off with controlled base images. Any external images, for example things that come from the Docker Hub, should be pulled, re-tagged, and pushed into a local registry where they can be subject to your organization's internal compliance requirements and scanned for CVEs on a regular basis, so we're not just essentially piping curl to bash, which is what pulling straight from a public Docker registry can amount to. On image tags: in the same way that a Git tag references an immutable content hash of a Merkle tree, a Docker image digest references immutable content, but the tag itself is mutable. The latest tag can be changed and is not guaranteed to be pointing at anything in particular at any given time. Hashes are secure; tags are transitory and a possible risk. Of course, we need to statically analyze code, ideally in the IDE, and at this point we should be analyzing our dependencies too. This includes pulling down vulnerability feeds for the popular package managers like NPM, and all the major programming languages have one, and ensuring that when those dependencies are marked as insecure, in the same way as is done for operating system packages, we actually pay attention and upgrade those libraries. Once again, we can denormalize them into a local cache, like Artifactory or Nexus, and apply our controls within the boundaries of our organization; not quite the same as perimeter security, but rather control of dependencies. Then we want hermetic builds. This is about, again, not pulling directly from the internet, and it also means no inter-build data leakage. For example, Jenkins build slaves sharing a single Docker socket have access to all of the other build images and all of the layers of those images, so even if one build adds a secret in a layer and then squashes the final layer, there is still evidence of that secret lying in part of the UnionFS file system on disk.
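The pull, re-tag, push flow and the tag-versus-digest distinction can be sketched in a few lines of shell. The internal registry name is hypothetical, and the mirroring commands are left as comments because they need a running Docker daemon; the `is_pinned` helper simply shows that only references containing `@sha256:` are immutable.

```shell
#!/usr/bin/env bash
set -euo pipefail

# is_pinned: succeed only when an image reference is pinned to an
# immutable content digest; anything referenced by tag alone can be
# changed underneath you.
is_pinned() {
  case "$1" in
    *@sha256:*) return 0 ;;
    *)          return 1 ;;
  esac
}

is_pinned "alpine:latest"              && echo immutable || echo mutable
is_pinned "alpine@sha256:3d2e482b82f9" && echo immutable || echo mutable

# Mirroring an upstream base image into a local registry (hypothetical
# name), where it can be scanned and kept under internal compliance:
INTERNAL_REGISTRY="registry.example.com"
#   docker pull alpine:3.8
#   docker tag  alpine:3.8 "$INTERNAL_REGISTRY/base/alpine:3.8"
#   docker push "$INTERNAL_REGISTRY/base/alpine:3.8"
```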
Of course, again, we're caching these build dependencies. Pinning versions for deterministic builds is somewhat divisive. First of all, most repositories will remove the previous version when they upgrade to something else; it depends on which package manager you're using, but generally this means that if a pinned dependency is updated, our builds will just break. Now, arguably, that's better, because we're taking an upgrade path that we know about. On the other hand, we can trust semver to some extent, and there is a limit to the amount of effort worth investing in order to actually vet the upgrade of, for example, coreutils. Maybe we care about that because we've got a really heavy, Bash-dependent application, and they've changed the output of ls to include slashes, as caused quite a ruckus a couple of years ago, but do we really care about that? So deterministic builds with containers is, I think, a question of organizational capability to actually go and upgrade things individually and manually. We would be remiss not to mention Bash safe mode at this point. It's all very well having these builds wired together in build chains, but if we're running Bash without exit-on-error or without set -o pipefail, then we are at risk of running untested code paths. And of course, rootless builds. Docker has been heavily criticized over the years for running a root daemon, which means that breakouts and code execution in the daemon's context execute on the host as root. Rootless builds are a halfway house: they mean that privilege is not required to build containers. This has been hampered by the fact that user namespaces are not really up to snuff; however, there is a new breed of tools, in this case from SUSE, Jess Frazelle, Red Hat, and Google, that all more or less do the same thing, which is remove the requirement for root to build Docker images.
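The Bash safe-mode point deserves a concrete two-liner: without pipefail, a failing command hidden behind a pipe is silently masked by the last stage of the pipeline, and the build carries on with bad data. The subshells below simulate exactly that.

```shell
#!/usr/bin/env bash
set -euo pipefail  # -e: exit on error; -u: unset vars are errors;
                   # -o pipefail: a pipeline fails if any stage fails

# Without pipefail, the pipeline's status is that of its LAST command,
# so the failure of the first stage disappears:
( set +o pipefail; false | true ) && echo "unsafe: failure masked"

# With pipefail (inherited by the subshell), the same pipeline reports
# the failure and we can act on it:
( false | true ) && echo "safe: passed" || echo "safe: failure detected"
```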
Although, the class of build-time attacks they mitigate is more a matter of best practice than something seen in the wild right now. So where are we? Application image scans. We should be ensuring that we're not shipping CVEs to production, of course. This covers operating system components, installed binaries, jars, tarballs, and in some cases many more things. Policy: we care about discretionary access controls. There's no point shipping these very secure images if we use setuid binaries, because that is a classic old-school path to privilege escalation. We should not be shipping any secrets in code; everything should be immutable and open to scrutiny, and we inject our configuration in the classic twelve-factor style, he says, without using environment variables, of course, because environment variables leak on the host. So do as Kubernetes does: an environment variable pointing to a file that you can then resolve dynamically. All of this is really about making it easy for developers to do the right thing, by not allowing them to use the footguns, by putting this compliance into the pipeline, by making sure they have clear, obvious, deterministic error messages, and by allowing them to remediate these things fairly easily. This is also a classic case of many heads doing different things: if your organization can centralize this sort of thing, it saves it being re-implemented on a per-project basis, of course. We are now at the deployment stage. We have followed all this best practice, and we've got a container that we want to push into production.
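The file-based configuration injection described above can be sketched as a pod manifest; all of the names here (pod, secret, mount path) are assumptions for illustration, and the point is only that the application reads a mounted file rather than the process environment.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Sketch of config injected as a mounted file, not an env var, so the
# secret never appears in the process environment on the host.
cat <<'EOF' > pod.yaml
apiVersion: v1
kind: Pod
metadata:
  name: app                          # hypothetical pod name
spec:
  containers:
  - name: app
    image: registry.example.com/app:1.0.0   # pin by digest in real use
    volumeMounts:
    - name: app-config
      mountPath: /etc/app            # app reads /etc/app/config at start
      readOnly: true
  volumes:
  - name: app-config
    secret:
      secretName: app-config         # created out of band, not baked in
EOF
echo "wrote $(wc -l < pod.yaml) lines to pod.yaml"
```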
Here we use admission controllers to validate all the previous steps. Because we have the immutability of this image, we can attach metadata to it and use that at admission time to the cluster to validate that all our previous steps have not only been passed but have not been circumvented, so that a malicious user with access to the Kubernetes API is not able to deploy arbitrary attack code. One of the things that can be done here is to ensure every image comes from a registry that is defined inside the cluster, my registry.controlplane.io for example. Then, as an attacker with access to the API, if I try to run something from the Docker Hub, a very simple piece of static analysis on the pod at admission time says: you've got the wrong registry prefix or FQDN, you're not coming in. We'll go into more detail on that later. And finally, runtime configuration. It's all very well to configure these images safely, but the major difference between a virtual machine and a Docker container is that it is so easy to undo all of the security on a Docker container. Run unconfined, without seccomp, without AppArmor, mount all the host devices into the container, and there's zero security there; --privileged is probably the worst-named flag in the history of computing. Right, so with all these things in line, we've got a theoretically nice promotion to production from the developer's machine, but we need to enforce all this governance, and ultimately we have a very different security model from VMs. With containers, we have a single enforcement point when the images are deployed, and we can control exactly what's in the infrastructure. Again: content addressability, and beautiful containers.
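The registry-prefix check is simple enough to sketch in a few lines. The allowed prefix is the registry named above; in a real cluster this logic would live inside a validating admission webhook rather than a script.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Admit a pod's image only when it comes from the trusted internal
# registry; anything else (Docker Hub, attacker-controlled hosts) is
# rejected before it can be scheduled.
ALLOWED_PREFIX="registry.controlplane.io/"

admit_image() {
  case "$1" in
    "$ALLOWED_PREFIX"*) echo "admit: $1" ;;
    *)                  echo "deny: $1 (wrong registry prefix)"; return 1 ;;
  esac
}

admit_image "registry.controlplane.io/app:1.0.0"
admit_image "docker.io/evil/backdoor:latest" || true
```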
This means that we can scan an image once, deploy it to production, and continuously scan the same version offline in a non-production environment, knowing that if a zero-day is suddenly announced, we can scan what is, by hash, a binary-identical container to what is deployed in production, without having to reach into the production environment with those particular tests. Containers are also layer-based, of course, which means vulnerabilities can hide in layers that are not visible in the merged UnionFS top read-write layer that the container actually runs from. Different container scanning tools deal with this differently: some will scan every layer, some just scan the surface, the top layer. So, on to the actual pipeline. At this point I would like to heavily caveat the rest of the talk by saying that a lot of this tooling is either in development or yet to be proven out at any serious scale. However, in typical form, some of this stuff comes from inside Google where, as they did with Kubernetes and Istio and various other projects, they're open sourcing something that they're using as part of their internal process. So, caveat: this is not all production-ready; and a slight double caveat: that provenance is being used as an appeal to authority instead. There we go. So: you write your code, you know your dependencies, your source is reviewed, you've built everything yourself with hermetic, reproducible builds, you've scanned your images, you've enforced specific requirements at deployment time, and then when something changes, you rebuild and redeploy the image to fix the issue, starting all the way from the beginning. Everything is promoted through a pipeline: classic continuous integration. These are the tools playing in these spaces. There are commercial versions of some of them; we've gone for the open source ones here. Obviously there are compromises made: open source versions are generally slightly less fully featured.
We'll go through all of these in some detail. We know about Docker and the base image, so we won't go into any detail there, and we'll start with The Update Framework and Notary. The Update Framework (TUF) is a secure distribution mechanism. It's been re-implemented in various different forms. It features offline root keys, in the style of a globally trusted root store, and ephemeral keys to prevent temporal replay attacks, and the idea is that it's built to resist compromise. It's used right now in a few different implementations, one of which is secure updates for automotive code, and it has been deployed as Notary by Docker: Docker worked with the author of the spec and built an implementation called Notary. Notary signs and validates images. Through signed collections, it supports relationships where versions are dependent upon other versions, with survivable key compromise and signing delegation. Best practice, of course, is to store the master key offline; this is very similar to GPG. Transparent key rotation is another feature, and a kind of interesting one, because I think if your organization is breached, perhaps you should admit that publicly. So, let's keep going. This is broadly how The Update Framework works; it's here for completeness. In the Notary flow, essentially what we're doing is putting a Notary server inside the trusted registry, and then using it to validate that the signed key we have is the correct one when we use that image. This can be enabled with content trust, via an environment variable or the command line. At this point, we are ready to scan images. Now, as I said, different things are done in different ways by different tools. Some will just scan installed operating-system package-manager versions; others will check file-system permissions for all entities.
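Enabling content trust really is just an environment variable; with it set, docker pull and push go via Notary and refuse unsigned or badly signed images. The server URL below is an assumption for a Notary instance deployed next to a private registry, and the pull itself is left commented because it needs a Docker daemon and a signed image.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Turn on Docker Content Trust for this shell; subsequent pulls/pushes
# are validated against Notary signatures.
export DOCKER_CONTENT_TRUST=1

# Point at your own Notary server (hypothetical URL); without this,
# the Docker-hosted default is used.
export DOCKER_CONTENT_TRUST_SERVER="https://notary.example.com:4443"

# With trust enabled, an unsigned image is refused:
#   docker pull registry.example.com/base/alpine:3.8
echo "content trust enabled: $DOCKER_CONTENT_TRUST"
```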
They'll look at application libraries, package manifests, jars and similar archives, manually installed binaries, malware, rootkits, backdoors, file system properties, extended attributes, exposed ports, commands and entry points, and secrets. But it is a mix, depending on which tool you use. As I said before, this is an image divided by layer: what we actually see at the top is the thin read-write layer that transparently presents all the other layers. Different tools do things in different ways, of course. Clair has now come into the Red Hat family and is due to see some more active development, having stagnated for a little while. Aqua's MicroScanner performs a subset of their enterprise feature set, and Anchore likewise have an open source version with an enterprise upsell. The enterprise scanners do a whole lot more: Twistlock and Aqua both add intrusion detection and a host of policy and compliance features. Really, it depends upon your organization's security posture; I would argue that IDS is always relevant as part of a defense-in-depth strategy. But this is the most important point: we should never be pushing CVEs to production, because script-kiddie exploitation of published CVEs with Metasploit is all too easy. There we go. So, pipeline metadata. I spoke about this earlier with enforcement at admission-control time, but this is ultimately saying that we can take all of the information produced in the pipeline and use it for policy decisions later. Grafeas is the external realization of Google's internal binary authorization tooling; that is binary as in a binary, not as in Boolean. It's an open source project that governs the software supply chain, essentially a structured metadata API. It's somewhat on rails: they guide you in a certain direction.
Broadly, what we can do here is use our image scanning tools to create metadata, which we then submit to Grafeas, and we get a holistic view of all of our containers. I could go on about this for a while longer, but I'll leave these slides here for posterity; this is roughly how Grafeas works, and here, excuse me, is a process flow. Again, I won't go into too much detail because there's not quite enough time. And yes, of course, it's on rails. There has been some internal flux, but it's still moving in the right direction, and there is a lot of activity on the repository right now. However, you cannot chain together assertions to assert the integrity of the whole supply chain, at least not with cryptographic certainty. Introducing in-toto. This comes from the same place, the NYU cybersecurity lab, that Notary, or rather the TUF spec, came from. And full disclosure: we are sponsoring the author's PhD to develop the software, because I believe in it very strongly. Essentially, it is the signing of arbitrary events: you have some inputs, an event occurs, and you have a product. Everything is then signed with GPG keys, and those are backed off onto an individual identity. Instead of running PKI, Control Plane loves Keybase, and we back off onto that instead. This essentially validates everything from the user's Git commit through each stage of the build chain. Each stage is signed, so we know we can trace it back all the way to the user, who has been using a YubiKey, of course, so they've got hardened devices against compromise. And then we get all the way to deployment time, and we're able to re-validate each individual step cryptographically, to be sure that what we think we're doing is actually the case.
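A conceptual sketch of what in-toto records for one step: the hashes of the step's materials (inputs) and products (outputs), tied together in signed "link" metadata. Real use goes through the in-toto tooling with GPG keys and a signed project layout; here we only mimic the hashing with sha256sum to show the shape of the data.

```shell
#!/usr/bin/env bash
set -euo pipefail

workdir=$(mktemp -d)
cd "$workdir"

echo 'int main() { return 0; }' > main.c           # material (input)
material_hash=$(sha256sum main.c | cut -d' ' -f1)

cp main.c app.src                                   # stand-in "build" step
product_hash=$(sha256sum app.src | cut -d' ' -f1)

# The link ties inputs to outputs; a verifier later recomputes the
# hashes and checks each step's signature back to the layout's root of
# trust (the signing is omitted in this sketch).
printf 'step: build\nmaterial: %s\nproduct: %s\n' \
  "$material_hash" "$product_hash" > build.link
cat build.link
```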
This protects against things at the level of malicious internal actors, which is really an insane security boundary to even be talking about, because it's a very difficult thing to protect against. So we have these individuals defining the metadata layouts, and yeah, there's again more information in these slides than I have time to go through, but we're super excited about these tools. They're very powerful. As I say, they're still nascent, still under development. This is a glimpse of what we think and hope the future will look like, rather than necessarily what exactly can be done today. And I will just leave these as well, because I've got a couple more things to get through. in-toto and Grafeas look like they're going to merge, because Grafeas doesn't care about the security and verifiability of what it does; it cares more about the metadata that it collects. So the two projects will have an unholy union, perhaps, at some point soon. This is the in-toto flow, and as you can see here, we've backed everything off with an admission-control webhook in Kubernetes, so at deployment time we're able to verify what we've been doing at build time. And finally, on to admission control. This is the API server's ultimate test phase. The API server receives a call, an HTTP call, because kubectl is just doing HTTP; some of the calls, like exec, upgrade to WebSockets, but generally it's just HTTP. We authenticate the user, we apply RBAC to ensure they can do what they're trying to do, and then we run these webhooks. A mutating webhook will take a pod as input and maybe alter a service account, or maybe remove something it doesn't like; then we validate the schema, and finally we validate the admission itself. This is the point at which we can apply policy, because we are able to perform what is essentially static analysis on the YAML.
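A validating webhook is registered with the API server through a ValidatingWebhookConfiguration; a minimal sketch follows, in which the service name, namespace, and path are assumptions, and the CA bundle is a placeholder for the certificate that signed the webhook's serving cert.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Register an external policy service that is consulted on every pod
# CREATE; failurePolicy: Fail means the cluster fails closed if the
# webhook is unreachable.
cat <<'EOF' > webhook.yaml
apiVersion: admissionregistration.k8s.io/v1beta1
kind: ValidatingWebhookConfiguration
metadata:
  name: image-policy
webhooks:
- name: image-policy.example.com
  rules:
  - apiGroups: [""]
    apiVersions: ["v1"]
    operations: ["CREATE"]
    resources: ["pods"]
  clientConfig:
    service:
      namespace: image-policy
      name: image-policy
      path: /validate
    caBundle: <base64-encoded-CA-certificate>
  failurePolicy: Fail
EOF
echo "wrote webhook.yaml"
```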
Once we've analyzed it, and pod security policy, for example, is a validating admission controller, we can extend that to say: let's extract just the image name, and because that's immutable and we've got a hash, let's check it against, for example, our vulnerability database, make sure that things have been scanned, and check it against Grafeas to make sure it hasn't got any critical CVEs. Of course, that response from Grafeas will change with every new CVE released against something deployed in an image, so you might be able to deploy version X of your image one day, and the next day, because a CVE has been released, exactly the same deployment will not work, and you'll have to go back to the beginning of your pipeline. Ideally, you wouldn't find out at this point, of course; you'd be scanning constantly, earlier on. Kritis is the admission controller for Grafeas, and again, it only works with Google's internal metadata API right now, but it is rapidly being open sourced. Portieris, from some of the folks at IBM, is an admission controller for Notary. It only supports a small subset of setups right now, because people can run Notary privately; the PR for using private registries landed last week, so this too is rapidly gaining maturity. In summary: it's really easy to get these things wrong, and it's really easy to get anything wrong in the whole supply chain. That guy escaped with his life, happily. These are the tools that we think will be of use over the next three to nine months or so as they mature, and with that, thank you very much for your time.