My name is Andrew Martin and I am going to talk about OCI container builds and all the fun that we can have with them. It is quite a privilege to have many of the core contributors to some of the tools I'll be talking about today present at the conference, so I will mention a few names and then recommend you go and see those individuals' talks. So hello, I am Andy, a lover of all things breakable. You could say I'm a build fanatic and an advocate of continuous everything. I'm a founder of Control Plane, which does continuous infrastructure and security practices with a focus on containerized deployments, and I want to talk about containerized builds and how to attack and defend them. So we're going to talk about building stuff, building stuff that we can trust from source material that we have no way of trusting, farm to table. We are going to talk about the properties of safe OCI build systems and ways that we can attack and defend them. I'll look at the current tooling and at untrusted builds: is it ever safe to take arbitrary code and build it into a container? And with regards to containers in general, people have been speaking about this for years, so hopefully it is well ingrained in your minds at this point. What is the heinous root of all evil? Unnecessarily running processes as root. You wouldn't do it on your bare metal host, so why do you do it inside a container? We can see here the wrath of the gods as they pass their judgment down to us. So running as root is not a vulnerable configuration in itself, but it does provide a malicious pivot point if any other layer of the container's defenses is compromised. How could they be compromised? By the --privileged flag, or by sharing kernel namespaces. The reason --privileged as a flag, as a boolean, is so bad is because it turns off seccomp and AppArmor if they are enabled on the Docker daemon — and of course seccomp is not enabled by default in Kubernetes' use of runC.
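Running as non-root inside the image is largely a Dockerfile hygiene matter. As a minimal sketch of the point above — the base image, user name and UID here are illustrative, not something prescribed in the talk:

```shell
# A minimal sketch: create a dedicated unprivileged user at build time
# and switch to it with USER, so the container's processes never run
# as UID 0. Image name, user name and UID 10001 are assumptions.
cat > Dockerfile.nonroot <<'EOF'
FROM alpine:3.19
RUN addgroup -g 10001 app && adduser -D -u 10001 -G app app
USER 10001:10001
ENTRYPOINT ["sleep", "infinity"]
EOF
# Confirm the image would not default to root:
grep '^USER' Dockerfile.nonroot
```

Specifying a numeric UID (rather than a name) also lets admission controls like Kubernetes' runAsNonRoot verify the setting without inspecting /etc/passwd in the image.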
It busts out of most user namespace protections and weakens PID namespace isolation. It allows visibility of, and access to, the host's devices from inside the container. At this point I'm sure you can think of multiple ways to root the host from inside a privileged container. It is a very bad day for everybody. It also turns off — well, grants — all capabilities. So there is always a more nuanced and sensible way to give privilege to something than just using that flag. We saw that root was a prerequisite for the runC breakout (CVE-2019-5736) earlier this year. And so, in order to generally enhance our security posture and defend against the unknowns — zero days, et cetera — we want everything to be rootless: both inside the container at runtime, for defense in depth and to mitigate the general lack of user namespaces (although, again, we will hear more from individuals this weekend), but also outside — our container runtime should not be privileged, because in the case of a container breakout we then have a process tree to follow to execute things as root on the host. And if we are running our container runtime as root, the inmates are running the asylum. We're talking about dockerd, containerd and runC all enabling the execution of untrusted code while running as the root user on the host. So, rootless container builds: this means that the process building the OCI container image does not have access to a root-owned process or daemon on the host, or ideally any subset of root privileges. This is useful for building untrusted images, for example in cloud or PaaS environments, and also for defending against malicious supply chain components. Generally we don't know exactly what our dependencies do at the application tier, and we certainly don't inspect our transitive dependencies unless we are using specific tooling to notify us of vulnerabilities there. So building without the root user protects us from a class of privilege escalation attacks.
Of course, everything in Linux is a file, and discretionary access control to those files is specified by users and groups. So running inside a container as root without a user namespace means that for every other namespace that is disabled, we are moving closer and closer to root on the host. Now, we're wrapped in security modules and various other layers, but we are focusing on defense in depth — the many layers of the onion. Some build tools will create a user namespace to get around this problem; others go to great lengths to avoid doing so. So what is the big deal about user namespaces? They are the perennial Linux kernel security frontier and a difficult piece of code to write, because they spread across so many touch points in the kernel. In a user namespace, UIDs in the guest map to different UIDs on the host — that's ultimately the point. So root in the user namespace has UID 0 and full capabilities, but obvious restrictions apply. And again, a personal favorite of mine that has, as a patch set, almost shipped: the LXD guys are here, and they have been working on shiftfs. They have taken a patch set that was languishing for a few years and brought it closer and closer to fully productionized usage. Again, I would urge you to go and see Christian Brauner's talk tomorrow and examine the issue further with him. What that means is dynamic remapping of user IDs, similar to user namespaces but at the file system level — a different approach with different compromises, but exciting times. So we want reproducibility in our builds as well. This stops people tampering with our build artifacts. There have been real-life compiler attacks such as XcodeGhost, an attack on the Xcode IDE that embedded malware in the compiled binaries and artifacts that were generated.
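The UID remapping just described can be inspected directly on any Linux box. A small illustration — the /etc/subuid entry and the `build` user shown in the comments are hypothetical:

```shell
# Illustration: the current process's UID mapping is visible in
# /proc/self/uid_map - three columns: UID inside the namespace,
# UID outside it, and the length of the mapped range.
cat /proc/self/uid_map
# On a host this is typically the identity mapping "0 0 4294967295";
# inside a rootless user namespace you might instead see "0 1000 1",
# i.e. UID 0 in the guest is really unprivileged UID 1000 on the host.
# The subordinate ranges that rootless tooling hands out come from
# /etc/subuid; a hypothetical entry granting a 'build' user 65536
# host UIDs starting at 100000 looks like:
#   build:100000:65536
```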
That was essentially remote access trojans, or RATs, embedded in those binaries; and the Win32/Induc virus altered the Delphi compiler's output — for those of us who remember rapid application development — with botnet-enabling code in that case. This may sound familiar, and the fix for this class of attacks is reproducible builds. So what are the properties of a reproducible build? A build is reproducible if, given the same source code, the same build environment and the same build instructions, any party can reproduce bit-by-bit identical copies of all specified artifacts. Let's start with operating systems: compiling the same package in two different locations has for many years meant non-binary-identical artifacts. This build-time non-determinism makes Ken Thompson's "Reflections on Trusting Trust" difficult to disprove. We don't know the ultimate hash of the artifact that we expect, so how do we know nothing has changed between two builds? And fundamentally, that change could be in the source code, or the build tooling that we used to build our artifacts could itself have been attacked and compromised. The premise here is that if we don't trust our build tools, we can't trust their outputs. Projects like Debian have been moving towards fully reproducible builds for all their packages. This creates an independently verifiable path from source to binary code. There's a lot of work going on at the distribution level, and we want to be able to extend this to container images and to open source software that we pull from untrusted locations such as GitHub, npm, Maven Central, et al. — and especially compiled artifacts, because they are the most difficult to introspect once they've been delivered into our organization. The ultimate goal here is to be able to run the same build in different locations and get the same output. This is fundamentally a question of trust — and we should trust these men.
From our build tools and environments, through our own software, to the packages and libraries consumed in the build and runtime of the system, the trust goes all the way back to the identity of the user committing the code. GPG-sign your commits, of course. But what happens if, like Ken, we don't trust our build tools and supply chain? Well, we can rebuild the same package in multiple locations. If our builds are deterministic and reproducible, we can hash and sign the outputs and then compare them to other build farms'. Any non-matching builds are then subject to scrutiny. Attacking this system requires either a shared supply chain attack or an individual assault on each node or farm, raising the difficulty bar significantly. This is an independent verification of trust in the things that we're concerned with — the build tool and the artifacts it produces — and this process significantly de-risks that component of our supply chain. Combined with tooling such as in-toto, which is the project that the Debian reproducible builds project utilizes, this requires GPG signatures for each stage of a build. So we can run duplicate distributed builds and check the output signatures as the code is built, to give us some certainty that our code is not compromised — or that all our build environments are equally compromised. So, to achieve this property of reproducibility for OCI image builds, we require local or pinned dependencies. If we don't pin our dependencies, then we are pulling the latest version of whatever operating system package or application dependency our manifest requires. No non-deterministic network calls — no non-determinism in general, perhaps easier said than done. And of course, an identical product every time: no time-based behavior or outputs, identical output ordering (so maybe locale sorting is important there) and bit-for-bit similarity. Finally, signable and tamper-proof output.
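These determinism requirements can be made concrete with an archive step. A minimal sketch, assuming GNU tar (for the --sort, --mtime and --format options) — the file contents and timestamp are illustrative:

```shell
# Sketch of "identical product every time": normalise every source of
# non-determinism in an archive step (file ordering, owners, mtimes)
# so two independent runs hash identically. Requires GNU tar.
mkdir -p src && echo 'hello' > src/a.txt && echo 'world' > src/b.txt

build() {
  tar --sort=name --owner=0 --group=0 --numeric-owner \
      --mtime=@1546300800 --format=gnu -cf "$1" src
}

build out1.tar
build out2.tar
sha256sum out1.tar out2.tar   # the two digests should match
```

Drop any one of those flags (say, --sort=name on a filesystem that returns directory entries in a different order) and the digests can diverge even though the inputs are identical — which is exactly the property signing schemes need to rule out.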
Of course, as the OCI image is addressed by a SHA-256 of its contents, we can easily see when any of these conditions are unmet in an individual OCI build. However, non-deterministic actions like network calls can't be identified in Dockerfile RUN commands, and so require a user to add things like checksum hash validation for downloads in-line — which will fail a build on an incorrect hash and essentially trick the Docker image cache into being reproducible. Speaking of Docker images: again, another talk tomorrow, from Aleksa Sarai, on OCI v2. The v1 spec is a tad ropy in places. It was derived from the state of Golang as of version 1.6, I think — I'm sure he can correct me — and it includes some specific bugs from that version of the language, especially to do with the tar implementation. OCI v2 is underway. It contains some nifty fixes, not only for reproducibility but also a proposal for rolling hashes for better image caching and content distribution. Watch that space — the talk is tomorrow morning, I believe. And there's also an interesting proposal from Brandon Lum for encrypted image layers built into the OCI spec as well. So, on to the final property: hermeticism. Related to the practice of the hermitage — seclusion, isolation, independence. In container terms, our builds should not impact each other, leak state, or indeed be knowable by another build, and we shouldn't rely on things outside the build context. Hermeticism also makes building images in multi-tenanted environments practical. This means that we can share build farms across projects with similar trust boundaries, reduce the cost of running build workers, and generally reduce CI cycle time. Okay, how do we attack a build? If the Dockerfile is untrusted, malicious commands in the RUN directive can attack the host. Or, despite a trusted Dockerfile, a malicious or compromised image specified in the FROM directive has access to other builds' secrets.
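The in-line checksum validation just mentioned can be sketched in shell; in a real Dockerfile the same pattern sits in a RUN directive immediately after the curl or wget. The file name and contents here are stand-ins for a downloaded artifact:

```shell
# The in-line checksum trick: instead of trusting whatever a URL serves
# today, pin the artifact's SHA-256 so the build fails loudly if the
# content ever changes. Simulated here with a local file.
echo 'pretend this is a downloaded release tarball' > artifact.tgz
EXPECTED_SHA256="$(sha256sum artifact.tgz | cut -d' ' -f1)"  # normally hard-coded in the Dockerfile

# Verification step - exits non-zero (failing the build) on mismatch:
echo "${EXPECTED_SHA256}  artifact.tgz" | sha256sum -c -

# Tamper with the artifact and the same check now fails:
echo 'evil' >> artifact.tgz
if ! echo "${EXPECTED_SHA256}  artifact.tgz" | sha256sum -c - ; then
  echo "checksum mismatch detected - build would abort here"
fi
```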
Or a malicious image attacks us via the ONBUILD directive. Or, with Docker-in-Docker — because that requires the --privileged flag for the nested Docker to perform privileged host operations — we can exploit the privilege chain to break out onto the host, as we're running in a privileged container. And of course, we can get out of the container in all the traditional ways by attacking the kernel's syscall surface. So, to protect our builds, what do we do? We can prevent network or internet-bound egress. Why is this needed anyway? It's better to pull our dependencies pre-build, or from a local repository, so that we know they won't change and we can perform validation and scanning on those artifacts before we even pull them into our SDLC. We can isolate ourselves from the host's kernel. VMs are not really in the spirit of containerization, but nevertheless, with certain implementations we're now seeing a blurring all the way from gVisor to Firecracker, with containerd shims all over the place. We can run RUN commands as a non-root user in the build's file system — we don't want access to the Docker socket, or to change global state in the container. And finally, we can run the build process as a non-root user, or in a user namespace, to prevent leakage of UID 0 privileges from the host. We should be in as many namespaces as possible. So, do you want to get burned? This is where we can get burned. I will leave these here for posterity and move on to a comparison of the existing state of build tooling. Okay: Docker build version two, called BuildKit. Among a host of useful features, including secrets mounting, it can run rootless. This protects the system from potential bugs in BuildKit, containerd, or runC. That rootlessness uses RootlessKit, a kind of Linux-native fakeroot emulator, and everything including BuildKit can run in a user namespace. This — as we will see for many of the others — requires the setuid bit set on helper binaries (newuidmap/newgidmap).
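The egress-prevention point above has a concrete knob in Docker itself: the build-time --network flag. A hedged sketch (the snippet guards for machines without a Docker daemon, and the image and tag names are illustrative):

```shell
# One concrete way to cut internet-bound egress during a build:
# 'docker build --network=none' gives RUN steps no network at all,
# forcing dependencies to come from the build context or local mirrors.
cat > Dockerfile.offline <<'EOF'
FROM alpine:3.19
# A step like this would fail under --network=none, proving egress is cut:
#   RUN wget https://example.com
RUN echo "built with no network access"
EOF
if command -v docker >/dev/null 2>&1; then
  docker build --network=none -f Dockerfile.offline -t offline-demo . \
    || echo "docker present but the build failed (daemon unavailable?)"
else
  echo "docker not installed; skipping the actual build"
fi
```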
This is integrated with Docker from 18.06, and it can run as a standalone daemon. Rootless BuildKit can be executed in Kubernetes courtesy of procMount, which is an addition to securityContext — again, more Jess Frazelle goodness. What this does is allow paths in /proc not to be masked. Now, of course, there's still a compromise here, because when we start a container — well, a container doesn't really exist: it's not represented in the kernel by a struct or anything tangible. It is a combination of security modules and namespaces and features — and some nice UX, in Docker's case — that gives us a container. One of the many fudges and kludges and hacks that are required is preventing access to some of the kernel's virtual file system APIs. So masking /proc means masking specific paths in /proc to prevent, for example, a process inside a container tampering with the host kernel. Unmasking /proc should therefore be considered an operation to weigh against your organization's risk profile, as usual. So why is this important? Because BuildKit uses /proc to spin up nested containers. It is also likely, ultimately, to derive an entitlement model inspired by Moby's entitlements proposal — if anybody is involved with that, I would be interested to contribute, so talk to me today. This will allow fine-grained permissions around commands in RUN directives. And finally, BuildKit has an optimized parallel step runner that is unique among OCI builders, as far as I'm aware. Good job, Docker. So, on to Jess Frazelle's img. This is a derivative of BuildKit that re-implements a number of the BuildKit interfaces to be unprivileged. It still requires setuid binaries, as does BuildKit, and a lot of this work is just to get apt to run. What that essentially does is allow sets of subordinate user and group IDs, which is the workaround many of these tools take. As you might expect, it goes all out with namespaces and seccomp, and as it's using a user namespace, it is fully unprivileged on the host. Nice.
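For reference, a hedged sketch of what the procMount setting looks like in a pod spec. The field names follow the Kubernetes container securityContext API; the pod name, image tag and buildkitd argument are assumptions drawn from upstream rootless examples, not something prescribed in the talk:

```yaml
# Illustrative only: a rootless buildkitd pod that asks Kubernetes not
# to mask /proc paths, which BuildKit needs for its nested containers.
apiVersion: v1
kind: Pod
metadata:
  name: buildkitd-rootless
spec:
  containers:
  - name: buildkitd
    image: moby/buildkit:rootless        # tag name is an assumption
    args: ["--oci-worker-no-process-sandbox"]
    securityContext:
      runAsUser: 1000
      procMount: Unmasked                # requires the ProcMountType feature gate
```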
LXC — again, the team are here — has an honorable mention, as it is the old faithful of container runtimes, powering the first versions of Docker and still going strong in Ubuntu. Despite using a different image format, it does actually support OCI containers via an OCI conversion script, and if you use LXC with the script, you can run containers fully rootless. This has not proliferated as much, but it's damn cool. umoci — again, an Aleksa Sarai project — is one of the original OCI manipulation tools, and it takes a different approach to everything so far. Aleksa is a runC and kernel contributor and maintainer, and he uses some funky tricks to simulate rootfulness. The fact that he is a maintainer on those projects means that work required to get umoci to run has gone back upstream, and the commonality between a lot of this tooling is that patches and changes have had to be made further upstream — either in runC or up in the kernel — to support new functionality required just to try and run unprivileged containers. It's transpired to take far longer and be much more difficult than we expected, but many thanks to the individuals doing the work. umoci is rootless and doesn't require any special file systems to be rootless. It currently doesn't use any kernel namespacing — it's all VFS-based — and it should be noted that a lot of tooling in this space operates by recursively chowning a directory to get around ownership issues. This is an overhead, and umoci uses an extended file attribute to get around that particular duration issue. It's a component of a wider build tool and may be extended shortly with its own builder — watch that space. kaniko is, broadly, a Google-sponsored project. It's used as the back end for Knative Build — in fact, Tekton, I suppose, is the successor to that — and targets a number of different build modes, notably Kubernetes, although as you can see, gVisor is a particularly interesting runtime target.
It does not depend on the Docker daemon and starts a container using one of the noted modes. It does, however, use root inside the container. This is so that it has permissions to unpack an image and run Dockerfile commands as the root user; however, this opens it up to some of the issues mentioned previously. I will now ask people — now that I can go and target them directly — whether kaniko was vulnerable to the runC breakout that we saw at the beginning of this year. I'll be interested to get to the bottom of that one. And there is an open issue to do with hardening the runtime against malicious FROM images, which is a little niche, but the fact that these sorts of problems are being considered is probably a good sign about the general security strength of the build system. Buildah is Red Hat's answer to Docker build. It can run in various modes, with the rootless mode being preferable for our use case. slirp4netns, another contribution from the rootless containers project, is an interesting addition. This is more work from Aleksa, and from a Docker engineer called Akihiro Suda who works on BuildKit. Between the two of them and some others, they are pioneering the rootless containers project, which is linked later on, and patching these things back upstream. Network operations are privileged, and so in order to bypass the requirement for root to interact with these long-established kernel capabilities, slirp4netns creates a TAP interface for network namespaces. It's a lot faster than the alternatives and is seeing general adoption as a mechanism for networking in the land of rootlessness. See also Usernetes. Makisu is Uber's response to this set of requirements.
It is similar to kaniko's approach in that it runs inside an unprivileged container somewhere, but it goes to great lengths to ensure distributed cacheability, running Redis for distributed caching and providing local cache options too, including expiration and an explicit #!COMMIT annotation. Another Google tool: Bazel. I'm still not entirely sure. Much hermeticism, but at great cost — RUN commands do not exist, and there is much hoopla, consternation, wailing and gnashing of teeth online about this. But Bazel does fulfill our base requirements admirably, in that it can build anything, and it can do so deterministically — but at the cost of usability, which is always high with Bazel. That is a great shame, as it can build practically anything thrown at it, but you will notice that things like Istio, which started with a Bazel-native build tool chain, rapidly moved away — the adoption tax is quite high. So: a useful tool and a great project, but perhaps not well suited to encouraging contributions. Google Cloud Build. This differentiates itself in a couple of ways: it is a managed service, and it is pretty damn isolated and hermetic, but probably not a usable build target for organizations with the type of security requirements that care about rootless container builds. For everybody else, this is a great option. And of course, it's reliant upon the cloud provider security model, which is not always fully disclosed — shared responsibility, et cetera. Your mileage may vary. Jib is a bit like an opinionated Bazel, very narrowly scoped to Java projects that need to build Docker images, but it has an optimized cache workflow that reduces build time. Is anybody using this here, incidentally? Anybody? No — I haven't met anybody that uses it yet, but it's here for completeness. And again, do we have anybody building Nix? Yeah, yeah, we've got a few. So yeah, Nix is the ultimate reproducible build tool, if you like.
Google's Nixery offers Nix as a container build service, which means that by specifying, in the image name — the path of the image, if you like — the dependencies you would like installed, it will transparently build that image for you as the container is being pulled, immediately beforehand. So: another managed service, albeit a highly specific one, but it creates very small images with an easy interface. So, do these tools fulfill our requirements? Everybody can achieve rootlessness today, thanks to work shipped over the last couple of years, especially in runC, but implementation-specific caveats abound. No builds are unreproducible by design, but output is a function of RUN directive behavior, and as such, only Bazel actually achieves reproducibility without diligent use of the RUN directive. To expound on that slightly: if your organization is able to allocate the engineers to pin all of your package installations in an apt install, then — because generally, when packages are updated, the older version of the package is removed — your build will fail when the upstream operating system package maintainers bump a version. So the question then becomes: do I care more about the deterministic property of my images, or about shipping things quickly and applying controls later to determine whether or not something is malicious, for example? Again, it's a balance. Anecdotally, I see organizations do both. I like pinning everything, but it's overhead, and for a personal project it's probably deeply frustrating. And finally, everything has a varying degree of hermeticism. Nothing is absolute, and we may well need VMs to achieve this full lockdown right now — which is essentially what a cloud provider will do for us. But why choose one? Akihiro has a container builder interface to abstract building. This is here for interest, and there are some benchmarks available, from a point in time earlier this year, on that repo. So what do you think? Untrusted builds should be great — they are almost ready.
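The apt pinning trade-off described above can be sketched as a Dockerfile fragment: the '=' suffix pins an exact package version, so the build is deterministic — and will break when the mirror drops that version. The base image and version string shown are illustrative and almost certainly stale:

```shell
# Sketch of pinning an OS package in a Dockerfile RUN directive.
# 'curl=7.64.0-4' is a hypothetical version string for illustration:
# when the Debian mirror bumps curl, this build fails rather than
# silently pulling a different artifact.
cat > Dockerfile.pinned <<'EOF'
FROM debian:buster
RUN apt-get update && apt-get install -y --no-install-recommends \
      curl=7.64.0-4 \
    && rm -rf /var/lib/apt/lists/*
EOF
grep 'curl=' Dockerfile.pinned
```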
There are attacks against everything that's not a hosted container build service — which is great, but that doesn't help on-prem builds. And in the wild, for the kind of ultimate isolation we're talking about, people are generally using VMs around fully untrusted builds — and by fully untrusted, we mean arbitrary, potentially malicious code pulled from the internet, or potentially from a malicious insider, if we care about that level of security in our organization. I expect us to get close to full trustlessness in 2019, but again, I will verify that with better and smarter individuals over the course of this weekend. So: rootless runC is already here. Usernetes is Kubernetes running unprivileged in a user namespace — again, under the rootless containers project from Akihiro and others. shiftfs is with us at this point. The container-versus-hypervisor battle continues in data centers throughout the land, but building containers in hypervisors seems a sensible compromise for mission-critical builds. And — heresy — unikernels are still here, although they don't provide the most obvious container build context at this time. So, there are exciting times ahead. In the last two minutes, I will just run through two short announcements. One is kubesec: static analysis for Kubernetes resources. We finally rewrote it and open-sourced it. It basically gives you a JSON output with a risk score for your StatefulSet or Deployment YAML. Control Plane are looking for beta testers for a suite of configuration and test tools for cloud-native infrastructure — please hit me up if you're interested. And finally, thanks to all the people who did the work that I am describing. Thank you very much. Any questions? Sufficiently comprehensive. Hello. I mean, it seems pretty clear that not all of those tools can survive, and if you could, like, you know, bet on a racetrack, which ones would be your favorites? From a hosted perspective, Google Cloud Build is excellent and fast and free.
From a proliferation perspective, kaniko seems to be in the lead. A number of the other tools are either experimental or at proof-of-concept stage, and some are customized to the requirements of one organization — for example, the tool that Uber have is built to support a massive build farm, which is why they've added all these additional caching layers in. A note on Google Cloud Build is that they've now integrated the kaniko distributed cache. So, of course, there is a kind of Google-centrism going on there, where they're trying to consolidate their own tooling and projects. They don't all overlap — for example, the work that Jess Frazelle did and Buildah are kind of spiritual twins, almost. So, obviously, BuildKit is probably the most proliferated one, by virtue of being packaged and shipped with Docker; kaniko for building on a Kubernetes build farm; and Google Cloud Build for the managed service angle. So, in classic nerd fashion, mainly just two updates or tweaks: I would also add thanks to Giuseppe Scrivano, because he actually contributed across several projects to get rootless builds working nicely. And Akihiro Suda — he's not a Docker engineer, but a maintainer; he works for NTT Japan. Ah, thank you, yeah — I think I was aware of that from this book. Thanks. Yeah, a couple of comments. One: shiftfs has been rejected by the upstream kernel, and they're trying to work around it by building it into the VFS layer. Basically, they don't want to have another file system layer on top, so shiftfs is one of those projects that's always, like, two years down the road. The other one that Giuseppe's worked on is the thing called fuse-overlayfs, which is basically a userspace FUSE overlay file system. He has some shifting built into it so that he can emulate a file system shift, so if you're going to talk about these, I think you should probably...
fuse-overlayfs is really the thing that's powering a lot of the rootless things — Podman and Buildah and stuff like that are all based on top of fuse-overlayfs. Thank you. Is it at a stage where we could reasonably do a performance comparison? Yes, it's excellent. I did warn you that the experts were present. No, yeah — fuse-overlayfs is getting pretty decent adoption at this point. It's even so stable that several universities have already adopted it and forked it for their HPC environments, so that they can keep a main back-end store — even their own object store — and provide various academic use cases, so that HPC workloads run with whatever UID the user that logged into the system has. It's fantastic. Wow, great. Thank you. Any more questions — or I'm happy to ask you questions? Okay, thank you for your attention, everybody.