Thank you all for coming. My name's Ethan, and I'm going to be talking about how we're signing and verifying container images in a Kubernetes environment at Datadog. This is a talk about Datadog's internal infrastructure, but that infrastructure ultimately supports the Datadog product, which is an observability and security platform. We run at a very high scale, and I have a few numbers on this slide to illustrate that. The important thing to remember for this talk is that we run on self-hosted Kubernetes: hundreds of clusters, tens of thousands of nodes, and hundreds of thousands of pods. So the solution we design for signing and verifying images needs to work at that scale.

A bit about me: I'm a senior software engineer at Datadog, where I've been working on infrastructure security for about four years now. If you have any questions about the talk, feel free to approach me afterwards, or you can reach me at the address on this slide.

So why should we care about signing and verifying images? On this slide, I have a simplified model of what an internal software supply chain targeting a Kubernetes environment looks like. It has all the components you'd expect: version control, CI, CD, a container registry, and then the various components of Kubernetes. As an organization matures, each of these stages is often served by multiple subservices, and as the complexity grows, you have more surface area to secure. Each of these stages is a target for malicious code injection: an attacker could inject a malicious payload at any stage in this pipeline, and that payload might eventually reach prod, allowing the code to run even if the attacker doesn't have direct production access.

What signing and verifying images gets us is a guarantee of provenance. If we sign an image in CI and then verify it in one or more downstream systems, this provides a guarantee, from the perspective of the verifier, that the image it's handling is bit-for-bit the same as the image that was built and signed in continuous integration. The overall goal is to push signing as far as possible to the left, as close to build time as possible, and to push verification as far as possible to the right, as close to run time as possible, to extend the scope of the integrity guarantee of that signature.

In the past, there have been a variety of approaches to signing container images, but the ones that have achieved the best adoption all have an architecture pretty similar to the one I've drawn on this slide. The common property is that the registry used to store application images is also used to store signature metadata. This is made possible because registries following the OCI specification are able to store data that's not what you would normally think of as a container image: as long as it fits the spec, you can store essentially any kind of blob, including signature metadata. Signing and verifying itself is achieved through public key cryptography. The signing client uses a private key to generate a signature, and a corresponding public key is used by the verifier to verify that signature. But the way in which you distribute public keys to the verifier is largely left to the implementer.
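To make that primitive concrete, here's a minimal Go sketch of sign-then-verify; the talk later mentions that the production system uses Ed25519 via HashiCorp Vault, but the in-memory key handling and the placeholder payload here are simplifications for illustration only.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// The signer (e.g. CI) holds the private key; the verifier
	// (e.g. the container runtime) only ever sees the public key.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}

	// In practice the payload identifies the image by digest; this
	// fake digest stands in for the real signature payload.
	payload := []byte("sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef")

	sig := ed25519.Sign(priv, payload)

	// Verification succeeds only if the payload is bit-for-bit
	// identical to what was signed.
	fmt.Println("verified:", ed25519.Verify(pub, payload, sig))
}
```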
There's really no standard root of trust to connect the signer to the verifier, and there probably never will be, because each organization is going to have its own threat model and its own preferred way of handling secrets. I've spoken in generalities here, but two systems you might have heard of that have this high-level design are Sigstore's Cosign and Notary v2, and the system that we've built at Datadog has this high-level design as well. So we're using the container registry as storage for signature metadata, and this stands in contrast to some of the earlier approaches that used more traditional databases.

There are a few interesting things to say about this. First, I think the success of this design is largely because container registries aren't a new runtime dependency, and this makes the system easier to adopt and more reliable, because there are fewer moving parts and you don't have to, for example, manage yet another database. At Datadog, one of the reasons that using the registry for storing signature metadata was a particularly good idea for us is that it allowed us to build on an existing replication platform. We run a number of isolated data centers, and by design there are very few systems that are able to transfer data between these data centers. Notably, one of those is the registry, because we need to be able to build an image in the build data center and transfer that image to all the runtime data centers where it will eventually run. By pushing signature metadata to the same registry, we get to reuse the same replication mechanism to distribute those signatures everywhere the verifiers will need them. It's important to note, though, that taking advantage of this doesn't come for free: because we're replicating signatures to all the same places that images will run, we're doubling both the replication load and the number of stored images. So you definitely need to be mindful of any registry quotas you may need to comply with before you start signing images.

We've discussed storage, but what do the signatures actually look like? The format that we developed at Datadog is loosely based on Sigstore's Cosign format, but before we dive into the specifics, I want to give a quick caveat on why we ended up building our own. It's almost always best practice to use an open standard when it comes to cryptographic designs, especially if you need interoperability, such as if you're signing images that will be distributed in an open source setting. However, at the time we implemented all of this, Cosign was still pretty new and there was no reigning standard yet for signing images. The sign and verify operations of Cosign at the time were only available in a CLI, and we knew we would need to override a lot of the low-level functionality of those operations for compatibility with our internal systems. I know there's a lot of ongoing work in the Sigstore community to build low-level libraries that expose these components in a modular way, so if we were to start again today, it's pretty likely that Cosign would have worked for us out of the box.

I'm going to describe the signature format from the inside out. At the very core is the signature payload, and this is what we're trying to protect. The format that we ended up using is the OCI descriptor, which contains the digest of the signed image, the media type, and the content size.
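As a sketch, a payload along these lines can be modeled in Go. The field names follow the OCI content descriptor spec; the annotations map is covered next, and the specific annotation keys Datadog uses are internal, so everything here is illustrative.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Descriptor mirrors the OCI content descriptor used as the signature
// payload: it pins the exact image being signed.
type Descriptor struct {
	MediaType string `json:"mediaType"`
	Digest    string `json:"digest"`
	Size      int64  `json:"size"`
	// Annotations carry additional signed claims (timestamp, signer
	// identity); the real keys are internal, so none are shown here.
	Annotations map[string]string `json:"annotations,omitempty"`
}

func main() {
	payload := Descriptor{
		MediaType: "application/vnd.oci.image.manifest.v1+json",
		Digest:    "sha256:0123456789abcdef0123456789abcdef0123456789abcdef0123456789abcdef",
		Size:      1234,
	}
	b, _ := json.Marshal(payload)
	fmt.Println(string(b)) // this JSON is the exact byte sequence that gets signed
}
```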
Additionally, there's a key-value map available called annotations, and we use that to store, and also sign, the timestamp as well as a number of claims about the identity, the internal service identity, of the signer. That payload is signed using the Ed25519 algorithm in HashiCorp Vault, and the signature is wrapped up in a signing envelope using the Dead Simple Signing Envelope (DSSE) spec. This is a spec that was developed for the TUF and in-toto projects, also CNCF projects, and what it gets us is a data structure that combines the payload with a number of signatures, as well as a standard spec and ready-to-go libraries for building these envelopes and verifying them.

To store these envelopes as a signature artifact that we can push to the registry, we combine them into an OCI image manifest where each layer represents one envelope for one key, and we annotate each layer with the key ID, which is just a fingerprint of the public key, for easy lookup during verification. So that's essentially the data structure that takes the place of a container image, and we need to push it to the registry. The location we push that signature artifact to is a transformation of the location of the artifact being signed: we add a prefix to the repository path, and we convert the SHA-256 digest into a tag (so, purely as an illustration, an image at registry.example.com/app@sha256:abc... might map to something like registry.example.com/<signature-prefix>/app:sha256-abc...). This is a little bit different from the approach that Cosign takes, in that we add this repository prefix. What this gets us is a dedicated registry quota on these signature images, independent of the quotas imposed on the signed images. This gives us just a little bit of isolation in terms of the volume of images pushed versus signatures pushed.

To produce these signatures, we've chosen to encapsulate signing in a service. At a high level, we're pushing the handling of keys and signature metadata behind an RPC service that is more hardened than CI is. Jobs in CI that are building and signing images use a thin client to send an authenticated RPC to the signing service over TLS. The service responds by building the payload, using HashiCorp Vault to actually sign it, wrapping it up in the metadata format I just described, and pushing that back to the registry.

The thin client that CI jobs actually use is a CLI that we call ddsign. It looks pretty similar to Cosign, if you've used that, but we've designed it to be simple, to have an extremely stable API, and to be easy to integrate into many CI pipelines. So there aren't many options to configure. Essentially, to use it, you just call "ddsign sign" and then the reference of the image you'd like to sign, including a digest, so you're signing exactly one image. We also have helpers for specific types of image builds to make integration especially easy. For Docker-built images, we have an option that pulls the digest directly from the local Docker daemon in CI, so you don't need to write complicated shell scripts to do that. And we have a custom Bazel rule that we use to sign Bazel-built images.

This approach of having a signing service and a thin client in CI has worked really well for us, and I'm gonna touch on a couple of the reasons why we particularly like it. First, compared to CI jobs signing directly with HashiCorp Vault, we have much better control of key usage and we get richer audit logs. On the slide, I have an example of what it might look like if we were signing directly in CI with HashiCorp Vault, without the signing service.
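The slide itself isn't reproduced here, but a rough reconstruction of direct signing against Vault's transit secrets engine might look like the following Go sketch; the mount path and key name are made-up placeholders, not Datadog's real configuration.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"os"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// A CI job signing directly against Vault's transit engine,
	// with no signing service in between.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		panic(err)
	}
	client.SetToken(os.Getenv("VAULT_TOKEN"))

	payload := []byte(`{"digest":"sha256:..."}`) // the signature payload

	// Vault only ever sees an opaque base64 blob here, which is why
	// its audit logs can record no more than an HMAC of the input.
	secret, err := client.Logical().Write("transit/sign/ci-signing-key",
		map[string]interface{}{
			"input": base64.StdEncoding.EncodeToString(payload),
		})
	if err != nil {
		panic(err)
	}
	fmt.Println("signature:", secret.Data["signature"])
}
```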
And the Vault audit logs, although they're solid, only contain an HMAC of the signed data in terms of the payload, which is opaque and hard to interpret. We can't, for example, filter these logs after they're collected based on the signed artifact reference or anything like that. In contrast, when the key usage and metadata handling is pushed behind a service, we can apply least-privilege principles to the CI jobs that are requesting the signing. This means they don't need direct signing access in Vault, and they don't need direct access to push signature metadata to the registry. They only need to express their intention to sign the artifact to the signing service, and then the hardened signing API takes care of the rest. As an example of the kind of audit logs we can get out of this, we have identifiers available for the CI jobs that are requesting the signing, and also parsed-out fields of the signed artifact for easy filtering.

Another benefit of signing in a service is that it encapsulates most of the complexity of signing behind a single deployment that we can dynamically update. Updating clients in CI is an incredibly tedious and time-consuming process, especially when you consider that there are many branches, some of which might be behind the trunk and can't be updated directly, et cetera. So we really don't wanna have to update clients in CI frequently.

As an example of a big change we were able to make without touching clients in CI at all, we introduced a deduplication feature in image signing. We were looking for a way to reduce the load we were putting on the registry by signing images and pushing that metadata, and we found that only about 3% of the image signing requests we were receiving were for brand-new images. We think this is because our reproducible builds were rebuilding the same image at different points in time. So we changed the logic of handling an image signature request to first attempt to verify existing signatures against the current key set, and only if that verification fails, produce another signature (I'll show a sketch of this logic in a moment). By introducing this logic, we were able to divert most of the registry load to the read path, which is much less expensive. This was a huge performance and reliability benefit for image signing.

Additionally, in this design, key management is completely transparent to clients, so you can do things like rotate keys without updating CI. To some extent, Vault provides this abstracted key management, but we found that we needed to implement several features on top of what's available in Vault, for example signing with multiple keys in one signing request.
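Returning to that deduplication logic, here's a minimal sketch of the request-handling shape; the interfaces and function names are hypothetical stand-ins for internal components, not Datadog's actual code.

```go
package signing

import "context"

// Signer and Registry are hypothetical stand-ins for the internal
// Vault-backed signer and the OCI registry client.
type Signer interface {
	Sign(ctx context.Context, imageDigest string) ([]byte, error)
}

type Registry interface {
	// FetchEnvelopes returns any previously pushed signature envelopes
	// for the image, keyed by key ID (public key fingerprint).
	FetchEnvelopes(ctx context.Context, imageDigest string) (map[string][]byte, error)
	PushEnvelope(ctx context.Context, imageDigest string, envelope []byte) error
}

// HandleSignRequest deduplicates signing: it first tries to verify an
// existing signature against the current key set (a cheap registry
// read) and only signs again if that fails. Since only ~3% of
// requests were for brand-new images, most requests stop at the
// read path.
func HandleSignRequest(ctx context.Context, reg Registry, s Signer,
	currentKeyIDs map[string]bool, verify func(envelope []byte) bool,
	imageDigest string) error {

	envelopes, err := reg.FetchEnvelopes(ctx, imageDigest)
	if err == nil {
		for keyID, env := range envelopes {
			if currentKeyIDs[keyID] && verify(env) {
				return nil // already signed with a current key; nothing to do
			}
		}
	}

	// No valid signature for the current key set: sign and push.
	env, err := s.Sign(ctx, imageDigest)
	if err != nil {
		return err
	}
	return reg.PushEnvelope(ctx, imageDigest, env)
}
```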
So that's about all I have time to say about image signing; now onto verification. If you have to choose only one point in your software supply chain to verify signatures, there's a trade-off you need to consider. The earlier you verify an image, the faster you're able to get developers feedback: you can, for example, stop a deploy before it even starts if there's a benign image signature verification issue, rather than stopping mid-deploy and requiring a rollback. In contrast, the closer to runtime you verify a signature, the better security properties you get, because the integrity of that image is guaranteed over more systems and the trusted compute base after verification is much smaller. So ideally you would verify in multiple spots to get the best of both worlds: fast developer feedback as well as better security.

The typical recommendation you'll see in open source for image verification is to do it in the Kubernetes control plane using admission controllers. But at Datadog, we made the choice to verify at the node level, within the container runtime, with supplemental pre-deploy checks for that fast developer feedback.

To explain why we opted against using Kubernetes admission webhooks for image verification, we need to understand how they work. Admission webhooks basically allow you to write custom pieces of code and run that code to determine whether creating, updating, or deleting Kubernetes API resources is allowable. There are several kinds of admission webhooks, but the typical choice for this kind of problem is the validating admission webhook. The important thing to realize is that the latency of the code you run in a webhook is added to the baseline latency of handling an API server request, because the API server blocks on the webhook returning its response before returning its own response. In the large clusters we run at Datadog, we're really careful about avoiding introducing back pressure on the API servers, so the latency goal we have for admission webhooks is about 10 milliseconds at the P99. Unfortunately, image signature verification is an online, or somewhat online, process, since it has to talk to the registry, and that's quite hard to fit into that latency budget. In practice, the latencies we see for image verification are closer to 200 milliseconds at the median. This on its own is enough for us to rule out admission webhooks as a solution here, but another disadvantage of the approach is that using the registry for signature metadata introduces a new cluster-level dependency: whereas previously the registry was only a dependency at the node level for pulling images, it's now on the hot path of the control plane. And that seriously changes its reliability properties.

One workaround you could consider here is using a different kind of admission webhook, the ImagePolicyWebhook, which has built-in caching and retry features, so it would, for example, patch around intermittent verification failures. But we're not very comfortable relying so heavily on an essential cache like this to meet our baseline performance goals, because if that cache were cleared for any reason, the API server could be put into a state of metastable failure, where it's unable to keep up with the request load necessary to rebuild that cache and get back to normal operation.

So the alternate design we chose was to verify image signatures in containerd. As a refresher on the architecture here: containerd sits one level below the kubelet and receives commands from the kubelet, like "create container" or "start container," over the Container Runtime Interface, or CRI. containerd is the process ultimately responsible for resolving an image digest, pulling that image from the registry, and unpacking it to disk. To actually run the image, containerd defers to a lower-level runtime like runc over a shim layer. So the obvious place to integrate image verification into this flow is right after resolving the digest and right before pulling the image.
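To give a feel for that integration point, here's a hypothetical Go sketch of a pull-time verifier hook. As noted below, the upstream containerd plugin API was not finalized at the time of this talk, so none of these type or function names should be taken as the real interface.

```go
package verifier

import "context"

// Judgement is a hypothetical verdict from an image verifier plugin:
// whether the pull may proceed, plus a human-readable reason that can
// be surfaced to developers through pod events.
type Judgement struct {
	OK     bool
	Reason string
}

// ImageVerifier sketches the hook the runtime would call after
// resolving a reference to a digest and before pulling the image.
type ImageVerifier interface {
	VerifyImage(ctx context.Context, ref string, digest string) (Judgement, error)
}

// pullImage shows where the hook sits in the pull flow. A negative
// verdict becomes an ordinary image pull error, which bubbles up to
// the kubelet.
func pullImage(ctx context.Context, v ImageVerifier, ref, digest string,
	pull func(ctx context.Context, ref string) error) error {

	j, err := v.VerifyImage(ctx, ref, digest)
	if err != nil {
		return err
	}
	if !j.OK {
		return &VerificationError{Ref: ref, Reason: j.Reason}
	}
	return pull(ctx, ref)
}

// VerificationError is a hypothetical error type for rejected pulls.
type VerificationError struct {
	Ref    string
	Reason string
}

func (e *VerificationError) Error() string {
	return "image verification failed for " + e.Ref + ": " + e.Reason
}
```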
Verifying at this point is pretty much as close to runtime as we can get, so we've minimized the trusted compute base after signature verification. At Datadog, we truly believe that the container runtime is the most appropriate place to do image verification, so we're pretty excited to see momentum pick up around this discussion in open source. For example, the CRI-O and Kata container runtimes, I think as of recently, both have support for verifying Sigstore signatures on pull. But at Datadog, we use containerd, so we've taken the initiative to contribute these features upstream.

The basic approach is to add an image verification plugin system to containerd, so users of containerd can supply custom bits of code that containerd calls to determine whether pulling an image is okay. If the plugin returns a response saying the image is okay, that the signature is verified, for example, containerd continues the image pull as usual. Whereas if the plugin returns a response saying the image is not okay, containerd bubbles up an image pull error to the kubelet. Because the kubelet sees an image verification error as a type of image pull error, we benefit from all the features in Kubernetes for pulling images reliably. For example, at the node level, we have the kubelet's image pull retry loop, which in our case retries to patch around intermittent verification failures. And we also have the pod-level feature of image pull policies, which allows us to cache image verifications at the node level, so if a container has to restart, for example, you don't need to do a second image verification. Additionally, the latency concerns we had in the context of admission webhooks don't apply here, because image pulls are expected to be slow (images are quite large sometimes), and all of these systems are built with that in mind.

An important thing to note, though, is that the only reason we're able to pursue this architecture at Datadog is that we're self-hosting Kubernetes, so we have access to all these low-level components on the node. If you wanted to implement an architecture like this on a managed Kubernetes cluster, cloud providers would need to expose this kind of image verification configuration at a higher level, but we're pretty hopeful that the industry will move in this direction eventually.

We've been running a temporary fork of containerd with an implementation of this idea, and we're working with the maintainers of containerd to get these features into the 2.0 release. If you're interested in following along, I have the link to the tracking issue here. There are some PRs there, but just note that the implementation details are not yet finalized.

This slide shows the developer's perspective on the image verification system. If you try to run a pod where all the images are signed, everything works as usual. Whereas if there's any unsigned image in a pod, the pod is put into an image pull error state, and developers can get full details on the issue using pod events. Because this is a new and potentially confusing error for developers to see, we've made sure to make the error messages as friendly as possible, and we've also included an inline wiki link for support and escalation.

The last thing we need to talk about for image verification is how we're distributing the config for verification to the node. In order to verify an image signature, the containerd plugin needs several pieces of information.
First, it needs a trusted public key set, and this is dynamic if you consider key rotations. Second, it needs a verification mode: we need the ability to put the system into either an audit mode, where it's only checking but not blocking anything; a blocking mode, where it rejects images that aren't signed; or a disabled mode, which turns the entire system off. And we'd like to configure this at a relatively granular level, so we can be in different modes in different parts of our infrastructure. Finally, we need to distribute an image digest revocation list. Because our signatures don't have expirations, we need to be able to revoke them in some way. Typically we would prefer to revoke image signatures with a public key revocation, but that's a bulk operation, since a public key is used across many signatures, so if we only have a handful of images whose signatures we want to revoke, we'd prefer a revocation list for that use case.

Our requirements for distributing this configuration are mostly guided by reliability: we don't wanna introduce any new node-level dependencies, we need the ability to roll out this configuration in a slow and staged manner, and we'd like to have multiple fallback mechanisms for distributing it, for resilience.

To distribute the public keys and verification mode, we've taken a layered approach. We bake a set of defaults into the node image, and this is kept relatively fresh by our automation for building and rolling out new node images. In order to get faster updates than that, we also have a dynamic update system that runs on each node: it periodically pulls a ConfigMap that sits in each cluster and mirrors it to disk. We're able to roll out this ConfigMap slowly, just like we would roll out any application to multiple clusters, for slow, incremental updates to this configuration. Finally, we have an override config layer on disk that takes precedence over the dynamic update mechanism. This allows us to do pretty much zero-dependency overrides if we need to. In general, this layered approach allows us to roll out config at essentially any layer of the stack, so we can continue to operate the system under a wide variety of incidental constraints (I'll show a quick sketch of this precedence in a moment).

The approach for distributing image revocation lists is actually a bit simpler: we simply bake the list into the machine image on build and forgo any dynamic update mechanism. Just like the baked-in defaults for the verification mode and public keys, this is kept relatively fresh by our node lifecycle automation. The interesting thing to note here is that dynamic updates for this revocation list wouldn't actually be useful even if we implemented them, and the reason is that containerd caches pulled images on disk. So if we wanted to purge our infrastructure of a single image, we'd need to not only make sure we don't pull it again on new nodes, but also remove it from all existing caches. In practice, our approach here is to prioritize draining (moving workloads off of) and then deleting any nodes that have run the revoked image at any time in the past.
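Here's that precedence sketch in Go, walking the layers in order: manual override, then the mirrored ConfigMap, then baked-in defaults. The file paths, config shape, and fallback behavior are all hypothetical placeholders, not Datadog's real locations or policy.

```go
package config

import (
	"encoding/json"
	"os"
)

// VerifierConfig is a hypothetical shape for the node-level
// verification config: trusted keys plus a mode of "audit",
// "blocking", or "disabled".
type VerifierConfig struct {
	Mode       string   `json:"mode"`
	PublicKeys []string `json:"publicKeys"`
}

// Load walks the config layers in precedence order. Each layer is
// just a file on disk, so reading the config never requires any new
// node-level dependency.
func Load() (VerifierConfig, error) {
	paths := []string{
		"/etc/image-verifier/override.json", // zero-dependency manual override
		"/etc/image-verifier/dynamic.json",  // mirrored from the in-cluster ConfigMap
		"/etc/image-verifier/defaults.json", // baked into the node image
	}
	var cfg VerifierConfig
	for _, p := range paths {
		b, err := os.ReadFile(p)
		if err != nil {
			continue // layer absent; fall through to the next one
		}
		if err := json.Unmarshal(b, &cfg); err != nil {
			continue // unreadable layer; fall back (a sketch-level choice)
		}
		return cfg, nil
	}
	// No layer present at all; this sketch falls back to disabled,
	// though the real failure policy is an operational decision.
	return VerifierConfig{Mode: "disabled"}, nil
}
```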
So we're nearing the end of the talk, and I wanna close out by discussing some of the challenges we've encountered rolling out this system and recommendations we have for others heading down this path.

I'll say the most significant challenge we faced in rolling out image verification is that you can't easily configure the verification mode by Kubernetes namespace. At the cluster level, namespaces are handy boundaries to separate different tenants in the same cluster, but because a node can run containers for multiple namespaces at once, we don't really have this boundary available to us. What this means is that you might need to sign all the images in a given cluster before you can turn on blocking verification mode, and this can be quite a challenge, especially in a large multi-tenant cluster, simply because there are more images you'd have to sign. What we found here is that having dedicated clusters for more sensitive types of workloads not only has the obvious security benefit of extra isolation, but also allows you to more easily sign all of the images in the cluster, because there are fewer of them, and then you can go from audit mode to blocking mode in that cluster sooner.

Second, in an organization with a lot of diverse CI configuration, it's a lot of work to globally add a new signing step to all of those builds, even if the integration in each one is relatively simple. What we found here is that monorepos and consistent build tooling like Bazel make it quite easy to make sweeping changes like this. My recommendation in general for rolling out image signing is, one, to leave ample time for it, and two, to simultaneously roll out audit-mode signature verification. This not only allows you to develop operational experience with the verification system sooner, but also provides an additional source of telemetry for which images are signed where.

Finally, node-level image verification is relatively uncharted territory, so it's been a challenge to develop new techniques here. That said, we truly believe that the reliability benefits are well worth the effort, and we're really excited to guide these features into containerd 2.0 so they're more readily available to everyone.

To close out, three takeaways. First, evaluate whether encapsulating image signing in a hardened service is worth it for the security and scalability benefits. This is very likely not the right decision for everyone, especially if you're only building images in a few places, but in Datadog's CI environment, there's no question that this was the right choice. Second, think critically about using admission controllers for complex verification processes like image verification. At scale, you may come to the same conclusion that we did at Datadog and prefer the properties of image verification in the container runtime. Lastly, these verification features are not yet merged into containerd, so if you're a user of containerd and you think you might use a feature like this, I welcome you to join the conversation at the link on this slide and give us your input. That's all I have. Thank you for your time.