Hello, it's great to be joining SigstoreCon 2022. My name is Ethan Lowman, and the title of this talk is Who's Verifying Your Signatures? Approaching Private Container Image Signing. In this talk, I'll discuss what we've learned at Datadog as we've begun signing our internal container images.

Datadog, in case you're not familiar, is an observability and security platform collecting tens of trillions of events per day from millions of hosts. To provide this service, we run a lot of container images internally. We have over a thousand employees in our engineering organization pushing containers daily to Kubernetes, where we run hundreds of thousands of pods in dozens of clusters across multiple cloud providers. With such a substantial Kubernetes footprint and a diverse set of build and release pipelines delivering software to these clusters, internal container image integrity is a special security focus of ours.

Today, I'm here representing Datadog. I'm a senior software engineer on a team called Software Integrity and Trust, where our mission is to secure our internal and customer-facing software supply chains. To support some of this work while contributing back to open source, I'm also one of the maintainers of the go-tuf library, a Go implementation of The Update Framework. This project is one of the foundational components of Sigstore's cryptographic root of trust. Some of my interests are cryptography, cooking, and cycling.

In this talk, we'll begin by briefly discussing why you might want to sign and verify private images running on your internal infrastructure. Then I'll give a survey of the current and past technologies available for image signing, with a focus on how verification works. I'll discuss how Datadog evaluated these technologies and how we ultimately chose to sign and verify images internally. Hopefully you'll take away things you may want to consider when choosing how to sign and verify your own images, especially in an internal context.

To begin, why should you sign and verify container images internally? Just like in the open source setting, verifying an image signature gives you high confidence that the image comes from a trusted source, such as your CI or build environment, and that it has not been modified before it runs.

Here, I have a very basic model for an internal software supply chain targeting a Kubernetes runtime. First, developers write code and push that code, along with its dependencies, to a source code management platform like GitHub. At some point, continuous integration, or CI, picks up the code and builds binaries, container images, and other artifacts from it. Container image, or OCI, build artifacts are pushed to container registries for later use. Then, when the code change is ready to go live, continuous deployment creates or updates Kubernetes API resources to reference the updated container images in configuration. These workloads are managed by the Kubernetes control plane and are eventually scheduled as pods on Kubernetes nodes, where the container images we built are finally pulled and run.

Internal software supply chains like this one are composed of many parts, and your trust in each of these parts may vary. For example, some parts may be more exposed to human operator access, which can be a source of weakness in the supply chain if an employee laptop is compromised. Signing and verifying images in your internal software supply chain allows you to reduce your trust in each of these software delivery components.
That way, your production infrastructure and data are not automatically at risk if one component of your delivery pipeline is compromised. For example, if you're signing container images, you're substantially protected against a compromise of the container registry.

Ideally, you sign images in CI as soon as possible after they're created, using keys available only to CI. This signature then represents an attestation of provenance, authenticating that the image came from CI. Then, every time the image is used or referenced later on in the software supply chain, verifying the image signature allows you to reduce trust in each of the components in between. The general idea is that the closer to runtime you verify image signatures, the fewer components of your software supply chain you need to fully trust.
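To make that split concrete, here's a minimal sketch using Go's standard ed25519 package; the digest string is just an example payload, and in practice the private key would live in a KMS or Vault rather than in process memory. The point is the asymmetry: only the signing side (CI) ever holds the private key, while every later component only needs the public key to check integrity.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

func main() {
	// Signing side (CI): generate or load a signing key and sign the
	// image digest as soon as the image is built.
	pub, priv, err := ed25519.GenerateKey(rand.Reader)
	if err != nil {
		panic(err)
	}
	payload := []byte("sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")
	sig := ed25519.Sign(priv, payload)

	// Verifying side (as close to runtime as possible): only the public
	// key, payload, and signature are needed; nothing in between has to
	// be trusted for integrity.
	fmt.Println("verified:", ed25519.Verify(pub, payload, sig))
}
```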
Now, once you've decided to sign and verify images to protect your build and release pipelines, what tools are out there to start signing? Sigstore's Cosign tool is of course one option, but I think it's useful to understand the context around Sigstore so you can make an informed choice about how to sign images for your particular situation. The reason for this, and the thesis of my presentation, is that there's no one right way to sign an image. A signature on its own is of very little value. The real value is in verifying the signature, and in doing so at the right points in time. Therefore, the best choice for you will heavily depend on who or what service is verifying your image signature, and on other contextual factors. This means that if you're signing both internal images and open source or customer-facing images, your strategy might look different for the two situations.

So with that in mind, we're going to survey the tools out there for signing images. For each of the technologies, we'll discuss at a high level how it works, how it fits with Datadog's internal image signing and verification use case, and when it might be a good fit for other use cases.

Docker Content Trust was one of the first systems introduced for signing container images. It was released by Docker back in 2015 and backed by an architecture called Notary V1. Notary V1 is based on The Update Framework, which is a model for building software update systems that are resistant to a variety of nuanced attacks. Notary's architecture consists of an API server and database that sit next to the container registry. Notary was a good start for image signing, but there were some issues with the design and implementation that hindered adoption. However, the project yielded a lot of real-world lessons to inform future designs. For example, one of the shortcomings of Notary is that the signatures live outside the registry, in a sidecar service that manages its own root of trust. This means that it's not easy to copy an image and its signature between registries and have the signature still be valid in the new registry. Because using Docker Content Trust requires the Notary service in addition to the registry itself, and because Notary wasn't implemented by most cloud registries, you're also severely restricted in which registries you can use to host or sign images.

For our internal image signing use case at Datadog, this is the main reason why Notary V1 wouldn't make sense for us. In order to deploy to multiple independent data centers, each with many Kubernetes clusters, we have a system that replicates images across cloud registries. We may build and sign an image in just one registry, but to support verifying the image in every data center where it runs, we need to be able to verify the image from any registry it's copied to. So for Datadog, ease of signature replication is one of the top considerations for how we sign images, and Notary V1 does not have good support for this.

The lesson here is that if you can use the registry as the storage backend for image signatures, you can sign and verify images in any registry without requiring any changes from the registry provider. Additionally, if you separate your root of trust from the registry, it's a lot easier to move signatures between registries to support modern multi-registry deployment patterns. So in summary, Notary was useful in that it informed future developments, but it's probably not the right choice for a new project.

The Grafeas project, which was introduced by Google in 2017, is composed of a suite of tools that are meant to provide a general-purpose solution for auditing and governance of the software development lifecycle. It aims to tackle more than just container image signing. Grafeas is an artifact metadata API that represents structured events that occur throughout the software development lifecycle. These events could be things like the result of a vulnerability scan for a container image, or an attestation that a built artifact came from a trusted system. These events can optionally be signed using PGP to create a cryptographic attestation that the event occurred. This kind of attestation is a bit more flexible than what a basic image signature provides, in that it conveys not just whether the image is authentic, but that certain steps were taken in creating the image. Some of the cost of this flexibility is paid by the verifier of the image: rather than just checking a boolean status (is this image okay or not?), the client needs to verify a more detailed policy, potentially checking several attributes of the image. The metadata in Grafeas is intended to be verified using Kritis, which is a Kubernetes admission controller that verifies customizable policies about your images before admitting workloads to the cluster.

So how do you evaluate whether Grafeas is a good choice for signing container images? This largely depends on how you intend to verify the signatures. If you're releasing images publicly and your customers will be verifying your signatures on their own infrastructure, Grafeas is probably not the right choice. It doesn't provide any mechanism for public key discovery, and as far as I'm aware, the API is not meant to be exposed outside of internal infrastructure. However, if you're using Google's managed Kubernetes product, GKE, Grafeas might be easy to adopt, since GKE has first-party integrations with Grafeas and Kritis. One final thing to note is that Grafeas uses PGP signatures, which present a number of problems, both in ergonomics and in security. For example, you probably won't be able to use any cloud HSM to securely store and manage PGP signing keys. Modern cryptography can be much safer than PGP, so it's likely not the right choice for new projects.

At Datadog, we decided to pass on Grafeas for several reasons. First, we wanted to use better cryptography than PGP, ideally something that would be supported by HashiCorp Vault's transit secrets engine. Additionally, we preferred using the registry as the storage backend for signature metadata, since, as I mentioned when discussing Notary V1, this makes cross-data-center replication work the same way we already distribute images; it's just one less service to operate. Finally, we manage our own Kubernetes clusters across multiple cloud providers, so we wouldn't benefit from the GKE integration.
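To illustrate that first reason, here's a minimal sketch of signing a payload through Vault's transit secrets engine with the official Go client. The transit key name `image-signing` is hypothetical, and the client picks up VAULT_ADDR and VAULT_TOKEN from the environment; the key property that matters is that the private key never leaves Vault.

```go
package main

import (
	"encoding/base64"
	"fmt"
	"log"

	vault "github.com/hashicorp/vault/api"
)

func main() {
	// DefaultConfig reads VAULT_ADDR; the token comes from VAULT_TOKEN.
	client, err := vault.NewClient(vault.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	payload := []byte("sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855")

	// The transit engine signs base64-encoded input with the named key.
	// "image-signing" is a hypothetical transit key, not a real path.
	secret, err := client.Logical().Write("transit/sign/image-signing",
		map[string]interface{}{
			"input": base64.StdEncoding.EncodeToString(payload),
		})
	if err != nil {
		log.Fatal(err)
	}

	// Signatures come back as "vault:v1:<base64>"; the version prefix
	// records which key version signed, which helps with rotation later.
	fmt.Println(secret.Data["signature"])
}
```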
The Notary V2 project is an evolution of Notary V1, begun in 2019, aiming to fix the issues discovered in the Docker Content Trust rollout. Notary V2 supports signing images using X.509-based PKI. The signature metadata is stored in the registry alongside the signed images. To link signature artifacts with the images they sign, Notary V2 uses a new part of the OCI spec called referrers. This is probably a good choice in the long term, because it lets the registry know about the dependency between the signature and the image, which makes lifecycle operations more reliable. However, because the referrers API is so new, it lacks widespread registry support. Notary V2 is still in the alpha stages; it's looking promising, but it's not production ready yet.

So what's standing in the way of using Notary V2 in production? Echoing the theme of this whole talk, that really depends on the verification context. In a public context, Notary V2 is still missing a certificate authority or a mechanism for public key discovery. So while it may be possible to perform the signing and verification operations, it's unclear how users of your images are supposed to know which public keys to use for verification. In a private context, like in a Kubernetes admission controller, Notary V2 could work without the complexity of a certificate authority. With the ability to control all the verification clients, you may be able to configure each of the verifiers directly using self-signed certificates. However, there's no off-the-shelf solution I'm aware of to verify Notary V2 signatures within an environment like Kubernetes, so this is something that would still need to be implemented.

Before we get to talking about Cosign and the rest of Sigstore, I want to quickly mention how The Update Framework fits into this picture. The Update Framework, abbreviated as TUF, is a framework for building secure software update systems. It includes a number of freshness and snapshot consistency measures that make updates resistant to a variety of subtle attacks. For example, it protects against rollback attacks, in which an attacker rolls back software versions to a previously valid, signed version that had some exploitable vulnerability. Notary V1 was based on TUF, and Sigstore even uses TUF for certain aspects of key management. However, it's conceivable that you could use TUF for the whole image signing and verification process. At Datadog, we gave this an earnest try and implemented a proof of concept, but ultimately we found that there was a mismatch in the threat model. TUF's snapshot feature, which is meant to make sure a client has a complete and up-to-date set of trusted files and no more, doesn't easily apply to image registries. I won't go into too much detail here, but we found that snapshotting brought a lot of complexity that didn't deliver concrete value we could map to our internal registry threat model. However, the idea still has potential with more work. In the Notary project, there's an ongoing effort to adapt TUF to image registries, but this is still in the proof-of-concept phase.

So this brings us to Cosign and the rest of Sigstore. Cosign signatures are conceptually pretty close to Notary V2's. The signing format and payload are slightly different, but like Notary V2, the signature is written to the registry. Cosign loosely couples the image and its signature using a predictable tag naming pattern, which gives it broad registry support.
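A minimal sketch of that tag convention: Cosign derives the signature's location from the image's digest, so finding a signature requires no extra index, API, or registry feature beyond an ordinary tag in the same repository.

```go
package main

import (
	"fmt"
	"strings"
)

// sigTag maps an image digest to the tag where Cosign stores its signature:
// "sha256:<hex>" becomes "sha256-<hex>.sig" in the same repository.
func sigTag(digest string) string {
	return strings.ReplaceAll(digest, ":", "-") + ".sig"
}

func main() {
	digest := "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	fmt.Println(sigTag(digest))
	// sha256-e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855.sig
}
```

Because this is just a tag naming convention, any OCI-conformant registry can store Cosign signatures without changes; the trade-off is that the registry doesn't know the signature and image are related, unlike with the referrers API.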
Sigstore sets itself apart from the competition in open source image signing by including an easy-to-use certificate authority called Fulcio, which issues short-lived certificates to signers based on their OIDC identity. Rekor makes the signatures auditable and verifiable past the certificate expiration, so there's no need to re-sign images or rotate keys. As a signer, in cases where your CI has an OIDC workload identity, like in GitHub Actions, this means that without any key management of your own, you can use a short-lived certificate that is cryptographically tied to a CI job. On the verifier side, this also lends itself really well to public key discovery, since if you trust Sigstore's roots, you just have to verify an email address instead of keeping track of public keys for each software publisher. These components will be covered in more detail throughout the conference today, but together, Fulcio and Rekor make Cosign a really ergonomic choice for signing and verifying open source images.

However, it's a little more nuanced if you're signing for internal verification. For internal signature verification, organizations may prefer to have a private root of trust rather than using Sigstore's. Additionally, using Rekor does, by design, make certain information about your usage public, such as how many artifacts you're signing. This may not be a problem for you, but if it is, then you'll need to implement your own key management for Cosign. That means you'll need to think about key lifetimes and rotation, how you'll re-sign images already running on your infrastructure to keep them valid, and how you'll securely distribute public keys to all of your verification clients. So, as always, there are trade-offs. If you do decide to use Cosign, how you use it will largely depend on who or what services you expect to be doing the verification. And it's crucially important that you think about how verification will work, and that you make it easy, because an image signature that's not actually verified is of little value.

So, how are we using Sigstore at Datadog today? We're trying out bits and pieces already, like recording TUF timestamps to Rekor in the Datadog Agent integrations pipeline. We're also planning on signing our open source images using Cosign once certain key management features become available for large open source projects. We'd like our customers to be able to verify our image signatures with minimal configuration of their Cosign client, and there are some features in the works to help accomplish this, like Sigstore trust delegations.

For internal image signing, which has been our primary focus, our approach looks very different. When we began this project earlier this year, we decided that running Fulcio and Rekor internally was hard to motivate. Our internal service identity system isn't compatible with Fulcio's supported identity providers, so using the Fulcio CA internally wouldn't make sense; it's easier for us to manage trusted keys directly. Additionally, running a transparency log like Rekor internally didn't make sense for us. In our situation, there are easier means available if we find we need internal tamper-proof logging. For example, we could use an S3 bucket with the write-once-read-many (WORM) setting enabled, which makes the bucket effectively immutable, with no need for complex auditing setups. This kind of immutable logging probably wouldn't work as well in an open source setting, though. It's just an example of how you can sometimes simplify your approach internally to reduce complexity and operational burden without losing functionality.
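As a rough sketch of that WORM idea, here's what applying a default retention policy with S3 Object Lock could look like using the AWS SDK for Go; the bucket name is hypothetical, and Object Lock must have been enabled when the bucket was created. In COMPLIANCE mode, objects can't be overwritten or deleted by anyone until retention expires.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/s3"
)

func main() {
	sess := session.Must(session.NewSession())
	svc := s3.New(sess)

	// Apply a default write-once-read-many retention rule to the bucket.
	// "signing-audit-log" is a hypothetical bucket created with Object
	// Lock enabled; new objects become immutable for the retention period.
	_, err := svc.PutObjectLockConfiguration(&s3.PutObjectLockConfigurationInput{
		Bucket: aws.String("signing-audit-log"),
		ObjectLockConfiguration: &s3.ObjectLockConfiguration{
			ObjectLockEnabled: aws.String("Enabled"),
			Rule: &s3.ObjectLockRule{
				DefaultRetention: &s3.DefaultRetention{
					Mode: aws.String("COMPLIANCE"),
					Days: aws.Int64(365),
				},
			},
		},
	})
	if err != nil {
		log.Fatal(err)
	}
}
```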
We also decided that running a timestamp server like Rekor internally didn't make sense for us. This is because the design of timestamp servers makes the most sense when they're run by a trusted third party, truly isolated from the environment where the signing happens. For the basic image signing component, we also opted against using Cosign. Instead, we're using custom signing servers, which I'll explain soon.

Our requirements for signing and verifying images internally were as follows. First, we wanted an easy way to encapsulate signing into an API service in front of HashiCorp Vault, which we use internally for secrets management. Whereas a Cosign client in CI would call Vault directly, we preferred to have only one highly hardened, monitored, and auditable signing service making those requests to Vault. The reason for this is that even though Vault has audit logs and the ability to lock down client permissions, it would be difficult to interpret what payloads a rogue CI job is signing from Vault audit logs alone. So we decided to have an RPC service do the signing with Vault on behalf of authorized CI workloads. This also makes onboarding new repos, and shipping updates, easier, since updating image signing clients across many repos would be a headache if the client changed often. By making the client as simple as possible and deferring most of the work to the signing service, we considerably reduce how often we'll need to update signing clients across many repos.
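To show how thin such a client can be, here's a sketch; the host, path, and request shape are entirely hypothetical, not Datadog's actual API. The client only identifies what to sign, while payload construction, authorization, and the Vault call all live in the signing service, so the client almost never needs to change.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"log"
	"net/http"
)

// signRequest is an illustrative request shape: the CI client sends only the
// image's repository and digest; everything else happens server-side.
type signRequest struct {
	Repository string `json:"repository"`
	Digest     string `json:"digest"`
}

func main() {
	body, err := json.Marshal(signRequest{
		Repository: "build-registry.example.internal/app",
		Digest:     "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855",
	})
	if err != nil {
		log.Fatal(err)
	}

	// Hypothetical internal signing endpoint; in practice this call would
	// carry workload identity credentials for authorization.
	resp, err := http.Post("https://signer.example.internal/v1/sign",
		"application/json", bytes.NewReader(body))
	if err != nil {
		log.Fatal(err)
	}
	defer resp.Body.Close()
	fmt.Println("signing service responded:", resp.Status)
}
```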
To implement this signing service, we could have potentially used Cosign, but at the time of implementation, Cosign was primarily a CLI tool, not meant to be used as a library. However, this is a known issue, and there's already been some work done towards making it easier to use Cosign as an embedded library.

The second requirement for image signing, which is a little more subtle, is that we wanted to avoid creating new image tags in the same repository as the signed image. This is because, at the time, our internal registry retention policies were based on the number of tag digests for each repository. So if we signed every tag, that would double the rate of tag churn, potentially causing images to be prematurely expired. Unfortunately, Cosign doesn't currently allow you to override the logic for locating the signature for a given image. If this were the only reason Cosign didn't work for us, we probably would have addressed this problem on the registry side by improving our retention policies first.

Finally, we wanted to make sure that image verification was highly reliable across many large Kubernetes clusters. It can take up to several seconds to verify an image with Cosign, since it involves many round trips to the registry. We weren't comfortable adding this kind of latency to an admission webhook, due to the back pressure it would put on the Kubernetes API server. So this meant that we wouldn't be using an off-the-shelf admission webhook like Kyverno. Instead, we plan to implement runtime image verification at the node level within containerd, which is the container runtime we use. We're currently working with containerd maintainers to merge these new node-level image verification features upstream. It's worth noting that we're only able to customize containerd in this way because we manage our own Kubernetes clusters; cloud-provider-managed clusters don't offer this level of customization. Companies with smaller clusters probably wouldn't have this reliability problem, and an admission webhook may work just fine for them.

So we evaluated that the functionality we needed from Cosign was going to be dwarfed, in both implementation and maintenance, by all the surrounding key management and containerd customizations. It was simpler for us to write a small library that implements just the signing and verification functionality we need, tightly integrated with our internal infrastructure. The design is very similar to how Cosign works, but with a different payload format and signing envelope, and a different registry structure to work around our tagging issue.

This is a high-level diagram of what our whole system looks like. First, CI publishes an application image to our build registry and sends a request to our signer API with the digest of the built image. The signer API constructs the payload, which is an OCI descriptor with custom annotations, and asks Vault to sign that payload using an ed25519 key. The signer service packages that signature in a dead-simple signing envelope and pushes it to the build registry. In the background, one of our registry systems replicates both the application image and the signature metadata from the build registry to a registry in each production data center.

Zooming in to just one Kubernetes node, we see that containerd is about to run an image. First, it talks to the registry to resolve a digest for that image. Then containerd uses the image verifier plugin system we're contributing upstream to ask our custom verifier plugin whether that image digest is okay to run. Our plugin pulls the replicated signature metadata from the registry and verifies the image signature and payload. If the signature is valid and the signed digest matches the one in the request, the plugin tells containerd to finish pulling and running the image.

We manage key rotation through Vault's built-in key versioning. Valid public keys are baked into the machine image and refreshed from a secure service in each data center. When we issue new keys, we re-sign all images that have recently run in production and have valid signatures with the previous set of keys. This is one of the main choices that allowed us to simplify our key management: we observed that we build and push many more images than are actually running in production at any point in time, so the number of images we have to re-sign with new keys to support key rotation is a manageable size.
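Here's a minimal sketch of the verification check just described: the core logic only, not containerd's actual plugin interface, with hypothetical function names. The signed digest must match the digest containerd is about to run, and the signature must verify under one of the currently trusted keys; holding a set of keys rather than a single key is also what makes rotation windows with overlapping key versions work.

```go
package main

import (
	"crypto/ed25519"
	"crypto/rand"
	"fmt"
)

// verifyImage sketches the core check a node-level verifier plugin performs:
// the signature must verify under some currently trusted public key, and the
// digest recorded in the signed payload must equal the digest being pulled.
// During a rotation window, both old and new key versions are in trusted.
func verifyImage(trusted []ed25519.PublicKey, payload, sig []byte,
	signedDigest, requestedDigest string) bool {
	if signedDigest != requestedDigest {
		// The signature may be valid for some image, just not this one.
		return false
	}
	for _, key := range trusted {
		if ed25519.Verify(key, payload, sig) {
			return true
		}
	}
	return false
}

func main() {
	// Stand-in for keys baked into the machine image and a payload signed
	// by the signing service (here the payload is just the digest itself).
	pub, priv, _ := ed25519.GenerateKey(rand.Reader)
	digest := "sha256:e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855"
	payload := []byte(digest)
	sig := ed25519.Sign(priv, payload)

	fmt.Println(verifyImage([]ed25519.PublicKey{pub}, payload, sig, digest, digest))
}
```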
We're still gaining experience operating this in production, especially the key management components, so we'll have more to share on how this approach works in practice in the coming year.

So this has been Datadog's approach to container image signing, especially for the internal use case. We did a thorough evaluation of all the major image signing projects out there, and Cosign definitely has a lot going for it, especially in the open source setting. However, image signing only has value if signatures are verified. So as you approach container image signing, think carefully about who you're signing for, whether it's an external customer or an internal service. This will likely affect which signing tools you use and how you design your key management strategy. That's all I have for today. Thanks for listening, and I hope you enjoy the rest of SigstoreCon.