Hey folks, my name is Dan Mangum and I work at a company called Upbound, and I'm joined today by John Johnson from Google. First of all, we want to thank you for taking time out of your KubeCon experience, whether you're attending virtually or in person, to attend our talk today. We hope that wherever you are, you are safe and well. Today we are going to spend some time talking about what you have likely heard referred to as container registries. The reason I say likely is because these so-called container registries don't actually have much to do with containers at all. John and I have grown to become fast friends with registries through our work on Crossplane, go-containerregistry, and GCR, and we for one feel like they are a misunderstood bunch. You may think that registries are boring, that they don't know how to have fun, but we are here to tell you that registries are much crazier than you may have thought. You just have to get to know them. Throughout the four acts of this talk, we'll get to know who the registries really are, not just who they are said to be on Twitter and Hacker News. Thanks, Dan. So let's talk about the registry. Most people don't have to think about how a registry works. You docker build, docker push, docker pull, and docker run images. For the most part, that's a good thing. It just works, and the registry is a very boring part of your life. I think that's a little sad. I also think it's not true. So let's look at how the registry sees the world and what happens when you interact with it. First, let's go over all the data structures that a registry has to deal with. It's going to be more or less OCI 101. The fundamental primitive in OCI is what's called a content descriptor. You can see here it's a very simple structure with three fields: a media type, a digest, and a size.
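To make the three-field structure concrete, here is a minimal sketch of building an OCI content descriptor for a blob; the helper name and the example layer media type are illustrative, not part of any real client.

```python
import hashlib
import json

def descriptor_for(content: bytes, media_type: str) -> dict:
    """Build an OCI content descriptor: how to interpret the bytes,
    their exact hash, and how many of them to expect."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(content).hexdigest(),
        "size": len(content),
    }

blob = b'{"hello": "kubecon"}'
desc = descriptor_for(blob, "application/vnd.oci.image.layer.v1.tar+gzip")
print(json.dumps(desc, indent=2))
```

Anything that can produce those three fields can be described, and therefore fetched, this way.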
So for a given piece of content, the descriptor tells us how to interpret those bytes, how many bytes we should expect to read when we're fetching it, and the exact hash of the content of those bytes, so we know exactly what we get when we fetch it. This allows us to fetch content over the Internet safely, and this is the most crucial bit of OCI and what I think is most interesting. You might notice that those three properties map pretty directly onto HTTP headers, so it's very straightforward to serve this content via HTTP. As you might have guessed, the registry API speaks over HTTP. The media type maps to Content-Type, the size maps to Content-Length, and the digest maps to this kind of registry-specific thing, Docker-Content-Digest. So in a registry, this is an image. An image is a JSON document that contains content descriptors pointing to one config file and one or more layers. The config file mostly just describes how to actually run a container, while the layers make up that container's file system. Layers are usually gzipped tarballs, but some folks have started doing a lot more interesting things recently. Beyond just an image, there is an image index, which similarly is a JSON document that contains content descriptors. Those point to images, and this is commonly used to represent multi-platform images. And that's basically it. That's pretty much all of the primitives in the OCI universe. If we drew some lines between these objects, we'd have this wonderful graph. Anywhere there's an arrow is a content descriptor from one object that references another. And you can see here that while some objects are manifests, other objects are blobs that the registry doesn't really know that much about; we just have arrows coming into those and none coming out. There are a couple of other interesting features that the registry knows about that are worth mentioning as well.
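The header mapping and the manifest shape just described can be sketched as follows; the digest and size values here are placeholders, and the header names are the ones named above.

```python
# A descriptor's three fields map directly onto the HTTP headers a
# registry serves for that content.
desc = {
    "mediaType": "application/vnd.oci.image.config.v1+json",
    "digest": "sha256:" + "0" * 64,  # placeholder digest
    "size": 1234,                    # placeholder size
}
headers = {
    "Content-Type": desc["mediaType"],
    "Content-Length": str(desc["size"]),
    "Docker-Content-Digest": desc["digest"],
}

# An image manifest is just a JSON document of descriptors: one config
# plus one or more layers.
manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": desc,
    "layers": [],  # layer descriptors with the same three fields
}
```

An image index has the same shape, except its descriptors point at manifests instead of at a config and layers.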
So for legal reasons, you might not want clients uploading certain pieces of content. As an example, say Microsoft doesn't want you distributing the Windows base layers. To work around this, the registry has to understand this concept of foreign layers that need to be downloaded from somewhere else, say microsoft.com, not the registry you're pulling from. Another interesting feature is that, similar to Kubernetes manifests, OCI has this concept of an annotation map where you can just stuff arbitrary strings. So we're not really just serving tarballs, right? We have this nice content-addressable object storage thing for blobs. We also have this interesting Merkle DAG API for the manifests. Beyond that, there's this generic API where you can store arbitrary things that have type safety thanks to media types, are memory safe because you know how large the content should be, are tamper-proof thanks to digests, and are human-readable because they're JSON. Thanks to Kubernetes, they're also ubiquitous, and you more or less always have access to a registry because it's cloud native. Most registries support some kind of garbage collection, and you can also put interesting arbitrary metadata on things. So this is amazing in theory, but what's important is that it actually exists and you can use it. So what else might you want to do with these primitives? This is where things start to get a little weird. When we talk about sending data between computers, we often make distinctions between different types of data. For instance, you likely don't feel like pulling a container image is the same operation as watching a video on YouTube. And in some respects, it's not. However, all data transmission ultimately boils down to an ordered stream of bits on a wire. As John mentioned, the distribution spec broadly defines two categories of data, blobs and manifests. Let's start out by talking about the former.
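A foreign (non-distributable) layer is just an ordinary descriptor with an extra `urls` field telling clients where to fetch the bytes from, and annotations are a plain string map. A sketch, with placeholder digest and a hypothetical external URL:

```python
# A foreign layer descriptor: same three fields as any descriptor, plus
# "urls" pointing at the external host the content must be fetched from
# instead of the registry itself.
foreign_layer = {
    "mediaType": "application/vnd.docker.image.rootfs.foreign.diff.tar.gzip",
    "digest": "sha256:" + "a" * 64,       # placeholder
    "size": 104857600,                    # placeholder
    "urls": ["https://example.com/layers/windows-base.tar.gz"],  # hypothetical
}

# Annotations are just arbitrary string key/value pairs.
annotations = {
    "org.opencontainers.image.source": "https://example.com/some/repo",
}
```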
In the common case, blobs often represent gzipped tarballs containing a file system layer, or, in simpler terms, a compressed directory of files. However, this is certainly not a requirement: while the OCI image spec defines the set of acceptable media types, the distribution spec has no such requirement. Learning all of this was a surprise to me because I had only ever seen the registry in professional settings. Little did I know that the registry was actually very open-minded, so different from the image spec. Registries are frequently referred to as a form of content-addressable storage. What does that mean? If you have ever used a blob storage solution, let's say an S3 bucket, you know that you can download content from the bucket using a URL. This URL frequently indicates the location of the content, but tells you nothing about the actual bits of the content. This means that when you request content from a given location, the onus is on you to check and make sure that you are given what you asked for. This is frequently referred to as location-addressed storage. Content-addressed, on the other hand, means that we can ask for exactly what we want by submitting a digest of the content rather than a location. Now, there's nothing that says that a registry may not return some malicious content even in a content-addressed system, so clients should still check that the digest of the content they receive matches what they asked for. So you could say the registry is pretty well connected. It's rare that I ever ask it to find some content for me and it's unable to do so. Another thing you may not know is that the registry is a great listener. It knows that folks communicate in different ways, so it offers a few different mechanisms to submit content. When pushing a blob, a client may be able to upload all of the content in a single request, or it may require doing so incrementally. The distribution spec accommodates both of these scenarios by supporting multiple upload workflows.
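The client-side check just described is trivial to write; here is a minimal sketch (the function name is ours, not from any real client).

```python
import hashlib

def verify(content: bytes, requested_digest: str) -> bytes:
    """Even in a content-addressed system, the client must confirm that
    what the registry returned matches what it asked for."""
    actual = "sha256:" + hashlib.sha256(content).hexdigest()
    if actual != requested_digest:
        raise ValueError(f"digest mismatch: got {actual}, want {requested_digest}")
    return content

good = b"trusted bytes"
digest = "sha256:" + hashlib.sha256(good).hexdigest()
verify(good, digest)                 # passes
# verify(b"evil bytes", digest)      # would raise ValueError
```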
Let's take a brief look at the different strategies and the advantages and drawbacks of each. The most common method for uploading a blob is chunked. Typically, this doesn't actually mean that the blob is being uploaded in multiple chunks. For instance, we'll see in a moment that Docker will upload via the chunked method, but then send the full blob in the PATCH step. The advantage of this method, and one reason it is most commonly used, is that it supports resumable pushes. This means that if connectivity is lost during upload, a client can resume pushing content without re-uploading what is already there. Another advantage is that if the content is stored uncompressed, the client can compress and calculate the digest of the compressed content in a streaming fashion, without having to actually write the compressed content to disk first. The distribution spec also supports uploading a blob monolithically via two different strategies. The first, POST then PUT, is similar to the chunked upload. However, using this strategy, we lose the ability to resume pushing in the event that our PUT with the blob's contents is unsuccessful. The last strategy is a single POST, and it is specified as optional and in practice is hardly ever used. Why? Well, an important thing to note in the previous two workflows is the step of obtaining a session ID. This is returned in the form of a location URL, but that location does not necessarily have to point back to the registry. This means that a registry may offload the actual upload to a different service, something that would not be possible in the single POST scenario. So now that we've seen the multiple different methods for uploading blobs and manifests to the registry, let's actually look at how a real client goes about uploading an image, a multi-arch image in this case. So here we're at eocker.io/kubecon, which is the namespace that we're going to push this image to.
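Before the demo, the chunked flow just described can be sketched as a sequence of requests. This is a hypothetical planning helper, not a real client; the repository name is made up, and the actual location URL would come back from each response.

```python
import hashlib

def chunked_upload_plan(repo: str, blob: bytes, chunk_size: int):
    """Sketch the distribution-spec chunked flow: POST to open a session,
    PATCH each chunk with a Content-Range, then PUT with the digest to
    commit. Returns (method, target, headers) tuples."""
    requests = [("POST", f"/v2/{repo}/blobs/uploads/", {})]
    for start in range(0, len(blob), chunk_size):
        chunk = blob[start:start + chunk_size]
        requests.append((
            "PATCH",
            "<location from previous response>",
            {"Content-Range": f"{start}-{start + len(chunk) - 1}",
             "Content-Length": str(len(chunk))},
        ))
    digest = "sha256:" + hashlib.sha256(blob).hexdigest()
    requests.append(("PUT", f"<location>?digest={digest}", {}))
    return requests

# POST, three PATCHes (2 + 2 + 1 bytes), then the committing PUT.
plan = chunked_upload_plan("kubecon/demo", b"x" * 5, chunk_size=2)
```

If connectivity drops between two PATCHes, the client simply resumes from the last acknowledged range instead of starting over.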
And we're using crane copy here, which is going to take the crossplane/crossplane:v1.4.0 image from Docker Hub and copy it over into our eocker registry here. In the UI here, you're going to see the different operations that are taking place as it goes through this. So let's go ahead and kick it off. All right, so you'll immediately see some operations start to flow through. And I'll go ahead and move this out of the way since it has completed pushing all the different components. And we have a time delay here, so it's a little more visible. But what you're going to see happening is basically checking for the existence of those blobs and then starting an upload for the content of a blob before eventually committing the blob and storing it. And once all the blobs for a given manifest are present, then we'll upload the manifest itself. And since this is a multi-arch image, when we actually get to the end of uploading our four different architectures, you'll also see a manifest list, with our v1.4.0 tag, that points to those manifests. So that'll pop up here in just a moment. There you'll see that it points to each of these. And if we look at the different operations that took place, starting from the beginning, you'll see that we first checked for the existence of the various blobs we were going to upload, before then starting to do our uploads via PATCH requests and eventually committing an upload for each with a PUT request. While we have established that the distribution spec is rather unopinionated, some registries are more picky than others. That means that submitting arbitrary content to a registry may not be accepted. However, nearly every registry will accept content that adheres to the OCI image spec, which specifies that layers come in one of a few media types that are meant to represent a file system changeset. If you have ever authored a Dockerfile, you're likely familiar with these changesets.
They are the layers that are generated when you add or remove files or perform other operations. However, unless the image is actually passed to a container runtime, these layers are nothing more than tarballs. Knowing this, we can take advantage of the fact that we can add any content to a tarball, and as long as the client that uses it in the future is able to extract it, these layers can be used as a general container for distributing content. One place this strategy is exercised in practice is the Crossplane project. Crossplane includes a package manager, which is responsible for installing various sets of Kubernetes resources and maintaining controllers for custom resource definitions. The manifests for these resources are bundled into an OCI image for easy distribution by building a stream of YAML, writing it to a file named package.yaml, and packaging it up in a gzipped tarball. The result is an artifact that can be pushed to any registry, even ones that are on the more picky side. Strategies like the one Crossplane employs are useful, but there is nothing that says a given image is a Crossplane package outside of the way that we parse it. This presents some pros and cons, and some of the negatives of the approach are being addressed by the OCI Artifacts initiative, which aims to take the flexibility of the distribution spec and provide a standardized method for defining new artifact types. These artifacts take advantage of the existing manifest spec by optionally defining custom media types for the config and layer blobs. This means that something like a Helm chart can define Helm-specific configuration, rather than providing an empty or mostly empty image config. It also means that dedicated tooling can identify the type of artifact before attempting to handle it. A drawback to this approach is an explosion of media types and an exacerbation of the fragmentation of tooling needed to handle OCI artifacts.
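The YAML-in-a-tarball packaging approach described above can be sketched in a few lines; this is an illustration of the idea, not Crossplane's actual build code, and the YAML content is a stand-in.

```python
import hashlib
import io
import tarfile

# A stream of YAML manifests, written to package.yaml inside a gzipped
# tarball -- a layer that any registry will accept.
yaml_stream = b"apiVersion: apiextensions.k8s.io/v1\nkind: CustomResourceDefinition\n"

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    info = tarfile.TarInfo(name="package.yaml")
    info.size = len(yaml_stream)
    tar.addfile(info, io.BytesIO(yaml_stream))

layer = buf.getvalue()
digest = "sha256:" + hashlib.sha256(layer).hexdigest()
```

From the registry's point of view, `layer` is just a blob with a digest and a size; only the client that later extracts it knows it is a package.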
When designing the Crossplane packages, one of the things we wanted to ensure was that, while we do offer dedicated tooling, users could interact with Crossplane packages using nothing more than their existing registry client of choice. As more and more artifact and media types are introduced, we start to lose out on the original benefit of the registry and related tooling: that it's ubiquitous. Remember, though, from the perspective of just the distribution spec, we don't care. So now we've seen two ways that registries can be used to distribute data that is not meant to be run as a container, but we really haven't pushed it to the limit yet. If we look at the two data types in the distribution spec, manifests and blobs, it's clear that manifests are the limiting factor here. After all, a blob is just content, and content is just bits. In the previous two examples, we either shoehorned our content into an accepted media type or defined a new one. However, if we don't ever have a manifest referencing our blob, the content is irrelevant. Let's take a look at some arbitrary content we might upload. So as mentioned, the distribution spec actually doesn't require that a blob ever be tied to a manifest. In practice, a lot of registries will require that a blob be tied to a manifest or it'll eventually be garbage collected. But we're going to show some examples of what you could do if you weren't tied to any of the restrictions around various media types that a registry might impose. So we're going to use some potentially nontraditional blobs and see what it looks like to upload them manually, just using curl. So in my KubeCon directory here, I have our hello.tar and a kubecon.jpg. I'll start off by doing the kubecon.jpg. If we open this up, this is just the logo for KubeCon.
And I'm going to manually do what many registry clients are actually going to do and start off our chunked upload by doing a POST to get the location that gets sent to us by the registry. So once we get this, and let me move this up a little bit so folks can see, you're going to see that we're given a location in our response. So I need to take this UUID and make sure I use it in my next request, and I'll go ahead and craft that. The next thing I'm going to do after I've gotten my location is make a PATCH request with the content of my upload. And you'll see here that in our UI, we actually see that the upload is present, but it hasn't been tied to a blob yet. So the next thing we need to do is actually commit this upload, saying that this is the full amount of content that's present in this blob, and also provide the digest that should exist for the blob. So I'm just going to take the SHA-256 of this kubecon.jpg and then append that to our PUT request. Our PUT request goes to the same location, but it appends a parameter to the URL with our digest, the one we just calculated. When we submit this, you'll see that our blob is now actually present in the registry and we have a nice pointer here to our upload. The last thing I want to do is go ahead and pull this blob, and that can easily be done with a GET request as well. I'm just going to pipe the output of this blob into a different JPEG, pull.jpeg. You'll see that we represented that pull here in the UI, and if we go ahead and open up our pull.jpeg, we'll see that we got the exact same image back. Thanks, Dan. It seems like this registry is a really helpful person, but can we trust them? So generally, if you're pulling an image from a registry, you intend to run it, often in prod. This makes the registry a really juicy target for bad actors. If hackers could manipulate what's in your registry, it's trivial for them to execute arbitrary code on your cluster.
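To spell out the arithmetic behind that curl demo: the digest we compute locally is both the parameter that commits the upload and the address we later pull the blob from. The host, repository, and `<uuid>` below are placeholders, and the bytes stand in for the real kubecon.jpg.

```python
import hashlib

content = b"<the bytes of kubecon.jpg>"  # stand-in for the real file
digest = "sha256:" + hashlib.sha256(content).hexdigest()

# Commit the in-progress upload (the location with its UUID came back
# from the initial POST), then pull the blob back by its digest.
commit_url = f"https://registry.example.com/v2/kubecon/blobs/uploads/<uuid>?digest={digest}"
pull_url = f"https://registry.example.com/v2/kubecon/blobs/{digest}"
```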
I stole this excellent diagram from the SLSA project. These last two hops are what we're worried about here. How do we avoid the bad red things? So if someone compromises the registry, we don't want them to be able to get things into our cluster. The answer to "can I trust the registry?" is: if you deploy by digest, yes. The answer is that you should always deploy by digest. Write that down, because that's going to be the key takeaway. If you deploy by digest, you get all the benefits of content addressability that Dan mentioned earlier for free, and you don't have to worry about what I'm about to show you. However, if you deploy things by tag like most people, your trusted computing base just got a lot bigger. Can you trust your registry domain? Can you trust that no one's man-in-the-middling your client, or manipulating what mirrors you use, or has stolen credentials for your registry, or has even bypassed your registry credentials? Most people know that piping curl into bash is not a great idea. This is bad enough to happen on your workstation, but the stakes are a lot higher with a registry. When you deploy by tag, you're basically piping curl into bash in production. Don't do that. Again, you should always deploy to production by digest. If you can't do this for some reason, I would invite you to rethink the design of your supply chain. Why am I so passionate about this? The answer is typos, and if you know me, you'll think that's funny. When I first started at Google, I noticed a pattern. A lot of our customer support requests, including from internal teams, ended in the exact same way. Eventually, someone noticed there's a typo and they closed the bug. This one in particular happened a lot. You'll notice we run the Google Container Registry, not the Google Registry Container. You might be thinking, who cares? That pull failed, and they eventually noticed the typo. Well, what if the pull didn't fail?
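The deploy-by-digest rule is easy to check mechanically; a minimal sketch (real image reference parsing has many more edge cases, so treat this as an illustration of the distinction, not a production validator):

```python
def is_pinned_by_digest(image_ref: str) -> bool:
    """True if the reference names immutable content rather than a
    mutable tag that the registry can point anywhere it likes."""
    return "@sha256:" in image_ref

assert is_pinned_by_digest("gcr.io/example/app@sha256:" + "0" * 64)
assert not is_pinned_by_digest("gcr.io/example/app:latest")
```

A check like this makes a useful admission-time gate: reject any workload whose image reference fails it.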
In the case where you're trying to pull by digest, your client would check the response it gets from the server against the digest you requested, and it would immediately fail if they didn't match. But if you're trying to pull an image by tag, grc.io could respond with any valid image and your client wouldn't complain. When I first realized this, I was kind of terrified. I immediately tried to register grc.io, but found that somebody already owns it. What would happen if that person was, say, malicious? I eventually convinced myself that this wasn't a huge problem. I just don't make typos, right? That's easy enough. However, a few years later, I saw this DEF CON talk by Artem Dinaburg, and I have been perpetually terrified ever since. So let's talk about bitsquatting. When you pull an image from a registry, you're making a bunch of HTTP requests to some server. In this case, we end up making an HTTP request to Docker Hub to get the manifest for the tag 2 of the registry image. And HTTP is generally layered on top of TCP/IP, so the first thing we need to do is resolve docker.io to some IP address. This generally works, and we send the request to Docker Hub, which we trust, more or less. But how could this go wrong? I highly recommend just watching Artem's talk or reading the blog post, but I'm going to give a very hand-wavy explanation of bitsquatting. Computers are imperfect. Sometimes a bit will go rogue and just flip from a zero to a one or vice versa. Many things can cause this: for example, a lot of heat, a faulty manufacturing process, or even cosmic rays from outer space. If you happen to get remarkably unlucky, this can happen to your machine and flip a bit somewhere in a domain name, for example, in an image reference. Boom. We flipped a bit in the first character of docker.io. What now? As before, the first thing we need to do is resolve an IP address. Unfortunately, we're now looking for the wrong server.
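It's worth seeing just how many plausible domains a single flipped bit produces. This sketch enumerates the single-bit flips of the first character of docker.io, keeping only the ones that are still legal hostname characters:

```python
def one_bit_flips(domain: str, index: int = 0):
    """Yield every variant of `domain` reachable by flipping exactly one
    bit in the character at `index`, keeping only legal hostname chars."""
    original = ord(domain[index])
    for bit in range(8):
        flipped = chr(original ^ (1 << bit))
        if flipped.isascii() and (flipped.isalnum() or flipped == "-"):
            yield domain[:index] + flipped + domain[index + 1:]

variants = list(one_bit_flips("docker.io"))
# Includes eocker.io, focker.io, locker.io, and tocker.io -- all
# registrable -- plus Docker.io, which DNS treats as docker.io anyway.
```

Five of the eight possible flips of that one byte still look like domain names, and an attacker only needs to own one of them.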
Instead of Docker Hub, our client ends up talking to whoever owns eocker.io. So I happen to own eocker.io. This was an enticing purchase because it's both a single bit flip and a single character away from docker.io on a QWERTY keyboard. It would take a little too long, and probably be unsafe, to actually flip a bit via radiation, so I'm going to cheat for the demo and just pull from eocker.io directly. So first let's establish a control group to see what we expect to happen. If we docker run this registry image, we're going to have a registry running locally. So we're going to pull from Docker Hub, and it's up and running. And just to demonstrate that, let's actually copy something into there. So you can see lots of traffic. It worked. I copied the thing. So that's what we expect to happen. Now, instead of running this from Docker Hub, what happens if, by some magical, unlikely sequence of events, we accidentally pull this from eocker.io? So it can't find it locally, we pull from eocker.io, and it starts up in the same way, but you'll notice this log line, which wasn't in the previous image. Interesting. And you'll also notice that it still works: I can still copy images to it, and I can still list what I've pushed there. It works as a registry. So what just happened? Imagine that an attacker wants to access your production cluster. They could serve you a Bitcoin miner, and that might just work for a little bit, but you would probably notice pretty quickly that your deployments weren't doing what they're supposed to do and were instead just mining Bitcoin. So what I've done for the demo is set up a registry that just proxies your request back to Docker Hub, but before serving it to you, it mutates the image to inject a malicious payload.
In this case, that malicious payload just logs a message and immediately runs your intended image, but you could certainly imagine doing something much more nefarious, like Bitcoin mining, a reverse shell, proxying all the traffic to your pod, really anything. You can see here that I've just replaced the entrypoint with something a little evil. But how likely is that actually to happen, ever? Maybe one in a billion, one in a trillion. Maybe this isn't really something to worry about. So it's hard to say concretely how likely this is to happen, but based on some numbers Docker Hub put out early this year, I think that maybe, possibly, this is something you should worry about: at 300 billion pulls, we're getting close to just very large numbers. So this stuff does happen in practice. You don't have to take my word for it; here are some awesome sources for further reading. And one more time: always deploy by digest, because computers are broken and terrible and you never know who's on the other side of that TCP connection. But John, didn't we just say that the registry is our friend? Surely if we did trust the registry, there could be some pretty interesting things we could do. I think we should wrap this up with a nice little chat with the registry. So where have we been so far in this talk? We covered how the registry stores content, and that content is just bits. We also covered how manifests can provide some structure to those bits and, when used recklessly, can lead to some pretty severe repercussions. But what if we threw caution to the wind and really took advantage of the flexibility of the distribution spec? Well, we could build a chat service. And in fact, we did. Since we aren't able to be in person with folks at KubeCon this year, we thought we could at least do a live demo.
As we round out this talk with one of the most extreme examples of pushing the registry to the limit, feel free to join us at eocker.io/chat to talk about how this presentation went and ask any questions you might have. So how does this work? It's built on what John has advised us to never do: pull by tag. Since the content of a tag is dynamic, we can essentially build our manifest just in time. Each layer of our chat image contains a log of messages. In between pulls, new messages are appended to an in-progress upload. The next time someone pulls the chat image, we commit that upload as a blob, then serve a manifest that points to it and all the other blobs. On the client side, when the image is run, all previous chat messages are sent to standard out. The binary connects to a WebSocket service on eocker.io, and the user is shown any new messages that are sent and is able to send their own. The WebSocket service not only brokers communication between users running the chat image, but also uploads new messages to the in-progress upload in the chat repository in the registry. Therefore, when the image is pulled again, the user has all messages that have been sent up to that point and starts receiving new ones from the chat service. We hope you have enjoyed our presentation, and we especially hope you now see the registry in a new light. And next time you docker pull, remember that the thing you're talking to is more than just the glorified FTP server you were led to believe it is.
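For the curious, the just-in-time manifest trick behind the chat demo can be sketched in a few lines. This is a hypothetical helper illustrating the idea, not the demo's actual code: each batch of messages becomes a new layer descriptor, and the manifest served for the tag is rebuilt on every pull.

```python
import hashlib

def append_message_layer(manifest: dict, messages: bytes) -> dict:
    """Turn a batch of chat messages into a layer and append its
    descriptor to the manifest served for the (mutable) chat tag."""
    manifest["layers"].append({
        "mediaType": "application/vnd.oci.image.layer.v1.tar+gzip",
        "digest": "sha256:" + hashlib.sha256(messages).hexdigest(),
        "size": len(messages),
    })
    return manifest

chat = {"schemaVersion": 2, "config": {}, "layers": []}
append_message_layer(chat, b"hello from kubecon\n")
append_message_layer(chat, b"registries are fun\n")
```

Because the tag is mutable, every pull can legitimately see a different manifest, which is exactly why this works and exactly why you should never deploy by tag.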