I think we're good to go. Good afternoon. Welcome to From Docker Push to Bytes on Disk. My name is Wayne Warren, and I'm a senior engineer at DigitalOcean. I work on the DigitalOcean Kubernetes product and the DigitalOcean Container Registry product. And I'm Adam Wolfe Gordon. I work with Wayne at DigitalOcean, where I'm also a senior engineer. Today we're going to talk about container registries: what they are and how they work, with a good amount of detail.

Before we get into the details, though, I did want to give a little bit of a disclaimer, which is that container registries can do a lot of stuff these days. They can store artifacts and things that aren't containers. They can have additional metadata, like bills of materials, signatures, and things like that. We are going to ignore all of that today. This talk is in the 101 track, so we're going to stick to the most basic use case: pushing and pulling container images from your registry.

Before we get into a bunch of details, let's answer a really basic question: what is a container registry? To answer that, let's start with something most of us have probably seen or done before, and that's a docker push. You docker push, some ASCII arrows fly across the screen, and your container image is now somewhere in the cloud. Later, your coworker or some random stranger on the internet or a deployment system can pull that image and run it somewhere else. This is the power of containers, as we all know: you can share the image, and it should just run on somebody else's computer. The thing that stores those images in the cloud is a container registry.

So we know what a container registry does for us, but what's underneath? It's a content-addressable data store. What does that mean? It's an object store, so you can put data there, and it's content-addressable, which means that each object in the store, instead of being identified by some arbitrary key or path name, is identified by a digest of its contents.

In a container registry, we have a few different kinds of objects. First, we have blobs. These are the basic objects stored in a container registry. Then we have manifests, which are metadata about images and the contents of the blobs: they tell us which layer blobs belong to each image. You can think of a manifest as a set of pointers to layers. And finally, we have tags. Tags are drawn in a different color here because they're special: they have human-readable names, and they're mutable. Since manifests and blobs are identified by their contents, they're immutable; change the contents, and you change the identifier as well. Tags, on the other hand, you can update later.

There are many different registry implementations, but the one we'll talk about today is a CNCF project called Distribution. This was previously the open source registry implementation released by Docker. They donated it to the CNCF in 2020, and it is the reference implementation for other container registries. But it's also used by many people in production, including us at DigitalOcean: our hosted container registry platform is built on top of this code base.
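To make content addressability concrete, here's a minimal sketch in Go (the language Distribution itself is written in) of how a blob's identifier is derived from its bytes; the layer content here is made up for illustration.

```go
package main

import (
	"crypto/sha256"
	"fmt"
)

// digestOf derives a content-addressable identifier for a blob: the key
// is a hash of the bytes, so changing the content necessarily changes
// the identifier.
func digestOf(content []byte) string {
	return fmt.Sprintf("sha256:%x", sha256.Sum256(content))
}

func main() {
	layer := []byte("pretend this is a compressed filesystem layer")
	fmt.Println(digestOf(layer))
}
```

The same bytes always produce the same digest, which is exactly what lets a registry deduplicate layers shared between images; tags are just mutable, human-readable names pointing at a digest.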
So we've heard a little bit about what a container registry is and what it looks like inside. One mystery that we haven't talked about is those ASCII arrows that fly across your screen when you do a push. How does your container image actually get to the registry, and how does that data get into the cloud? The answer is the OCI Distribution spec. This is the HTTP API, now standardized, that registry clients use to talk to registry servers. It reached v1.0 last year, so now it's really official. And it's vendor-neutral: it was originally written by Docker to go with their registry implementation, but it's been adopted and updated by the Open Container Initiative, or OCI, which is a community, and there are lots of different implementations on both the client and the server side.

So now let's dig into the distribution spec and the Distribution code base, and we'll walk through some practical examples. We'll start with what a push and a pull look like from an API perspective in the distribution spec, then dive into the Distribution code base to see how it's structured and how it implements image pushes on the server side. And finally, we'll wrap up with garbage collection, which is basically the process of freeing up disk space after deleting images.

Let's start by looking at what happens in the API when you push a container image to a registry. Remember, this spec is standardized and vendor-neutral, so everything we're going to talk about in this section is applicable to any implementation of it. We're going to use Docker for our example commands, but that part isn't important; you could be using some other tool.

Remember this diagram from earlier. This shows everything in the container registry, all the objects that need to make it to the registry when we do a push. We talked earlier about how the manifest is a set of pointers to these layers; it references the layers. That means that as a client, we need to push the layers first and then the manifest, because the spec says a registry is allowed to reject an object that references non-existent objects. You'll notice that I didn't talk about the tag. That's because the tag gets pushed as part of pushing the manifest. It's an additional piece of metadata attached to the manifest, not really a separate object in the registry, even though we think of it that way conceptually.

The other thing to point out before we talk about the mechanics of the API is that different container images can share layers. In this diagram, we have a layer whose digest starts with d655, and it's shared between two different images: one that's already in the registry (the one on the bottom, drawn at full saturation) and one that we want to push to the registry (on top). As a registry client, we can push that layer again if we want to. There's no harm in doing that, other than wasting some time, because the registry is going to deduplicate it. But we don't have to. As an optimization, we can figure out that the registry already has that layer and skip pushing it, and most real client implementations do this.

So, jumping into the spec and what the requests and responses look like: the first thing we're going to do is check whether the layers we want to push are already in the registry. To do that, we use this HEAD endpoint. We identify the blob using its digest, which is a hash of its contents, and the registry returns a 200 response code if it already has that blob, or a 404 if it doesn't, which tells us we need to push it.
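As a rough sketch, here's what that existence check might look like in Go. Authentication is omitted for simplicity, and the registry host, repository name, and digest are hypothetical values.

```go
package main

import (
	"fmt"
	"net/http"
)

// blobExists performs the spec's existence check:
//
//	HEAD /v2/<name>/blobs/<digest>
//
// A 200 means the registry already has the blob; a 404 means we need
// to push it.
func blobExists(registry, repo, digest string) (bool, error) {
	url := fmt.Sprintf("https://%s/v2/%s/blobs/%s", registry, repo, digest)
	resp, err := http.Head(url)
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

func main() {
	exists, err := blobExists("registry.example.com", "my-app",
		"sha256:d6550a3cdeadbeef0000000000000000000000000000000000000000000000ff")
	fmt.Println(exists, err)
}
```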
Once we've figured out which blobs we need to push, we're going to actually push them. This is a little more complex, because the distribution spec gives us three different ways to do it. These three methods all do effectively the same thing: they perform the same three steps, but in different combinations of requests.

The three steps are: first, initiate an upload. Initiating the upload lets the registry know that we're about to push some data to it, and it returns a session ID that we can use in subsequent requests to refer to that upload. Second, actually push the data. That's, I think, the obvious part: you've got to provide the data to the registry. Finally, finalize the upload. When we finalize the upload, we provide the digest we've calculated for our layer blob, and the registry can then calculate the checksum itself and make sure it got the right data, that what we said we were going to send is what we actually sent. This finalization step is important for another reason beyond integrity: it makes sure the registry won't make in-progress uploads available for clients to download. Because we don't tell the registry the digest until the end, it isn't even able to make that upload available to a client until we've pushed all the data and it knows it's all there.

Each of these methods, these three ways of pushing, does those three steps in different combinations. In some cases, two of the steps or all three are combined into a single HTTP request. But the most common method we've seen from clients is the chunked method, on the right there, in which each of the three phases has its own HTTP request. The first request is a POST that initiates the upload, and the registry returns the session ID that we'll use in the rest of the requests as a client. Then we upload chunks of data using PATCH requests, again referring to that session ID. Finally, once we've uploaded all our data with those PATCHes, we do a PUT request. That's where we provide the digest and say that we're done pushing, and the registry then makes the blob available for download.

In practice, what we've seen from clients, in particular the Docker client, is that they use the chunked upload method but provide all of the data in a single chunk. The reason I wanted to call this out is that there's no capability negotiation in the distribution spec today. There's no way for a registry implementation to communicate to a client that it only supports up to a certain chunk size, or only a certain number of chunks, so clients can make really any assumptions they like. The assumption Docker makes is that it can transmit the entire blob in a single PATCH request. So if you've got a 10 gigabyte filesystem layer, you're going to get a 10 gigabyte PATCH request with a 10 gigabyte body. That's important to know if you're implementing a registry or hosting a registry like we are, because all kinds of timeouts that you might assume are reasonable turn out not to be reasonable when you have a client making a gigantic PATCH request.
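Here's a hedged Go sketch of that POST-PATCH-PUT sequence, done Docker-style with the whole layer in a single PATCH. It skips authentication and response-status checking, and it assumes the Location headers the registry returns are absolute URLs; a real client also has to handle relative Locations.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"net/http"
	"strings"
)

// pushBlob sketches the three-step chunked upload:
//  1. POST  /v2/<name>/blobs/uploads/  starts a session (Location header)
//  2. PATCH <session URL>              uploads the data
//  3. PUT   <session URL>?digest=...   finalizes, revealing the digest
func pushBlob(base, repo string, layer []byte) error {
	// 1. Initiate: the registry replies 202 with a session URL in Location.
	resp, err := http.Post(base+"/v2/"+repo+"/blobs/uploads/", "", nil)
	if err != nil {
		return err
	}
	resp.Body.Close()
	session := resp.Header.Get("Location")

	// 2. Upload the data. Like Docker, we send the whole layer as a single
	// chunk, so a 10 GB layer means a 10 GB PATCH body.
	req, err := http.NewRequest(http.MethodPatch, session, bytes.NewReader(layer))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/octet-stream")
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	session = resp.Header.Get("Location") // the registry may return an updated session URL

	// 3. Finalize: only now does the registry learn the digest, so it can
	// verify the bytes and make the blob available for download.
	digest := fmt.Sprintf("sha256:%x", sha256.Sum256(layer))
	sep := "?"
	if strings.Contains(session, "?") {
		sep = "&" // preserve any query parameters the registry put in Location
	}
	req, err = http.NewRequest(http.MethodPut, session+sep+"digest="+digest, nil)
	if err != nil {
		return err
	}
	resp, err = http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	resp.Body.Close()
	return nil
}

func main() {
	layer := []byte("example layer bytes")
	if err := pushBlob("https://registry.example.com", "my-app", layer); err != nil {
		fmt.Println("push failed:", err)
	}
}
```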
One other option the spec allows is called blob mounting. Everything in a registry is scoped to a repository, which is kind of a namespace; it's what you might think of as the image name when you're thinking about your container images. As a client, if you know that you share a layer with an image in another repository, you can ask the registry to please just share that layer between the two repositories and avoid uploading it. Registries don't have to allow this; it's optional. And the registry might not have that layer: maybe you didn't push it, maybe it got deleted. But if the registry does allow it and does already have the layer, and you include the extra parameters on the POST request when you initiate the upload (the digest to mount and the repository to mount it from), it returns a 201 response, and that means you don't have to go through the rest of the sequence of pushing your layer. The layer is already there, and it'll just be reused from that other image. If the registry doesn't allow it or doesn't have the data, it returns a 202, and you proceed with your upload as normal. So this is an additional option on the POST request that initiates the upload, and it lets you avoid pushing data in many cases.

Once the registry has all of our layers, we can push the manifest. This is relatively simple; there's only one way to do it. The little variation that there is, is a parameter called reference in the URL, which can be either a digest or a tag. I'd say that if it quacks like a digest, it's a digest, and the way a digest quacks is that it has a colon in the middle, separating the algorithm, like SHA-256, from the checksum or hash, which is the big long hex string you've probably seen. Registries can differentiate between the two formats this way, because you're not allowed to have a colon in a tag. If you push using a tag, that implicitly creates the digest reference in the registry as well, because remember, everything in the registry is identified by the hash of its contents, by this digest. So if you push by tag, you end up with two ways to refer to your manifest instead of just the digest, but you can push either way.

So, spec-wise, here's what the image push looks like end to end. First, we check whether each of the layers we want to push exists in the registry, using that HEAD endpoint we saw. Then we push each of the layers that we need to push, in this example using the three-part POST-PATCH-PUT chunked upload. And finally, we push the manifest, using either a tag or the digest like we just saw. And that's it. At the API layer of abstraction, this is what an image push looks like to a registry.
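As a sketch of that last step, here's roughly what pushing a manifest could look like in Go. The media type shown is the standard OCI image manifest type; as before, auth is omitted, and the helper and its arguments are illustrative rather than a real client.

```go
package main

import (
	"bytes"
	"fmt"
	"net/http"
)

// pushManifest sketches PUT /v2/<name>/manifests/<reference>. The
// reference may be a tag ("latest") or a digest ("sha256:..."); if it
// quacks like a digest (it contains a colon), the registry treats it
// as one.
func pushManifest(base, repo, reference string, manifest []byte) error {
	url := fmt.Sprintf("%s/v2/%s/manifests/%s", base, repo, reference)
	req, err := http.NewRequest(http.MethodPut, url, bytes.NewReader(manifest))
	if err != nil {
		return err
	}
	req.Header.Set("Content-Type", "application/vnd.oci.image.manifest.v1+json")
	resp, err := http.DefaultClient.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	// 201 Created means the manifest, and the tag if we pushed by tag,
	// now exists in the registry.
	if resp.StatusCode != http.StatusCreated {
		return fmt.Errorf("unexpected status: %s", resp.Status)
	}
	return nil
}

func main() {
	manifest := []byte(`{"schemaVersion": 2}`) // stand-in manifest body
	err := pushManifest("https://registry.example.com", "my-app", "latest", manifest)
	fmt.Println(err)
}
```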
We're going to go into another layer of detail in a couple of minutes, but before we do that, I wanted to look at the other side, the docker pull, just for completeness, so that we introduce the rest of the endpoints in the spec. As a reminder, back to this diagram again: these are the objects in the registry, and as a client, these are the things we need to pull. If we want to pull those layers, we need to know which ones they are. So, as you might expect, this is kind of the opposite of the push: we pull the manifest first, and then we can pull the layers, once we have the manifest and know which layers they are. Pulling a manifest looks a lot like pushing a manifest. Remember the duck: you can use either the tag or the digest. The only difference here is that it's a GET endpoint instead of a PUT endpoint. Pulling blobs is much simpler than pushing blobs. It's a single endpoint, the same as the HEAD endpoint we used to check for existence, except now it's a GET, and you get back the data that's in the layer. That's really pretty simple.

So, spec-wise again, here are the endpoints used when you pull an image. We've included the two HEAD endpoints that you can use to check for existence. You don't have to do that when you're pulling an image, but you might want to, just as an optimization to make sure the registry has everything you need. But really, you're just going to use two GETs: one GET to fetch the manifest, and then a series of GETs to fetch all the layers you need. And with that, I'm going to hand things off to Wayne, who's going to talk about how all of this is actually implemented, which I think is the more complex part than the spec I just talked about.

Thanks, Adam. Now that we've covered the practical, user-facing elements of the distribution client-server interactions, let's pop open the hood and take a look at the internals. The internals can be roughly understood in terms of four layers. First, the HTTP API handlers. Second, the OCI abstraction interfaces, which allow the API handlers, and developers using Distribution as a library, to operate on abstractions defined by the spec. Third, the storage driver abstraction, which allows those OCI abstractions to be defined in a way that is generic over backend implementations. And finally, the backend implementations themselves, which let users or operators of Distribution choose their backend. This is of course an oversimplification that leaves out numerous features provided by Distribution, but it's useful in that it allows us to roughly understand what's going on when images are pushed to, pulled from, or deleted from a registry.

Now let's take a closer look at each of these layers. The HTTP layer doesn't contain many surprises. It's a straightforward mapping from HTTP endpoints to handlers, which themselves make use of the OCI abstractions. In an ideal world, you might expect the OCI abstractions to be a relatively straightforward mapping between types and their operations. In the real world, and especially in an open source project built over a long time by many contributors with different goals, different ideas about how code should be structured, and different use cases for that code, things tend to get a little bit messy. I don't want to belabor the details too much here, or to disparage the code by representing it as overly complex. Instead, I just want to give a sense of that complexity, and to emphasize that in spite of it, this code can be really useful for us as container registry operators, since it serves not only the registry API itself but can also form part of the backend for product-specific features, like the container registry product at DigitalOcean.

The last interface I'd like to visit is the storage driver. This interface is what enables the OCI abstractions, and the container registry more generally, to target a variety of backend storage systems: from in-memory, which can be useful for local testing, to production-ready object storage systems that speak an S3 API, which includes AWS S3, various open source implementations, and DigitalOcean Spaces, which exposes an S3-compatible API as well.
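To give a flavor of it, here's an abridged, paraphrased version of that storage driver contract in Go. The method set is close to Distribution's real interface but simplified, so treat it as a sketch rather than the exact upstream definition.

```go
package driver

import (
	"context"
	"io"
)

// StorageDriver is roughly the contract a new backend has to satisfy.
// Paths are registry-internal paths; how they map to the backend (local
// disk, S3 keys, and so on) is up to the implementation.
type StorageDriver interface {
	// GetContent and PutContent read and write small whole objects,
	// such as upload session state.
	GetContent(ctx context.Context, path string) ([]byte, error)
	PutContent(ctx context.Context, path string, content []byte) error

	// Reader and Writer stream large objects, such as layer blobs. The
	// writer can resume an in-progress upload when append is true.
	Reader(ctx context.Context, path string, offset int64) (io.ReadCloser, error)
	Writer(ctx context.Context, path string, append bool) (FileWriter, error)

	Stat(ctx context.Context, path string) (FileInfo, error)
	List(ctx context.Context, path string) ([]string, error)
	Move(ctx context.Context, sourcePath, destPath string) error
	Delete(ctx context.Context, path string) error
}

// FileWriter extends io.WriteCloser with upload-session semantics.
type FileWriter interface {
	io.WriteCloser
	Size() int64                  // bytes written so far
	Cancel(context.Context) error // abandon the upload
	Commit(context.Context) error // finalize the upload
}

// FileInfo describes a stored object.
type FileInfo interface {
	Path() string
	Size() int64
	IsDir() bool
}
```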
So, for a registry operator who wanted to make Distribution work with a new backend, this is the interface they would have to implement, and then they could run a registry on top of whatever their backend is.

All right, so up to this point we've talked about the broad strokes of the HTTP interactions involved in image pushes and pulls, as well as a high-level overview of the Distribution internals. Next, we'd like to make good on the promise of the talk and discuss how these elements fit together, from Docker push to bytes on disk. If you recall, earlier in the presentation we illustrated an ideal image push workflow where, at a high level, the client first pushes all image layers, then pushes the manifest referencing those layers. To illustrate image push to bytes on disk, we'll be zooming in to examine the interface methods involved in uploading an individual image layer. More specifically, we'll take a look at the HTTP PATCH function for the blob uploads endpoint. Here you can see the HTTP endpoint, where <name> is the name of the image within the registry (we've also referred to it as the repository, because that's what an image name is called in the spec) and <session_id> is the session ID shared across all of the chunked-upload PATCH requests.

So what exactly do we mean by "from Docker push to bytes on disk"? In the context of the blob upload PATCH request, what we want to do is highlight all the interface methods and objects involved, in the sequence in which they're called, to store a chunk of a layer in the configured backend. In our case, the configured backend is an S3 API. At a high level, the PATCH involves three phases: authentication, resuming the session, and uploading the data.

During the authentication phase, we obtain credentials from the PATCH request headers and validate those credentials in a way that depends on the method in use; a few potential authentication methods are basic auth, bearer tokens or JWTs, and OAuth.

After we authenticate, we can move on to resuming the session. We need to resume an established session during each PATCH request in order to continue uploading a given blob to the backend. This is keyed on the session ID, a path parameter in the PATCH request. First, we get the blob store associated with the specific repository, which is an interface in front of the storage driver that provides blob-oriented upload-session semantics. We validate the session ID given in the request by attempting to retrieve stored session state keyed on the ID, using the storage driver's GetContent method, which in this case translates to a GetObject S3 API call. Once we've validated the session ID, we retrieve a writer instance from the storage driver to continue the in-progress blob upload in the next phase of the request. I realize, by the way, that this is maybe going into a little more detail than you might want out of a 101 session, but the slides are available online if you're interested in taking a closer look.

Anyway, the next phase is the data upload phase, where we stream the incoming PATCH request body using the S3 blob writer we just received when resuming the session. Practically speaking, this is implemented using an io.Copy that reads from the PATCH request body and writes to the S3 blob writer as bytes are received from the client connection. io.Copy will call Write on the S3 blob writer repeatedly as bytes are streamed in from the PATCH request body. So we don't buffer anything in memory here.
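Here's a stripped-down sketch of that data phase in Go. It assumes a writer with the upload-session semantics from the storage driver sketch earlier has already been recovered for the session, and it leaves out auth, session validation, and Content-Range handling.

```go
package handlers

import (
	"fmt"
	"io"
	"net/http"
)

// blobWriter is the slice of the storage driver's FileWriter interface
// this sketch needs: a streaming writer that knows its running size.
type blobWriter interface {
	io.Writer
	Size() int64
}

// patchBlobData streams one chunk of a blob upload straight from the
// request body into the backend writer. io.Copy calls Write repeatedly
// as bytes arrive from the client connection, so the layer is never
// held in memory all at once.
func patchBlobData(w http.ResponseWriter, r *http.Request, bw blobWriter) {
	if _, err := io.Copy(bw, r.Body); err != nil {
		http.Error(w, "upload failed", http.StatusInternalServerError)
		return
	}
	// Report how much of the upload the registry now has, and leave the
	// session open for more chunks.
	w.Header().Set("Range", fmt.Sprintf("0-%d", bw.Size()-1))
	w.WriteHeader(http.StatusAccepted) // 202
}
```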
Well, we don't buffer the entire request in memory, I should say. We do, however, buffer a little bit. Internally, for our S3 backend, the writer makes use of a multi-part upload, and each part in an S3 multi-part upload has to meet a minimum size requirement imposed by the S3 API itself. Because of this, write calls to the writer that don't meet the minimum size will buffer locally, until that requirement has been met and the part is pushed, or until the upload is explicitly committed through the storage driver. Commit is a method on the storage driver, which I don't think I illustrated here very well. Also note that this diagram shows the CreateMultipartUpload and CompleteMultipartUpload calls, but those aren't invoked on every PATCH request. CreateMultipartUpload is only ever invoked on the first PATCH request, and CompleteMultipartUpload doesn't happen during any PATCH request at all; instead, it happens during the final PUT in the POST-PATCH-PUT chunked upload sequence.

So, yeah, this more or less wraps up our somewhat hand-wavy explanation of how bytes traverse Distribution from HTTP to bytes on disk. I call it hand-wavy because we stopped before we got to the disk, at the S3 API, because different S3 APIs can be implemented in different ways, and that's not the topic here.

Now that we've covered how the bytes get onto the disk, kind of, let's consider how they get off the disk. For Distribution, this happens through a process known as garbage collection. In some programming languages, garbage collection is a memory-recovery feature where the runtime detects unused blocks of allocated memory and deallocates them, freeing them for subsequent use. Similarly, in Distribution, garbage collection is a disk-usage-recovery feature.

To understand what garbage collection is in the context of Distribution, let's revisit the container registry diagram from earlier, where we had just pushed a new image and updated the latest tag to point at it. The old image, represented here by the top manifest, is now in an untagged state. While this manifest could be deleted directly through the API, untagged manifests can be garbage collected; that is, they can be detected as untagged and automatically deleted. Once a manifest is deleted, either manually or by garbage collection, it may leave behind what we call unreferenced layer blobs: blobs that are no longer pointed to by any manifest in the registry. Such blobs are then eligible for either manual deletion or garbage collection, similar to untagged manifests. After garbage collection runs, disk space is freed for use by subsequent image pushes. Garbage collection, then, is really a convenience feature that removes the need for the user to manually delete things that could be deleted.
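Conceptually, this is a mark-and-sweep over the registry's contents. Here's a toy Go sketch over in-memory maps; Distribution's real garbage collector walks the storage backend rather than maps like these, but the shape of the algorithm is the same.

```go
package main

import "fmt"

// garbageCollect is a toy mark-and-sweep: manifests maps a manifest
// digest to the layer digests it references, tagged records which
// manifests still have tags, and blobs stands in for the blob store.
func garbageCollect(
	manifests map[string][]string,
	tagged map[string]bool,
	blobs map[string][]byte,
	removeUntagged bool,
) {
	// Mark: delete untagged manifests (if asked to), then record every
	// blob that a surviving manifest still references.
	referenced := make(map[string]bool)
	for digest, layers := range manifests {
		if removeUntagged && !tagged[digest] {
			delete(manifests, digest) // untagged manifests are collectible
			continue
		}
		referenced[digest] = true // a manifest is itself stored as a blob
		for _, layer := range layers {
			referenced[layer] = true
		}
	}
	// Sweep: delete unreferenced blobs, freeing space for future pushes.
	for digest := range blobs {
		if !referenced[digest] {
			delete(blobs, digest)
		}
	}
}

func main() {
	manifests := map[string][]string{
		"sha256:aaa": {"sha256:d655"},                // tagged image
		"sha256:bbb": {"sha256:d655", "sha256:eee"}, // untagged image
	}
	tagged := map[string]bool{"sha256:aaa": true}
	blobs := map[string][]byte{
		"sha256:aaa": {}, "sha256:bbb": {}, "sha256:d655": {}, "sha256:eee": {},
	}
	garbageCollect(manifests, tagged, blobs, true)
	fmt.Println(len(blobs), "blobs remain") // 2: the tagged manifest and its layer
}
```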
One question you might be asking yourself is: why is blob collection necessary as a separate process from manifest deletion itself? Shouldn't we be able to automatically delete any blobs that are no longer pointed to by manifests, right after we delete the manifests? Well, without getting into too many details, because this is the second-to-last slide and those details could constitute their own separate talk: the reason is that there may be reads, writes, and deletes happening simultaneously on a given registry, for a given image or set of related images. Because the distribution API doesn't make any guarantees around atomicity, the internal interfaces, particularly the storage driver, weren't designed with the possibility of simultaneous reads, writes, and deletes in mind. One resulting risk is that a delete by one user during another user's write could lead to data loss for the writing user: the writing user may see a blob as existing before they begin to push their layer, and decline to push it, assuming it will still be there, at the same time that another user is deleting that layer. Thus, in order to safely delete objects without incurring data races or losses that would lead to data integrity issues, we need to set a given registry to read-only mode before we can safely run garbage collection on it.

That's it for today; that's our talk about container registries. Hopefully you learned something from this, or maybe confirmed some things you already knew about registries. I think we do have a little bit of time for questions. Before we get there, I did want to say we've got stickers. They're on the table there if you want some, and we're also happy to chat after; I have more stickers in my backpack if we run out. They're really cute stickers, if you're not familiar with DigitalOcean stickers: little sharks in different outfits. But if we've got some questions online or in the room here, we can get that started. Yeah, the microphone is just in the middle.

Okay, so we have a question from the live stream: could Distribution be used locally on Kubernetes nodes to import images from a mounted distributed backend, and will that speed up the import of big images? Yeah, absolutely. You can run Distribution... can you repeat it as well? Oh yeah, sorry. The question was whether you can run Distribution inside a Kubernetes cluster to give you some locality for your images. The answer is yes, that's a really common use case for it. You can run Distribution pretty much anywhere, and like Wayne alluded to, there are lots of different storage backends. So you could store images on local disk, you could store them in local object storage, like in your own cloud or your own data center, and you can also use S3 or S3-compatible storage. So you can absolutely do that. Distribution also has a mirroring mode that's explicitly meant for that. We don't use that for our registry; we're using it in sort of the other mode, where it's the source of truth, but that option is there as well.

If you have to set the registry to read-only, how do you stop users from pushing? How does that affect your users, I guess? Sure, do you want to take this one? Well, that would differ; it depends on how your product is implemented. At DigitalOcean, we have a container registry product, and we use a bearer token, a JWT, for authentication, and they have an expiry set on them. So when a user schedules garbage collection, which happens through the DigitalOcean API, we mark the registry as read-only and no longer issue write-capable JWTs. The expiry on our JWTs is, I think, 15 or 16 minutes, so garbage collection is scheduled to begin once the last issued write-enabled JWT has expired. Thank you very much.
In generic Distribution, I think it actually goes into a different mode where it just doesn't accept any write requests, but since we are serving multiple users and we want them each to be able to garbage collect independently, we have to do some things with tokens to make that work.

Yeah, we can repeat your question. Yeah, about read-only mode from a user experience standpoint: if we push during it, would it just make us wait? Sure, so the question was: with this read-only mode for garbage collection, what's the user experience? The answer is that you get effectively an authentication failure, at least in our implementation of it. I think in the generic implementation, you get either an authentication failure or some other kind of HTTP error back if you try a write operation. So there's no backlog of requests or anything like that; it's on the user to retry.

Sure, so the question was: given that the client doesn't provide the digest until the end of the layer push, how do you handle potentially simultaneous uploads of the same layer? The answer is that it would just get uploaded twice, and the last one wins. At the end of those uploads, they're going to be moved into the same location in the backend storage, so the second one will end up overwriting the first one. But they have the same content, remember, because they have the same digest, so that's okay. There definitely are cases like that where you end up with a double push, if the same layer is pushed twice simultaneously. Does that answer the question? Yeah.

Next question: where are labels stored? What do you mean, like tags? Right, yeah, I believe the key-value labels in Docker become part of the manifest. I'm not 100% sure about that, though. Do you know, Wayne? Well, I don't know if I can answer that question specifically, but one thing we didn't cover here is what a manifest actually is: it is itself a JSON blob. It gets stored as a blob in the registry, and it contains all of the information about your image, not just the layers but different annotations and the time it was created. It's largely dependent on the runtime that built the image, so different runtimes may add different extra metadata to the manifest, but yeah, it's a JSON blob that gets stored similarly to layer blobs.

Are we using the repo digest or the image ID? So, the question is: when you push an image and you're using the digest, is it the repo digest or the image ID? I think the image ID is specific to your container runtime, your sort of local container client. What gets used for the upload is the digest of the manifest, so it is the hash of the manifest contents. Like Wayne said, the manifest is basically a JSON file; you take a hash of that JSON file, and that's the image digest.

Great, any other questions, or are we done? Oh yeah, go ahead. You talked about digests, and the most popular one is SHA-256; are others supported, or is it just that? Yeah, so the question is which digest algorithms are supported for the hash digest, and there are two officially supported by the spec: SHA-256 and SHA-512.
In theory, somebody could propose another one, like SHA-384; I think SHA-384 is mentioned in the docs if you read the spec itself, but it's not officially meant to be supported. And that actually gets into a tricky area, because you have to verify the digest when you receive the blob. If you're receiving a chunked blob upload, you don't have the digest for the blob until the last POST or PUT request, so what algorithm should you choose when calculating that digest over the course of several PATCH requests? The default is just SHA-256; that's in fact the only algorithm I've seen in use in practice whenever I've supported a customer who has a problem with a container registry. So that's the only one Distribution calculates by default. If, at the end of a chunked upload, the digest turned out to be SHA-512, then Distribution will go and download the entire blob from the backend and recalculate it with whatever algorithm was specified. But you could also, in theory, calculate all supported algorithms concurrently as you're receiving the chunked upload. Does that answer your question? I kind of got off on a tangent there, okay.

All right, I think we're a little over time, so we'll probably wrap up the questions now, but feel free to come and chat with us afterwards as well, and grab some stickers. Thanks everyone.