Thank you all for sticking around. It's the last day of the conference, so I appreciate you all showing up. We're going to talk about fast image pulls using IPFS and opportunistic caching. My name is Chris, I work at Gitpod, and with me on this stage should be Alejandro, who is actually in a tab in a Google Meet in this very browser right now, so we'll pull him in later for the Q&A session. We both work at Gitpod, and what we do is turn the hours and days it takes to set up a dev environment into seconds. The reason I'm bringing that up is that part of this dev environment is a Docker image that you can choose or that we can build for you. Just to show a quick, slightly sped-up demo: I was on GitHub, clicked the button, and then it's preparing the workspace, pulling the image. This phase, initializing content, is exactly the thing we're talking about, and then you get a ready-to-go dev environment. In this case it's VS Code, but it can be other IDEs, and the fact that it can be other IDEs will become relevant later in the talk. So that's what Gitpod does, and this is the context in which we're talking.

We pull a lot of different images. Over any period of seven days, we'll pull more than 10,000 distinct images, measured by their manifest, so they might have some layer overlap, but it's 10,000 distinct images. We cache more than a terabyte of image layers, and these images vary greatly in size, anywhere between 200 megabytes and several tens of gigabytes. And we have very little control over what images our users will bring, so we have to be very thrifty to make that fast.

Just to tell you a bit about where we're coming from and where we've been heading: about 12 months ago, when we embarked on a "let's make this faster" journey, the P95 workspace startup time was more than 10 minutes. Our Prometheus histogram just stopped at 10 minutes, but beyond that it doesn't matter; you're going to assume the system is broken anyway. So: more than 10 minutes. Today we've brought this down to 120 seconds. That is still a very long time to be waiting in front of a computer until you can work, so we'll keep working on bringing it down, but this is by far the largest reduction we've been able to achieve, and a good part of it is due to the caching mechanisms that we'll discuss in this talk. The P50 startup time more than halved as well: we're coming from about 24 seconds down to about 10 seconds today.

So what have we tried? Well, first of all, we tried nothing. Gitpod workspaces are essentially Kubernetes pods and then some, and the baseline was to just use the Kubernetes mechanisms that are out there, relying on the layer reuse that happens within the nodes. That is the 12-months-ago situation you just saw me refer to. Then we tried to pre-pull images: we put a DaemonSet in place that would pull well-known images. While we do pull a large number of different images, that is not a uniform distribution; it is very, very spiky. There are five to ten images that are used a lot, and then there are some that are used very seldom. So we tried to pre-pull those five to ten images: we basically looked at what was used last week, assumed that it was going to be used next week, and pre-pulled that.
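To make that pre-pulling approach a bit more concrete, here is a minimal sketch, assuming containerd as the runtime, of what each DaemonSet pod could run on its node. The image list, socket path, and the "k8s.io" namespace are assumptions for illustration, not Gitpod's actual pre-puller; in practice the list came from the previous week's usage data.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

func main() {
	// Illustrative image list; the real list was derived from last week's usage data.
	wellKnown := []string{
		"docker.io/gitpod/workspace-full:latest",
		"docker.io/library/ubuntu:22.04",
	}

	// Talk to the node-local containerd; the socket path assumes a standard install.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// Kubernetes-managed containerd keeps its images in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	for _, ref := range wellKnown {
		// WithPullUnpack also unpacks the layers into the snapshotter, so pods
		// scheduled on this node can start without pulling.
		if _, err := client.Pull(ctx, ref, containerd.WithPullUnpack); err != nil {
			log.Printf("pre-pull of %s failed: %v", ref, err)
			continue
		}
		log.Printf("pre-pulled %s", ref)
	}
}
```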
The problem with this approach is that, especially during rush hour, in the morning when a lot of people fire up their dev environments, nodes are scaling up too. And when the DaemonSet comes in, there are already a lot of image pulls happening on that node. So not only is the DaemonSet ineffective, it even exacerbates the problem, because it does pulls on top of the pulls that users are actually waiting for.

The next step was to pre-bake images into VM images. We run k3s on GCP with custom VM images, so we just pre-baked those commonly used images into the VM image, so that once the VM came up it wouldn't need to pull them. That also worked, and it worked considerably better than pre-pulling, specifically because pulling the VM image is faster than pulling a lot of individual, comparatively small tar files. The downside is that we would have to churn those VM images very often: for this to remain effective we would need to produce new VM images very, very frequently, because they would lose their effectiveness in about three days. So that just wasn't a viable path forward. Also, it helped a lot on the P50, but not on the P95. The P95 is caused by people bringing images that are not commonly used but large, and this approach specifically helps with images that are commonly used, because we just took the 10 or 15 most commonly used images and pre-baked those. So it helped a lot with the P50, not with the P95.

All of this led to looking into other means of speeding up image pull times, and we're not alone in this endeavor; there is a lot of community effort going into this. This is by no means a complete list, and it's not an authoritative taxonomy of ways to speed up image pulls, it's just the way we've been looking at it. One category is distributed file systems, where you try to distribute the layers, which are what's expensive in the end, across the different nodes, so you don't have to pull them from a registry that is potentially far away and slow to reach. Then there is a lot of work going into lazy pulling, most notably stargz, which really kicked a lot of this off; there are others, more recently Nydus, and Amazon presented the Seekable OCI project. All very exciting work, and some of the concepts you'll see in the coming slides build on it. And then there are peer-to-peer distribution mechanisms, notably Dragonfly, where the lines in this taxonomy blur a bit towards distributed file systems.

In Gitpod, we have a component that we call registry facade, and it is central to how we implemented this caching, so let's talk about that for a second. Earlier I mentioned that you can choose different IDEs to run within your workspace; it doesn't have to be VS Code, it can be JetBrains products, for example. So we need a bunch of layers to sit on top of what the user configured: the IDE layers, and also some Gitpod-specific tooling that we require within the same file system as your workspace. We could just build out those combinations: whenever we update our own images, or whenever there are new IDE images, we could multiply that out with the images that users have configured. But this would lead to an explosion, quite literally, of images that we need to take care of.
It would also not provide a very good user experience, because whenever we update one of our images, and we deploy anywhere between one and 15 to 20 times a day, you'd have to rebuild every single time. So instead, registry facade is what workspace pods actually pull from. The image that is used is not pulled directly from some remote registry; instead it points to an instance of registry facade, usually running on the same node, and registry facade dynamically assembles the manifest to be pulled. As part of this dynamic assembly, and this is where IPFS comes in, we can add caching mechanisms to speed up the subsequent pulls of the individual blobs. The manifest generation itself, this stacking and production of a manifest, also takes time, so we want to cache that as well.

I'm broadly going to assume you all know what IPFS is. I'm by no means an expert on it, so I won't claim to be. Broadly speaking, IPFS is a peer-to-peer distributed file system, and for the context of this talk that's a good enough description. There is much, much more that goes into it, and I want to acknowledge that depth without needing to go into it.

So we embarked on the journey of trying to speed up image pulls, and what we're going to show here is work that was done specifically around April to May earlier this year. Some of the statements I make about other projects might no longer be true, and if any of the maintainers of those projects are in the room, I very much welcome them to speak up in the Q&A session and correct what I'm saying.

The very first thing we did was look into nerdctl's IPFS registry. What the nerdctl IPFS registry does is run a registry, very much like our own registry facade, for example on the same node, that you can pull from. The way you pull from it is by addressing the image with an IPFS content ID, a CID. So instead of pulling from GCR or some other registry, you would pull from localhost:5050, if that's where your nerdctl IPFS registry runs, followed by the unwieldy content ID; that would point to a manifest, and it would pull from there. In a Kubernetes deployment you would then use something like that as the image reference. This got our feet wet with IPFS, but it was not the solution to our problem. First off, you need to know the CID, and the way you usually get it is with nerdctl's IPFS push, which pushes into that registry and gives you the CID you can later use to reference the image. And at the time, there was no content distribution across nodes, because the example did not incorporate IPFS Cluster. This has since been fixed, so thank you very much to the maintainers who added that to the examples. But it showed us that there is a way to distribute all those blobs using IPFS.

So we started introducing this with registry facade, and the first attempt was to use the stargz snapshotter. The stargz snapshotter has a feature where, if the OCI descriptor of a blob contains an IPFS URL, it will pull from that, which is something that, say, your regular overlayfs snapshotter cannot do.
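Stepping back to the dynamic assembly mentioned a moment ago, here is a minimal sketch of what "dynamically assemble the manifest" can look like, using the Go OCI image-spec types. The layer descriptor and its digest are placeholders, and a real implementation such as registry facade would also have to rewrite the image config (rootfs.diff_ids, history) and its descriptor so the result stays a valid image; this only shows the manifest side of the idea.

```go
package main

import (
	"fmt"

	"github.com/opencontainers/go-digest"
	ocispec "github.com/opencontainers/image-spec/specs-go/v1"
)

// ideLayer stands in for one of the layers Gitpod stacks on top of the user's
// image (IDE, Gitpod tooling). The digest is a placeholder computed from a
// string, not a real layer.
var ideLayer = ocispec.Descriptor{
	MediaType: ocispec.MediaTypeImageLayerGzip,
	Digest:    digest.FromString("placeholder IDE layer"),
	Size:      123456,
}

// assembleManifest appends extra layers to the manifest of the user-configured
// image. A valid image would also need its config and config descriptor
// rewritten, which this sketch omits.
func assembleManifest(user ocispec.Manifest, extra ...ocispec.Descriptor) ocispec.Manifest {
	out := user
	out.Layers = append(append([]ocispec.Descriptor{}, user.Layers...), extra...)
	return out
}

func main() {
	// userManifest would normally be fetched from the upstream registry.
	userManifest := ocispec.Manifest{
		MediaType: ocispec.MediaTypeImageManifest,
		Config:    ocispec.Descriptor{MediaType: ocispec.MediaTypeImageConfig},
		Layers:    []ocispec.Descriptor{{MediaType: ocispec.MediaTypeImageLayerGzip}},
	}
	merged := assembleManifest(userManifest, ideLayer)
	fmt.Println("layers in the assembled manifest:", len(merged.Layers))
}
```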
So the idea of that first, stargz-based attempt was that we would use registry facade, and when a request comes in, we take the SHA256 digest of the blob and look it up in Redis to translate it into an IPFS content ID. If that blob was not present, registry facade would take the blob, put it into IPFS while streaming it to the consumer as it was being downloaded, and then map that SHA256 digest to the resulting CID. This is the opportunistic caching part: we start caching a blob the moment it is used, if we haven't cached it before. With that translated CID we could make the corresponding entry in the OCI descriptor and have the stargz snapshotter pull the blob from IPFS. At the time, there was no content distribution across nodes, and there were limitations that meant we could not use IPFS Cluster. Pulling non-stargz images also failed, so we would have needed to convert all images to stargz in order to use this. That is something we could have done, and the lazy-pulling aspect of it is quite interesting to look at, but it is not the path we wanted to go down; it would also have required a considerable amount of extra compute given the number of images we need to handle.

The next step was to bring this one step closer and basically replace the stargz snapshotter with registry facade itself. The way that looks is: when a GET request for the manifest comes in, we check whether the individual blobs already exist, much like before, and if they don't, we upload them into the IPFS cluster. The key difference is that we also added IPFS Cluster. We have a bootstrap node that exists as a deployment within the Kubernetes cluster, and on each workspace node, the nodes the workspaces actually run on, we have registry facade and an IPFS node running. So all of this is node-local, ideally, and IPFS distributes the individual blob content to those workspace nodes where it will be needed, so we don't have to pull across nodes a lot. That was the idea, and it really did help improve image pull times. However, the inter-zone traffic it caused was very expensive. We would scale anywhere from close to five up to 50 or 100 nodes on a given day and back down again, so there would be a lot of content redistribution happening within the IPFS cluster, and at this point there was no awareness of availability zones, so we would be paying cross-zone traffic for the distribution of content that IPFS was doing. On top of that, the IPFS nodes lived on the workspace nodes, where the actual workloads happen, and those scale up and down considerably throughout the day. As they scaled down, we would lose content we had previously cached, so our cache miss rate would go up as the cluster scaled down, and the next day we'd essentially start from zero because we would have lost most of what was cached the day before.

So the fourth iteration, which is what we operate today, makes the whole thing zone-aware. Instead of running the IPFS node and cluster proxy on the same nodes as the workloads, we run them in a separate setup. At this point we operate in three availability zones, so we quite literally have three IPFS nodes running, which for our scale currently works.
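Before getting to how those zonal IPFS nodes are wired together, here is a minimal sketch of the lookup-and-cache path described above, assuming go-redis and the go-ipfs-api client. The key naming, ports, upstream URL, and TTL are assumptions for illustration, and unlike registry facade this sketch spools the blob to a temporary file instead of streaming into IPFS concurrently with the download.

```go
package main

import (
	"io"
	"net/http"
	"os"
	"strings"
	"time"

	shell "github.com/ipfs/go-ipfs-api"
	"github.com/redis/go-redis/v9"
)

// blobCache wires together the pieces this sketch needs; field names are illustrative.
type blobCache struct {
	rdb      *redis.Client
	ipfs     *shell.Shell
	upstream string // base URL of the upstream registry we proxy to on a miss
}

// serveBlob handles a GET for a single blob digest of the given repository.
func (c *blobCache) serveBlob(w http.ResponseWriter, r *http.Request, repo, dgst string) {
	ctx := r.Context()

	// Fast path: a digest->CID mapping in Redis means the blob is already in IPFS.
	if cid, err := c.rdb.Get(ctx, "blob:"+dgst).Result(); err == nil {
		if rc, err := c.ipfs.Cat("/ipfs/" + cid); err == nil {
			defer rc.Close()
			io.Copy(w, rc)
			return
		}
	}

	// Cache miss: stream the blob from the upstream registry to the client,
	// spool it to a temp file, and add it to IPFS afterwards.
	resp, err := http.Get(c.upstream + "/v2/" + repo + "/blobs/" + dgst)
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		http.Error(w, "upstream returned "+resp.Status, http.StatusBadGateway)
		return
	}

	tmp, err := os.CreateTemp("", "blob-*")
	if err != nil {
		io.Copy(w, resp.Body) // no caching this time, but still serve the client
		return
	}
	defer os.Remove(tmp.Name())
	defer tmp.Close()

	if _, err := io.Copy(io.MultiWriter(w, tmp), resp.Body); err != nil {
		return
	}
	if _, err := tmp.Seek(0, io.SeekStart); err != nil {
		return
	}
	if cid, err := c.ipfs.Add(tmp); err == nil {
		// Remember the mapping; the TTL stands in for the time-based eviction
		// mentioned towards the end of the talk.
		c.rdb.Set(ctx, "blob:"+dgst, cid, 7*24*time.Hour)
	}
}

func main() {
	cache := &blobCache{
		rdb:      redis.NewClient(&redis.Options{Addr: "localhost:6379"}),
		ipfs:     shell.NewShell("localhost:5001"), // HTTP API of the zone-local IPFS node
		upstream: "https://registry.example.com",   // placeholder upstream registry
	}
	http.HandleFunc("/v2/", func(w http.ResponseWriter, r *http.Request) {
		// Very naive routing: only GETs of the form /v2/<repo>/blobs/sha256:<hex>.
		if i := strings.Index(r.URL.Path, "/blobs/sha256:"); r.Method == http.MethodGet && i > 0 {
			repo := strings.TrimPrefix(r.URL.Path[:i], "/v2/")
			cache.serveBlob(w, r, repo, r.URL.Path[i+len("/blobs/"):])
			return
		}
		http.NotFound(w, r)
	})
	http.ListenAndServe(":8080", nil)
}
```

The assumption that Redis and IPFS stay in parity is the same one the talk relies on: the Redis GET is the cheap cache-presence check, and the blob itself lives in IPFS.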
Those three IPFS nodes come together again using a bootstrap node, and we use topology-aware hints on the IPFS Cluster service to route the requests that a registry facade makes to the node nearest to it. We're still paying inter-zone traffic for the redistribution of content within IPFS, but it is far fewer nodes, and those nodes don't scale up and down nearly as much as they used to, so a lot of that traffic cost problem is gone. And because we use topology-aware hints for the traffic from the individual registry facade instances to the IPFS Cluster nodes, that traffic cost is considerably reduced as well.

So what does that look like in production? What does it actually do? When we look at download speeds, in the top graph the green line is the download happening from IPFS; this is cumulative, so it's not just one node. You'll also notice the yellow line in the top graph, if you can see it; I took that out and put it in the lower graph, so these are the exact same graphs, just one without the green line. What you'll notice is a step change, an order-of-magnitude change, in download speeds. The other thing to look at is the download request rate, i.e. what we are actually pulling from: is registry facade proxying to an upstream registry, or is it pulling from its own IPFS cache? What you can see here is, for one, a peaking behavior that corresponds to the scale and load the system sees throughout the day; and again, the yellow line is proxying to an upstream registry and the green line is pulling from IPFS. Zoomed in, you can see there are some cache misses, which is when we proxy, those spikes on the yellow line, but the vast majority nowadays comes through IPFS, to the tune of 90%. To give you a rough idea of the traffic: pulling from these IPFS nodes amounts to about 2.3 terabytes a day, compared to proxying to upstream registries, which is about 100 gigabytes a day. So again, an order-of-magnitude change in where we pull from.

In conclusion, we gained a considerable startup time improvement, specifically through the reduction of bandwidth requirements towards upstream registries and by bringing that data much closer to where it's needed, in a very opportunistic fashion. We do not actively pre-warm caches; we just rely on what users pull and add that to the cache, assuming it will be needed again. There is a time-based eviction mechanism that I have not spoken about: essentially, if a blob is not pulled for a given amount of time, it is thrown out of the cache again, because otherwise we'd be spending too much money on the storage. A lot of this work is heavily inspired by what the community does; we stand on the shoulders of giants, and I want to recognize that. I'm very grateful for all the work the community is putting into this topic. And the ideas presented here, although reasonably specific to the architectural choices we made within Gitpod, are transferable: the idea of adding a pull-through proxy, which you could quite literally make an HTTP proxy that you add to your container runtime config, or something similar to how nerdctl's registry mechanism works, and then add caching mechanisms tuned to your own setup in a very similar fashion. And with that, thank you all for your attention; happy to take questions.
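For anyone who wants to experiment with that transferable idea without any of the Gitpod-specific machinery, here is a hedged sketch of a pull-through proxy that caches blobs, which are content-addressed and therefore safe to cache, on local disk and proxies everything else. The upstream URL, port, and cache directory are placeholders; a real deployment would also need TLS, registry authentication, digest verification, and eviction.

```go
package main

import (
	"io"
	"net/http"
	"os"
	"path/filepath"
	"strings"
)

// Placeholders: point these at the registry you actually pull from and a
// suitable cache location.
const upstream = "https://registry.example.com"
const cacheDir = "/var/cache/blob-proxy"

func main() {
	os.MkdirAll(cacheDir, 0o755)
	http.HandleFunc("/v2/", handle)
	http.ListenAndServe(":5000", nil)
}

func handle(w http.ResponseWriter, r *http.Request) {
	// Only blob GETs are content-addressed; manifests and tags are proxied through.
	i := strings.Index(r.URL.Path, "/blobs/sha256:")
	if r.Method != http.MethodGet || i < 0 {
		proxyThrough(w, r)
		return
	}
	digest := r.URL.Path[i+len("/blobs/"):]
	cached := filepath.Join(cacheDir, digest)

	// Cache hit: serve from local disk.
	if f, err := os.Open(cached); err == nil {
		defer f.Close()
		io.Copy(w, f)
		return
	}

	// Cache miss: fetch from upstream, stream to the client and to disk at once.
	resp, err := http.Get(upstream + r.URL.Path)
	if err != nil || resp.StatusCode != http.StatusOK {
		http.Error(w, "upstream fetch failed", http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	tmp, err := os.CreateTemp(cacheDir, "dl-*")
	if err != nil {
		io.Copy(w, resp.Body)
		return
	}
	if _, err := io.Copy(io.MultiWriter(w, tmp), resp.Body); err == nil {
		tmp.Close()
		os.Rename(tmp.Name(), cached) // only keep complete downloads
	} else {
		tmp.Close()
		os.Remove(tmp.Name())
	}
}

func proxyThrough(w http.ResponseWriter, r *http.Request) {
	resp, err := http.Get(upstream + r.URL.RequestURI())
	if err != nil {
		http.Error(w, err.Error(), http.StatusBadGateway)
		return
	}
	defer resp.Body.Close()
	for k, v := range resp.Header {
		w.Header()[k] = v
	}
	w.WriteHeader(resp.StatusCode)
	io.Copy(w, resp.Body)
}
```

You would then point your container runtime's registry mirror configuration at this proxy, in the spirit of the HTTP-proxy option mentioned above, and swap the disk cache for whatever storage fits your setup.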
For the questions, I will put my co-speaker on the screen and unmute him. Let's hope the audio is working. Say hi to Alejandro, who unfortunately cannot see you, but...

As someone who uses a cluster that isn't at this scale, is there still benefit to looking into this IPFS style of pulling, or is it something you'd only use when you're dealing with multiple terabytes?

I think you'll see the main benefit not so much from the amount of data, but rather from the variance of the data. If you keep pulling the same image over and over again, the layer reuse that happens on the node will likely be sufficient. If there is a large variety of images that you need to pull, over which you have no control, then something like this will be useful. The moment you can reasonably predict what you're going to pull, chances are you'll find more efficient ways of pre-warming caches.

Hi, great talk, by the way. Thank you. One of the aspects you mentioned this helped with was reducing network bandwidth, or reducing load on the actual registry, because you're mostly pulling from IPFS. Did it have a latency impact for cases where you missed the cache, because then you essentially have two hops?

I'm not 100% sure I fully understood the question; let me try to answer it and then you tell me if I did. What I understood is: what happens in the case of a cache miss, and what's the cost of that? Yeah, pretty much. Okay. The cost of a cache miss is essentially pulling from the upstream registry, and detecting the miss is very cheap, because we assume that what's in Redis and what's in IPFS are on parity, so the check whether something is in the cache is essentially a Redis GET, which is a very fast operation. If it's not there, all we do is pull from the upstream registry and add it to the cache in the process.

I saw that you have written your own shim in order to either go to your cache or go to the source. Was there any consideration of layering that down into containerd or CRI-O, so everybody can use this without needing a software shim?

That's a great question. We ourselves had not considered doing that, predominantly because of the deployment modes we see at Gitpod, where adoption of new upstream containerd versions is very slow, so we have not actively engaged in this. However, if there are containerd maintainers out there who'd be interested in collaborating on something like that, that would be awesome.

Hi, maybe I'm missing a basic concept: what is the unit of download here? Are you expanding the layer into IPFS, or are you pulling down the layer from IPFS?

Both, really. If it's not present in IPFS, we take that tar file and just push it into IPFS, so the next time we need it we can pull it from there.

The layer or the files, sorry? We take the tar file as is; we don't extract it in any form. So you're not expanding it; when the pod wants to read a file that's in a layer, you're not providing the file, you're providing the layer, and the overlay deals with that? That's right. There is no lazy loading that we add here. If you wanted something like that, and your images are all stargz-ified, so to speak, or built for Seekable OCI, you could do it; those two things are orthogonal, so you could combine the two. We just opted not to. Cool, thank you.

First of all, thanks for a great talk.
I wanted to ask whether the code you're using is open source. Is that available, or just the talk?

Let me rephrase the question to make sure I understood correctly: the question is essentially whether what we just showed is open source. It is. All the code of what I just presented can be found in our repo, github.com/gitpod-io/gitpod, in the registry facade component; there you'll find the code I spoke of. It's AGPL licensed.

How does this compare with running a registry in your organization and having upstream replication rules, maybe backed by S3? That's how we do it, we have Harbor, so I'm kind of curious.

Yeah, that's a great question. The upstream registry we pull from in the cache-miss case is also one we control; it's essentially Google Artifact Registry. And we have still found that by caching very close to the node, we see considerable speed improvements. So I would venture, and this is speculation at this point, that most of the benefit we see here comes from the locality rather than from the replication itself. If you could build your registry in such a way that it brings the data closer to the nodes, i.e. increases the bandwidth of those pulls, then you would probably see comparable results. I think the key mechanism here is exactly increasing that bandwidth, so whatever means you have of doing that will probably achieve similar things.

So the idea of caching helps when the same images are pulled multiple times on a given node. For newly provisioned nodes, have you looked into using network-attached storage to host all the images on a volume? I know overlayfs doesn't work with NFS, but have you explored options like that?

We have not, explicitly. The main reason is that we did not want to dive too deep into how the nodes themselves operate. Again, this comes back to Gitpod's deployment models and the degree of control we have over the infrastructure we run on. We did not want to assume, for example, that we can meddle with containerd's content store or something to that effect.

Great, I had a question about authentication. Since you're sitting in front of the registry and potentially serving images, how are you doing any auth, when the authorization would be at the registry, which is kind of beyond you?

There is a policy on what registry facade is allowed to pull. In the end, registry facade itself is just authenticated against a single external registry, and that's where it will always pull from. So it's not authentication-transparent, if that makes sense, which is really an artifact of how registry facade is used within the context of Gitpod. I'm certain that this could be expanded further. I do remember when Docker introduced limits on manifest pulls, we saw a bunch of pull-through proxies popping up, notably from Alex Ellis, for example, who wrote one and basically built in a transparent authentication pass-through. We just opted not to build that, given our architectural setup, but conceptually it would well be possible.

Yeah, pretty cool, thanks for introducing this. Do you benefit from the IPFS peer-to-peer distribution? Like, if one of your nodes contacts the IPFS node in its AZ and the content isn't there, does it pull from the other nodes?
If it's in another AZ, or does it go to the upstream? Is that automatic, or did you have to write a lot of code for it?

That is a great question which I actually cannot answer; I would like to defer it to Alejandro. I'll repeat it for him: the question is, what happens if there is a cache miss within an availability zone? Will that IPFS node talk to another IPFS node, or is it just a cache miss in the end?

In the end, it's just a cache miss. We don't implement any miss handling within IPFS itself.

Not sure that came through, so I'll try to repeat it: it's just a cache miss.

On the same lines, you have three different bootstrap nodes, right, one for each availability zone? Can you talk more about the reliability of the IPFS system itself? What if those nodes go down? How is that monitored and maintained?

I did not catch that, audio-wise. The bootstrap nodes that you have, there is one for each availability zone where the nodes contact the IPFS nodes. What is the reliability of that IPFS node? What if it goes down, who maintains it, and how would you take care of that?

So to rephrase the question, to make sure I understood correctly: what if an availability zone goes down, what's the impact of something like that, and how likely is it? The impact would be a loss of caching in that particular zone, which would be a degradation of service, and we would treat it as such. At this point, we have not seen that happen. The nodes cycle regularly, and we just have them re-initialize with the other nodes. So if one goes down, for some time there just isn't caching available in that zone, and when a new one comes up, it uses the IPFS bootstrap node to make itself part of the IPFS cluster and gets the data replicated from the other nodes.

Hi, just a quick question about the dynamic assembly of images that you discussed a bit earlier in the talk. That was blowing my mind a little, dynamically creating manifests out of different layers of images. My question is: if you're dynamically assembling manifests each time, because some of your Gitpod-specific or IDE layers are changing, does that mean that somebody deploying manifests to the cluster can still pull by digest, or is pull-by-digest out of the window because you don't know what manifest you're really getting? How does that work?

That's right, identifying the image by digest goes out of the window, because you don't know what the content behind that digest will be. You basically have to pull by tag, and in fact we use the tag itself to decide on the content of that particular manifest.

All right, we're also smack on time. Again, thank you all for showing up on this Friday morning, and thank you for your attention.