All right, let's get started. My name is Tao He. I'm a software engineer at Google. I lead teams that improve the startup latency for machine learning workloads, and I lead teams that provide the GPU runtime in GKE. Today we are going to talk about reducing AI job cold start time from 15 minutes to one minute.

First we will talk about the cold start problem. The definition is here: a cold start happens when you schedule a pod onto a node and the node doesn't have that container image. During this process, the container runtime pulls the image from a remote container registry, then downloads and unpacks that container image onto the local disk. That consumes CPU, networking, and storage, and it takes time to finish.

This graph shows, at a high level, what happens when you schedule a pod. Each step requires some resource. On the infra side, for example, you may need a node pool in GKE or an instance group in AWS, and when you don't have enough nodes of the same machine type, it will trigger autoscaling to get enough CPU, GPU, or memory. When it comes to the container runtime, the workload part: if the container image does not exist there, it will trigger the image pull, which downloads the image to the local disk and unpacks it, and then the container can start from that image. You also need the AI model to be present to start the machine learning workload.

In this talk, we will focus on the container image pull, because that is the most time-consuming part. When you have something like a 30-gigabyte container image, the image pull can take up to 15 minutes, which is a very long time to start an application. The problem gets even worse when you're running a training job, because if you hit some error, you need to do error recovery from the last checkpoint. At that point, you also need to schedule a group of pods onto new nodes, and that is delayed further by the cold start latency.

First, let's do a deep dive into why the image pull is so slow for AI workloads. It is because AI container images are very large, usually larger than 4 gigabytes. And why are those AI container images so large? Because the base layers are very large. And why are those base layers so large? Because the AI and machine learning GPU libraries inside those layers are very large.

Then comes the next question: why not place the libraries on the host and make them part of the disk image? It is because the application or framework has a version dependency on the library. Each version of TensorFlow or PyTorch has a specific version dependency on CUDA. Containers actually solve this dependency problem by putting all the libraries inside the container, so you can run two different pods side by side on the same node using different libraries.

The next question is: can users reduce the size of the libraries? Actually, no. Not all of those libraries are open sourced, and building them may not be customizable for size. And those libraries are required, because they are a hard dependency for using GPUs. If you take a look at the right side, this is a node. Inside the node there is a pod, inside the pod there is a container, and inside the container there are multiple layers. From our investigation, the CUDA libraries in those layers are extremely large, each of them around 4 gigabytes. But the GPU driver on the host is relatively small: it's only 22 megabytes.
And the driver kernel module is 51 megabytes, and that was even before we open-sourced the module; now it's even smaller. But the libraries are very large.

So we did an investigation to see whether this applies to all the containers. We are using this tool called dive. It's a great tool: you can see each layer, which files are in each layer, and their sizes. All the pre-built machine learning containers from different providers share the same pattern: the CUDA, cuDNN, or PyTorch layers are all much larger than in applications that don't use GPUs. So that's the problem.

Here are the requirements for our solutions. We need solutions that provide speed, scalability, and cost efficiency. For speed, we need to utilize the higher throughput provided by the disk or the networking. For scalability, we want a solution that applies to larger container images; ideally, even when the container size increases, the image pull can be skipped or the image pull time stays constant. Moving on to the third point: it is quite possible that all the nodes in the cluster pull the same image at the same time and use up the egress networking limit of the container registry, so we want a solution that also solves this problem. And finally, cost efficiency: no extra cost on the bill. Technically, you could solve all of these problems by pre-provisioning all the nodes ahead of time, downloading everything you need, and keeping them on standby. All problems solved by spending extra budget. But that is not practical; you still want to save money on training workloads.

Here is the design space. Given the background of large container images, a large number of nodes in the cluster, and the goal of avoiding repeated image pulls or reducing the image pull latency on each node, the design space has two categories. The first is the cluster-wide solution. It's quite straightforward: you put those images in a mirror or a P2P distribution layer inside the cluster, much closer to the nodes, and then when the nodes do the image pull, it's faster. But the focus of this talk is the per-node solution inside the container runtime. Because when you look at what happens today, the same container images are downloaded to each node repeatedly, and the first impression is: why do I need to repeat that again and again? That's why we took a look inside the container runtime, especially inside containerd.

Here come our solutions. The first solution is preloading container images through an additional disk. The goal is to completely skip the container image pull and get a faster cold start. There are three steps. Step one, you build a disk image with all the preloaded containers inside it. Step two, you create the node pool or instance group from this disk image, and when the nodes are created, they come with the containers preloaded. Step three, when you schedule pods as a user, you don't need to change anything; the container image pull is automatically skipped and you get the faster cold start.

I will take a closer look into each step. Step one is downloading and unpacking the containers and creating a disk image. If you don't do this, this step will be done on each node, and it's exactly the same process on every node. That's why we do it ahead of time and prepare it for all the nodes: do it only once.
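As a rough illustration of this ahead-of-time step (not the actual GKE tooling), here is a minimal Go sketch that pulls and unpacks a few large base images once through containerd. It assumes a build VM where containerd's root directory lives on the secondary disk that will later become the disk image; the socket path, namespace, and image references are placeholders.

```go
package main

import (
	"context"
	"log"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

// Sketch of solution 1, step one: pull and unpack the large base images once,
// on a build VM whose containerd root lives on the secondary disk. That disk
// is later turned into a disk image and attached to every node in the pool.
func main() {
	// Assumes containerd on this build VM was started with --root/--state
	// under the secondary disk, so everything pulled here lands on it.
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatalf("connect to containerd: %v", err)
	}
	defer client.Close()

	// Kubernetes keeps its images in the "k8s.io" namespace.
	ctx := namespaces.WithNamespace(context.Background(), "k8s.io")

	// Illustrative list: the large shared base images (CUDA, framework) that
	// every training or serving container on the node pool will reuse.
	images := []string{
		"nvcr.io/nvidia/cuda:12.2.0-runtime-ubuntu22.04",
		"nvcr.io/nvidia/pytorch:24.01-py3",
	}
	for _, ref := range images {
		// Pull and unpack into the snapshotter, exactly the work each node
		// would otherwise repeat at cold start.
		img, err := client.Pull(ctx, ref, containerd.WithPullUnpack)
		if err != nil {
			log.Fatalf("pull %s: %v", ref, err)
		}
		log.Printf("preloaded %s (%s)", img.Name(), img.Target().Digest)
	}
}
```

The point is simply that the expensive pull-and-unpack work runs once on the build machine instead of on every node.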
Step two, when you create the nodes, you use this disk image as input. As you can see, the node comes with an additional disk, and this disk comes with the preloaded containers. Out of the box, those containers are ready to be used. In step three, we modify containerd by plugging in a snapshotter. This snapshotter is able to read from both disks: one is the boot disk that was there before, and the other is the disk with the preloaded containers. This is our solution. With this approach, the container runtime is able to read from multiple disks, read the container cache from them, and save latency at startup time.

Another highlight here is that containerd reuses the cache by layers. So if containers share the same base layers, those containers benefit in pod start latency. You can put just the base containers, like the library container or the framework container, onto this additional disk, and it will benefit all the applications using the shared libraries, as long as you don't change those base layers. So you can make minor changes to the application, for example changing some parameter or some hyperparameter, and still benefit from this approach.

About the results: these are projected results. The red line is what it was before; the blue line is this solution. The comparison is the latency from image pull to pod ready. You can see our solution is more scalable with respect to container image size: even when the image size grows, it still takes constant time to get the pod ready. The key insight behind this is that when you create a disk from the disk image, it actually reuses the same data blocks underneath. It's distributed storage, and it does not require a data copy per disk creation; when you create a disk, it's just a pointer to the underlying data blocks. And by the way, the data replication scales up when you increase the usage on the disk, not by creating more disks. That's the key insight for this approach, and that's the first approach.

Then we come to the second solution. It is motivated by our investigation into containerd; we did a lot of benchmarking and observation inside containerd. There are two steps, two phases, in containerd. The first is fetching the container image. A container image is stored layer by layer as compressed packages in the container registry, and the download in containerd is parallel: it downloads all the layers at the same time and puts them on disk. Phase two is unpacking each layer and creating the snapshot, which becomes the file system inside the container. The unpacking part is single-threaded in containerd, and so is the decompression. This is very important.

The question is: which step in the image pull is the bottleneck? The answer is that unpacking is the bottleneck, especially for large containers with more layers. You can imagine you have 30 gigabytes: downloading and unpacking requires networking, CPU, and disk to create the container's file system, all the files and all the folders. Question two is: is it limited by CPU, networking, or storage? From our investigation, in most cases it's limited by slow disk operations. That's because almost all cloud disk implementations are based on block storage technology, which is actually a distributed storage system: it requires networking, and you provision the throughput for that disk. The default throughput limit is very low, and even if you increase it, the image pull may not benefit.
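To make the fetch/unpack split concrete, one way to see it for yourself (an illustrative sketch, not the exact benchmark we ran) is to drive the two phases separately through containerd's Go client and time each. The socket path, namespace, and image reference below are placeholders.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/containerd/containerd"
	"github.com/containerd/containerd/namespaces"
)

// Rough way to confirm the fetch-vs-unpack split: download the layers first,
// then unpack them into a snapshot, timing each phase independently.
func main() {
	client, err := containerd.New("/run/containerd/containerd.sock")
	if err != nil {
		log.Fatalf("connect to containerd: %v", err)
	}
	defer client.Close()
	ctx := namespaces.WithNamespace(context.Background(), "default")

	ref := "nvcr.io/nvidia/pytorch:24.01-py3" // placeholder large AI image

	// Phase 1: fetch. Layers are downloaded in parallel into the content store.
	start := time.Now()
	img, err := client.Fetch(ctx, ref)
	if err != nil {
		log.Fatalf("fetch: %v", err)
	}
	log.Printf("fetch took %s", time.Since(start))

	// Phase 2: unpack. Layers are decompressed and applied one by one into
	// snapshots; this is the single-threaded, disk-bound part.
	start = time.Now()
	if err := containerd.NewImage(client, img).Unpack(ctx, "overlayfs"); err != nil {
		log.Fatalf("unpack: %v", err)
	}
	log.Printf("unpack took %s", time.Since(start))
}
```

On large AI images with many layers, the unpack phase is typically where most of the time goes.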
Coming back to block storage: it can handle many parallel operations, but each operation has relatively higher latency than a local disk, even though it offers high IOPS. The containerd unpacking implementation, however, does not benefit from that high IOPS. To utilize everything the block storage supports, you need a very deep I/O queue depth for the I/O system calls in order to reach the maximum throughput limit provisioned for your disk.

Our solution is summarized on this slide. Currently, containerd downloads everything in parallel but unpacks layer by layer, because there is a dependency between layers: some files overlap between layers, and you need to overwrite or remove them layer by layer, so it's possible that a later layer deletes files from a previous layer. That's why containerd unpacks in sequential order. It also needs to support a lot of file system types, which is another reason. Our proposal is to unpack each layer into a different folder in parallel, and then build the snapshot using lightweight file system operations like move, rename, or creating a mount. As you can see, when you parallelize the unpacking jobs, you get latency benefits, and the improvement is more obvious when you have larger containers or containers with more layers, because with more layers you can run more parallel unpacking jobs. Typically, the containers for AI and machine learning workloads have over 30 image layers.

For the projected results here, the red line is still the previous approach and the blue line is the proposed solution. When you upgrade your disk throughput in GKE, by upgrading the disk size and paying more money, you get more benefit from the upgrade with the proposed solution. The previous approach does not, because of the single-threaded unpacking. With this approach, from our benchmark, the image pull for a 6-gigabyte container can be reduced from over 230 seconds to 40 seconds. That is solution 2.

The last solution is a minor one, and it's already merged into Kubernetes: we implemented maximum parallel image pulls across different pods. With this parameter, we can control and enable parallel image pulls between different pods running on the same node, which can also improve the startup latency for multiple pods on the same node. And that's the end of this presentation. If you have any questions, you can ask at the mic.

Hi. Great talk, really good performance improvement. For solution number 2, unpacking in parallel, you said you built a snapshotter. Is that snapshotter available for anyone using containerd?

Solution 2 doesn't have a snapshotter. It is improving the core of containerd; it's part of containerd.

So containerd by default will do this?

It is a proposal in containerd, and it's not yet implemented.

OK, cool. Thanks.

My question is also a follow-up. For solution 1, where you modify containerd to create the cache layers from an attached disk as opposed to a local disk, is that change also merged into containerd, or is it yet to be implemented?

It is very specific to each cloud. We can share the idea, but the implementation is inside GKE, and it depends on the implementation of the attached disk in Google Cloud.

OK, is it turned on by default on GKE now?

Not yet. We are going to launch it later this year or next year.

OK, thank you.
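(As an illustrative aside on solution 2: the parallel-unpack idea can be sketched roughly as below. This is not containerd's actual code; the tar invocation is just a stand-in for containerd's own unpacker, and whiteout handling and strict layer ordering are deliberately ignored.)

```go
package main

import (
	"context"
	"fmt"
	"log"
	"os"
	"os/exec"
	"path/filepath"

	"golang.org/x/sync/errgroup"
)

// unpackLayersInParallel decompresses every layer tarball into its own
// directory concurrently. Real containerd must respect whiteout files and
// layer ordering when applying layers; this sketch only illustrates the
// parallel decompression step.
func unpackLayersInParallel(ctx context.Context, layerTarballs []string, workDir string) ([]string, error) {
	g, ctx := errgroup.WithContext(ctx)
	dirs := make([]string, len(layerTarballs))

	for i, tarball := range layerTarballs {
		i, tarball := i, tarball // capture loop variables for the goroutine
		dirs[i] = filepath.Join(workDir, fmt.Sprintf("layer-%d", i))
		g.Go(func() error {
			if err := os.MkdirAll(dirs[i], 0o755); err != nil {
				return err
			}
			// Decompression plus file creation is the CPU- and disk-heavy
			// part; running it per layer keeps the block-storage I/O queue
			// deep enough to approach the provisioned throughput limit.
			return exec.CommandContext(ctx, "tar", "-xzf", tarball, "-C", dirs[i]).Run()
		})
	}
	if err := g.Wait(); err != nil {
		return nil, err
	}
	// The per-layer directories would then be stitched into the final
	// snapshot with cheap operations (renames, overlayfs lowerdirs) rather
	// than re-applying the data layer by layer.
	return dirs, nil
}

func main() {
	// Usage (illustrative): parallel-unpack layer0.tar.gz layer1.tar.gz ...
	dirs, err := unpackLayersInParallel(context.Background(), os.Args[1:], "/tmp/unpack")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("unpacked into:", dirs)
}
```

A deep queue of concurrent writes is exactly what keeps a provisioned block-storage disk busy enough to approach its throughput limit.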
Are you planning to contribute the change to other hyperscalers or to the containerd community, or are there no plans for that?

We will open source the image-building part, but we don't have a plan to open source the snapshotter, because it only applies to Google's approach.

OK, thank you.

Hello. For solution 2, have you tried different compression formats for the layers and seen if that has any effect, or I guess for the image as well?

Yeah, it's a great question, and I think it's a different dimension. We've tried the zstd algorithm. It does improve the image pull a little bit, but it still requires downloading and unpacking.

Hi, my question is about solution 1. In this chart, you're not including the time to create the node pool with a disk from the image, which makes sense. But I'm wondering, if you were to factor that in, in a cloud setting where you're provisioning nodes dynamically, is there still a perceived benefit to the first step of building the disk image with the preloaded containers?

So the question is, why not include the node provisioning?

Mostly: even if you were to include the node provisioning, would there still be a benefit to this pre-creation of the disk image?

Yes, because the node creation time doesn't change. With this approach, when you create the node, it comes with the preloaded containers out of the box. So there's no extra time cost, and you benefit from the cache.

I see. OK, I think the one question mark I had was whether most of the cost was still coming from the download step, in which case pre-building the image versus downloading it at runtime wouldn't make a difference. But you're saying it would.

Let me think about it. The node provisioning doesn't change, right? The node is still there; the machine time is still the same. But provisioning the disk is super fast if you already have the disk image in that location, so you don't need to copy the data. On the cloud, when you create a disk, it's just a pointer to the underlying data. So it's super fast, within a few seconds.

I see. So there's no actual data copying during the creation process. But what if your image lived as an AMI in Amazon or something like that, where you're potentially having to download it into the cluster?

Yeah, it's the same. Sorry, say that again? An AMI can also benefit from this.

OK, cool. Well, thank you.

Nice talk. This is Yuan Cheng from Apple. I have two questions. The first question is about the cost. For the first solution, preloading or prefetching the image or snapshot onto an additional disk, will this increase the cost?

It will increase it a little bit, but relatively it's almost nothing. It's just a disk image of that size, and you attach a disk to each node. The disk will cost some money, but when you run machine learning workloads, the GPU and the machine are the major part of the bill; adding one disk is a relatively small part. I think someone may have some comments on that.

Well, I'm trying to merge that fix into containerd. And yeah, that's it.

Any other questions?

Yeah, so regarding the cost, on this slide I don't quite understand how the parallel unpacking reduces the cost. What's the baseline you compare against? It not only improves the performance and reduces the latency, but also reduces the cost, right?
Can you elaborate a little on how it reduces the cost? The parallel unpacking, yeah, the right one.

It does not reduce cost; the cost is the same.

But on your next slide, I think you show the blue line improving not only the performance but also the cost, right?

I need to clarify this graph. The right-side axis only applies to the yellow line, and the left-side y-axis applies to the red and blue lines. So it does not reduce cost; it reduces latency. When you increase the spending on the disk, you see more latency improvement from the new approach compared to the red line.

OK, got it. That's how to read this graph. The second question is: do you think this prefetching and preloading approach can work with cluster autoscaling, if I want to add additional new nodes and provision them? How should I apply this approach to that?

It applies to autoscaling, and that's a great question. It's an API on the node pool or instance group, so when new nodes get created, they come with the disk attached. They are the same kind of nodes; adding more nodes just adds more disks.

OK, thank you. Thanks.

Hey, just a quick question about how you manage the lifecycle of your disk image. You mentioned that the disk image is pre-baked with the model. What happens when you want to update the model, and what's your process around that?

Yeah, that's a great question. I think it's a limitation of this approach: it's not that easy to change the disk image. You need to build it again with a new version of CUDA or a new version of PyTorch to get this cache template, and every time you upgrade, you need to rebuild the disk. But it happens at build time.

Is there an integrated workflow to re-create the disk image?

It can be part of the infra team's responsibility, so we have a clear separation of responsibilities. And the application teams don't need to change their containers.

Thanks, makes sense. The other question I had is about the unpacking you showed on the slide. You mentioned you're making the proposed changes in containerd for how layers are unpacked in parallel. Is the output after the proposed change going to be the same, or are there any other trade-offs for the application?

It is the same. Yeah, that's it. It is the same.

Thanks. We're running out of time; I think we can answer one more question.

OK, I'm the lucky one, I guess. Thanks, very good topic here. I just have one question, maybe out of curiosity: thinking about a disaggregated storage situation, with a remote storage backend, how are these techniques, parallel unpacking and preloading, affected in that scenario? Have you tested that?

Can you clarify which combination I should test?

Just for the whole server: the node's backend is using remote storage, and the image is kind of already there.

To compare, whether you need to pull from the registry? We don't need to pull it again, because it's containerd's format and it is compatible between different containerd versions.

So will it still need that amount of time?

Sorry, can you say it again?

Will it still need that amount of time, even when we are using remote storage?

To clarify, all the cloud disks are that kind of remote storage, so we don't change that. It's still the same.

OK, got it. Thank you.
You can imagine that all the system calls to the disk are converted into API calls to the backend distributed system. You create a file with a system call, and it is actually created in the backend of a distributed system.

Oh, thank you.

My last question. Yes, we are running out of time, but this is the last one. I have two questions. Regarding the first approach, have you tried the NFS protocol instead of iSCSI? It seems like you are attaching a disk to every node. Instead of using iSCSI, with NFS you could mount the disk from anywhere and reduce a lot of the time, and we wouldn't need to take a snapshot per external disk.

Can you comment, by using what?

NFS. For instance, we could mount some kind of NFS storage instead of the external disk per node.

I think it is feasible for an on-prem environment, but I haven't tried that.

AWS, for instance, supports EFS, which is NFS file storage. So I think it would be better to use NFS storage, the equivalent of AWS EFS.

Oh, I know that one. In Google, we also have a similar option, but the disk implementation is faster than NFS in Google, because it is a native disk implementation inside the cloud and we have made a lot of improvements to that approach. But NFS can also apply, and there are other approaches, like GCS FUSE, which can also apply but are still slower than this one.

So for the AWS use case, try NFS, which is EFS; it could be faster than your approach in step one.

I think the idea can apply, but I haven't tested that yet in AWS.

Yeah, cool, thank you. And for your enhancements, which Kubernetes version will they be available in? Will the latest Kubernetes contain these features?

I think I'm testing with 1.26.

Oh, good, thank you.

I know we're running out of time, so I hope this is a quick question. My question is: what event triggers the image downloading for the first option? You said you have a disk that's dedicated for image preloading, right? But what triggers the image pull? Because that should happen before the pod is scheduled, right?

Yes. So the question is, does the image pull happen before the pod comes in?

Yeah, before the pod scheduling, right?

If the question is whether the image creation happens before the pod scheduling, the answer is yes, it happens before. You know what you will deploy, and you need to prepare for it, like preparing the infra for the workloads.

So what's the event that triggers that? Because the pod YAML gets sent to etcd, and then there's a controller that watches it; the pod controller actually watches those.

Oh, the actual workflow is more like the application team telling the infra team: I need those containers to be preloaded, please put them into the contract. So it's a separate process. It's a manual human process, not an automatic one; it's more like a CI/CD process.

Oh, OK, thank you.

Great, thanks. Thank you everyone for joining us.