Hi everyone, and welcome to this session on scaling up without slowing down: accelerating pod start times. We're so excited to see so many of you here today on this Friday afternoon, and seeing this overflowing room is really thrilling. Thank you so much for joining us. I'm Ganesh Kumar Ashoka Vardhanan, a software engineer at Microsoft on the Azure Kubernetes Service team. I work on node lifecycle and GPU workloads, and I also developed a feature called Artifact Streaming, which speeds up pod start times on AKS, in close collaboration with the Azure Container Registry team. With me today is Yifan Yuan from the Alibaba Cloud storage team; he's a major maintainer of the OverlayBD open source project, which we use for AKS and which many other companies use as well.

Have you ever had to wait to book tickets for your favorite concert or sports event? Maybe it was a Taylor Swift concert: you go to the website and try to book the ticket, but the server crashes or you're not able to log in and make progress. That often happens when many people suddenly hit a website and the load spikes. It can be caused by many things, such as network congestion, storage throttling, or containers failing to make progress and going into crash loop backoffs. It could also be due to slow node scale-up. That leads to a poor user experience and a poor developer experience. So in this presentation we're going to talk about the common approaches you can use to speed up pod starts, explain why pod start takes so long, share high-level approaches to reduce pod start time, and talk about the challenges that show up at scale when you're trying to scale up quickly. We'll also show a demo with a challenging LLM-based workload, and talk about our experience integrating this in production and the various trade-offs we had to consider.

So what are the common approaches to address this problem? Most container runtimes, like containerd, already have layer sharing. As you know, a container image is made up of many layers, and these layers need to be pulled and decompressed. If the layers are already present on the host node, container runtimes like containerd don't need to pull them again. There's another option in Kubernetes: since Kubernetes 1.27 there's a kubelet flag called serializeImagePulls, and if you set that flag to false, you can parallelize the pulls on your node. This is especially useful when you have multiple pods scheduled on the same node that need to make progress simultaneously. Another approach many teams use is to reduce the image size: they carefully go through the Dockerfile to trim dependencies, or they use tools like SlimToolkit to reduce the content in the image programmatically. For teams and products that are very latency-sensitive and are okay with bearing additional cost, the approach is to pre-download the images onto the nodes themselves. This is known as having warm nodes: you create extra nodes in the same node pool and then use CRDs, like the OpenKruise CRDs, to download the image before it's needed, as in the sketch below. Then, when your replica needs to be scheduled on one of those nodes in the future, the image has already been pulled and you don't need to pull it again. This reduces the time significantly because the image contents are already present on the node, but it also increases your costs.
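As a rough illustration of the warm-node pre-pull idea (this is not the AKS mechanism itself), here is a minimal Python sketch that submits an OpenKruise ImagePullJob through the Kubernetes custom-objects API so that nodes pull an image before any pod needs it. The image name is made up, and the ImagePullJob spec fields should be double-checked against the OpenKruise docs for your version.

```python
# Minimal sketch: pre-pull an image onto nodes with an OpenKruise ImagePullJob,
# so a pod scheduled there later finds the layers already in the local cache.
# Assumes OpenKruise is installed; spec fields are illustrative.
from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a cluster

image_pull_job = {
    "apiVersion": "apps.kruise.io/v1alpha1",
    "kind": "ImagePullJob",
    "metadata": {"name": "prewarm-inference-image"},
    "spec": {
        "image": "myregistry.azurecr.io/inference:v1",  # hypothetical image
        "parallelism": 10,                  # how many nodes pull at once
        "completionPolicy": {"type": "Always"},
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="apps.kruise.io",
    version="v1alpha1",
    namespace="default",
    plural="imagepulljobs",
    body=image_pull_job,
)
```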
So with warm nodes, the only thing accelerating is really your cloud costs, your compute costs. And that matters even more for GPU workloads, because GPU nodes are hard to come by and very expensive, so maintaining a warm pool like this is hard to scale.

So why does it take so long to start a pod, especially when the container image is large? Most of the time is spent in the image pull phase. As you can see for this 1.3 GB image, a significant fraction of the time is spent just downloading the image content and decompressing each of the layers, and much less time is spent on the other steps. Research has also found that only a small fraction of the image is needed to start: there's a paper showing that, on average, about 6.4% of a container image is needed to start running a workload and do meaningful work. And in some of these examples too, we see that only a small fraction is needed to get to a ready state. This is the core observation that many open source solutions use to address the problem: download the data that is needed, when it is needed. Open source projects like OverlayBD, SOCI, Nydus and eStargz are some of the main projects in this space. The main differences between them are whether they expose layers as a block device or as files, how the image is converted, and the decompression algorithms they use, but all of them rely on this core insight.

In AKS we use the OverlayBD open source project, which is a containerd non-core sub-project and integrates well with containerd's remote snapshotter mechanism. OverlayBD stands for Overlay Block Device: it is a layered block-device image format that allows seekable online decompression and provides layers as a sequence of blocks through a virtual block device. Its decompression algorithm is also much faster than gzip decompression. Application system calls are essentially translated into remote calls through OverlayBD when they are needed, so this is done purely on demand. There's a diagram that illustrates the various steps involved, but we won't go too deep into the details here because there's also a paper that explains every step.

When we tried this on AKS, we noticed that pods start significantly faster. As you see, with a larger container image, like a 6 GB image, it takes several minutes, six and a half minutes or so, to start from an OCI image, because it needs to pull all of the data. But an OverlayBD image starts almost instantly. The main caveat is that the start time here refers to the time it takes to pull the metadata for the container image and download the minimal amount of data needed to set up the OverlayBD components. To know the realistic start time, we also need to set up readiness probes that capture what actually needs to be ready before the workload can serve.

One of the challenges during scale-up is that you often need to read a significant amount of data. In this example we set up a readiness probe that reports ready only after 1 GB of data has been read sequentially, and we used OverlayBD. The first pod still takes a long time to start, about three minutes, and that's partly because of throttling from the registry side. You can also see that there is a significant stretch of time during which not many pods are becoming ready.
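To get a feel for why the ramp-up flattens, here is a toy back-of-the-envelope model in Python. The pod count and the 1 GB per-pod read come from the experiment above, but the bandwidth figures are illustrative assumptions, not measured numbers.

```python
# Toy model: N pods start at once and each needs to read D GB before it is
# ready, all funneling through one shared data source with a fixed budget.
# Ignores decompression, throttling back-off and per-node limits; it is only
# meant to show how a single shared source becomes the bottleneck.

N_PODS = 1000
DATA_GB_PER_POD = 1.0        # the readiness probe above requires 1 GB read
REGISTRY_GBPS = 25 / 8       # assumed shared registry/network budget (25 Gbit/s)
CLUSTER_GBPS = 40            # assumed aggregate peer-to-peer bandwidth, GB/s

def time_until_all_ready(pods: int, source_gbps: float) -> float:
    """Seconds until every pod has read its data, sharing the source evenly."""
    return pods * DATA_GB_PER_POD / source_gbps

# Everyone reads straight from the registry:
print(f"registry only      : ~{time_until_all_ready(N_PODS, REGISTRY_GBPS):4.0f} s")
# Only a few seed nodes read from the registry; peers exchange data in-cluster:
print(f"P2P inside cluster : ~{time_until_all_ready(N_PODS, CLUSTER_GBPS):4.0f} s")
```

The real curve is messier, but the shape is the same: everything queues behind one shared data source, which is what the P2P approaches described next address.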
That slow stretch is because of significant network congestion and storage throttling when a thousand pods attempt to pull at the same time: we're going from zero to a thousand. This graph will look different depending on your registry and on the internal operations happening in the background in your cloud environment or your on-prem environment.

For the large language model case, which is becoming increasingly interesting to many people, we also did an extensive study. You can think of an LLM as a series of giant matrices: the input text is converted to a vector, and that vector goes through a series of operations to produce an output vector. All of these giant matrices need to be in memory before you can do any inference work. For this experiment we used a Falcon 7B inference workload, deployed with the Kaito framework, which we mentioned in the keynote a couple of days ago. Falcon 7B is just an LLM with lots of parameters, but it's one of the smaller LLMs. With the model baked into an OCI container image, it takes about five and a half minutes to start running this workload. We were curious how long it would take with pure on-demand image loading, and we saw it took seven and a half minutes, which is really disappointing. As you can see, most of the data in the container image is the model, and most of that is needed at the very beginning, which offers some clues as to why this was slow. Now my co-presenter Yifan will share how we can improve this and what other approaches we can use.

Thanks, Ganesh. In the next part, I will show you some solutions for these scenarios. First is P2P. P2P technology can be used to reduce a lot of the registry load and speed up image pulling. Only a few nodes download data from the registry directly, and the rest get data from other peers. There are open source P2P projects that can be used either with plain OCI image pulls or with streaming OCI. Kraken is a P2P-powered Docker registry that focuses on scalability and availability; it's designed for Docker image management, replication and distribution in a cloud environment. Dragonfly is a great project that has been around for a while; however, it covers many complex cases that require additional dependencies like Redis and MySQL, so it consumes more resources and means more components for end users to manage. Spegel is another open source P2P project, which focuses on solving a smaller problem, with fewer features and components than Dragonfly. These P2P systems all share some common components, although they may use different names; let's take Dragonfly as an example. The manager maintains the relationship between P2P clusters. A peer provides upload and download functions. A seed is a back-to-source download peer in the P2P cluster. The scheduler selects the optimal parent peer for each downloading peer, roughly as in the sketch below.

For the previous test, if we use streaming OCI plus P2P, the cluster's requests to the registry become just a few connections from the seeds, and the rest of the peers get data through the P2P network. The 1,000 pods can each read 1 GB of data in about 23 seconds. However, there is no improvement in the LLM case; it still needs more than 7 minutes. If the node and pod scale-up is spread out over time, the benefits of P2P are much higher, because nodes that already have the image cached act as seeds. Here, none of the peers have any image cache.
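As a rough illustration of the scheduler role mentioned above (this is not Dragonfly's actual algorithm), here is a toy parent-peer selection in Python. Real schedulers also weigh bandwidth, piece availability, locality and peer health; this one only prefers the least-loaded non-seed peer that already has the requested chunk.

```python
# Toy illustration of what a P2P scheduler does: pick a parent peer for a new
# downloading peer, preferring peers that already cache the chunk so the seeds
# (and the registry behind them) are touched as little as possible.
from dataclasses import dataclass, field

@dataclass
class Peer:
    name: str
    is_seed: bool = False
    pieces: set[int] = field(default_factory=set)  # data chunks already cached
    active_downloads: int = 0

def select_parent(peers: list[Peer], wanted_piece: int) -> Peer:
    candidates = [p for p in peers if wanted_piece in p.pieces or p.is_seed]
    # Non-seed peers first, then the least-loaded one.
    candidates.sort(key=lambda p: (p.is_seed, p.active_downloads))
    return candidates[0]

peers = [
    Peer("seed-0", is_seed=True),
    Peer("node-17", pieces={1, 2, 3}, active_downloads=2),
    Peer("node-42", pieces={2, 3}, active_downloads=0),
]
print(select_parent(peers, wanted_piece=3).name)  # -> node-42
```

The point is simply that most traffic stays between peers, and the seeds go back to the registry only when nobody else has the data yet.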
Although streaming OCI avoids a full image pull, it downloads and caches partial data from the registry based on the I/O requests of the workload. Data is returned directly from the local cache on a cache hit, or downloaded from remote storage on a cache miss. Compared to local storage or a cloud disk, downloading from remote storage has a higher latency, especially for LLM apps, which access almost the entire image, so streaming degenerates into a full download. For sequential reads, OverlayBD degenerates into a cycle of downloading a single data chunk, waiting for the next cache miss, then downloading the next chunk. A cache hit takes only microseconds of latency, while a cache miss takes around 100 ms to download the data from the registry.

In addition to reducing the network pressure on the data source, most P2P solutions can also be seen as a large caching system. We can pre-download the image data to the P2P seeds in advance, so that when the application runs, OverlayBD can easily get data from the P2P seeds. Instances with large disks can be used as primary cache seeds for the P2P network, without the need for expensive GPU instances. Preheating data into P2P can also be used with OCI images, but it performs significantly worse than streaming, because we have shortened the image download time but cannot reduce the decompression time. With a P2P system, OverlayBD turns remote data access into nearby data access, reducing the I/O latency from around 100 ms to the millisecond level. It is worth mentioning that OverlayBD also puts less pressure on the disk and the CPU than OCI images: because the format supports random access to compressed data, it doesn't incur the extremely high CPU overhead of gzip decompression, and at the same time OverlayBD keeps the data it fetches on disk in compressed form, so disk pressure is significantly reduced. The figure shows OverlayBD plus P2P starting the LLM container in a 32-node GPU cluster: compared to the OCI image with P2P, CPU usage is reduced by 5%, and disk BPS is reduced by nearly half.

This is the summary of our tests. We could only create a cluster with 32 nodes, since GPUs are hard to find. The results in this table show that even without data preheating, P2P does not cause a performance loss. After data preheating, the OCI image with P2P can finish the startup process and model loading within 2.5 minutes. Is there any way to make it faster? Of course. Since we know that the framework can only start after all the model weights have been loaded, we can prefetch the required data in advance during the streaming process. So we designed a prefetcher that reads specified files in parallel, similar to running FIO random reads with multiple jobs. For this LLM image, the prefetcher creates multiple threads to download the model weights as quickly as possible, as opposed to the single thread that pulls the model weights for the OCI image, leading to a faster pull. OverlayBD also writes certain data directly to memory and saves the compressed data on disk, which speeds up the pull further. In this way, it is much faster than a full download of an OCI image, and it doesn't carry the additional cost of pre-downloading the OCI image onto the node. There is also an advanced feature that programmatically records a trace of block reads at startup, pushes the trace metadata to the registry, and prefetches that data during future pod starts. Here is a demo comparing the pod ready time.
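Before the demo numbers, here is a minimal sketch of the parallel-prefetch idea just described, written in plain Python rather than as the real OverlayBD prefetcher: read the model weight files through the lazily loaded rootfs with many threads, so the on-demand backend fetches chunks concurrently instead of serializing on one cache miss at a time. The weight-file path and chunk size are assumptions.

```python
# Sketch of a parallel prefetcher: warm the lazy-loading cache by reading the
# model weight files with many threads, so chunk downloads overlap instead of
# serializing on one cache miss after another. Illustrative only; not the
# actual OverlayBD prefetcher.
import glob
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4 * 1024 * 1024                       # 4 MiB per read (assumption)
WEIGHT_GLOB = "/workspace/models/**/*.bin"    # hypothetical weight file layout

def warm_file(path: str) -> int:
    """Read a file end to end, forcing the on-demand backend to fetch it."""
    total = 0
    with open(path, "rb") as f:
        while True:
            buf = f.read(CHUNK)
            if not buf:
                return total
            total += len(buf)

files = glob.glob(WEIGHT_GLOB, recursive=True)
with ThreadPoolExecutor(max_workers=16) as pool:   # multiple "jobs", like FIO
    fetched = sum(pool.map(warm_file, files))
print(f"prefetched {fetched / 2**30:.1f} GiB across {len(files)} files")
```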
On the left side is streaming OCI with preheated P2P and read-ahead; pulling the OCI image directly from the registry is on the right side. The test shows that after adding read-ahead I/O, the startup time is reduced by a further half a minute. By the way, we only used one cache node in this case, which limits the throughput. If we have multiple nodes forming a distributed cache, OverlayBD can connect to different nodes for concurrent I/O requests and achieve gigabytes-per-second of throughput. Okay, in the next part Ganesh will introduce our experience running streaming OCI in production.

Thanks, Yifan. So now we've seen how workloads can be improved well beyond the base case of on-demand image streaming, which is already quite powerful. Now I want to share our experience integrating this solution in production on AKS, to show what happens behind the scenes and also to give you an example of what you could do if you were to manage your own setup. The goal is a simple user experience for users of Azure Kubernetes Service: all they need to do is say that they want artifact streaming, or accelerated image loading, enabled on their nodes, and indicate for their images in ACR that they would like this feature. From the user's perspective the image conversion is transparent, and so is the usage on the nodes. On the registry side, we have a conversion service that converts the image to an OverlayBD image, so the regular workflow for a user is unchanged: they push the OCI image to the registry and the conversion happens in the background. An important point to note is that the OverlayBD image is treated as an OCI image by Kubernetes as well, because it follows the spec. Leveraging ORAS artifacts and the referrers API, on the node we check whether the image manifest has an attached OverlayBD artifact. If there is an attached artifact, which is produced on the backend on the registry side, then that image is pulled via OverlayBD; if it's not present, the regular OCI image is pulled, indicating that the accelerated image is not needed. There are also components on the node, like the OverlayBD driver, the snapshotter and the containerd config changes, along with certain systemd units, that make it all transparent to users and that monitor the components to make sure they're working as expected. In this diagram you can see the flow for the different cases of whether the image manifest has an attached OverlayBD artifact, and the way we've done this also makes it extensible to different image formats in the future if necessary.

So what are the key takeaways from this presentation? One is that you can significantly speed up workload execution through on-demand image loading; this is useful for most images, and the smaller the fraction of the image needed at the start, the better it gets. So if only a small fraction of the image is needed at the start, you'll see a huge time reduction. For large pod scale-ups, you can improve the process through P2P, either standalone, without on-demand image loading, or together with on-demand image loading; that reduces your network and storage throttling significantly, and the improvements are even better when your scale-up is staggered, because existing nodes in your network act as peers for your new nodes.
And when most of the image is needed at the start, as with LLM workloads and certain ML inference workloads, P2P with the image pre-downloaded onto a few seed nodes, plus I/O read-ahead, speeds it up even further with a negligible cost increase. The cost increase is negligible because you might only need one or two CPU-based seed nodes with low-latency access for your peers, compared to having additional GPU nodes if you wanted a warm pool built from scratch. And for a simple user experience you'd want to hide the image conversion from the user: you can leverage ORAS to attach the converted image as an artifact to your manifest, and then use containerd remote snapshotters like OverlayBD on the node, so that it's also transparent to anyone who uses that platform. The user doesn't need to make any changes to their deployment specs and can quickly enable streaming on their workloads. So with this, we hope you can speed up your pod starts, scale quickly, and make sure nobody has to wait too long to book those Taylor Swift concert tickets. Thank you. Please leave us feedback here, and we'll also have some time for questions. By the way, a couple of things: we've linked all of the resources we've mentioned, many of which are open source projects, so you can check those out. And we'd also like to give huge credit to the many people from our various teams who were involved in supporting this project.

Okay. Yeah. So the question is: since this talk is about lazy pulling, how do you ensure image integrity? How does one know that this is really pulling what I wanted, and that nothing was switched in between? Great question. So I think you're relying, firstly, on the conversion service itself, because that's handled completely on the backend, and you can see the image being converted to the different format. You can also leverage ORAS artifacts to make sure that the content is the same. And one way for a user to verify that is to download the same data on the node; they can check at any time that the same data is present, which also ensures the content matches. Another option for the future is to make it easy for users to verify it by hashing some of the additional content too.

Hey, thanks for the presentation. I wanted to know how the seeding per node was set up. Is it the open source projects you shared that set up the seeding per node? Yeah, so the seeds serve a set of nodes; you could have one node that acts as a seed node. And there are open source projects, like DADI P2P or Dragonfly, which actually set some of that up for you automatically. Thank you. Thanks.

Well, very well presented talk. I had a quick look at the Spegel docs, and one thing I've wondered: I've always been sort of surprised that containerd stores your image twice on disk, right? It stores the compressed form and the uncompressed form, and sometimes there are compressed artifacts in there; it's a bit redundant. And it looks like Spegel then stores it a third time, right? It stores it once inside its registry, that registry serves it to containerd, which then downloads the compressed artifact from the local pull-through registry and decompresses it. I was wondering if you're aware of any of these P2P solutions that avoid that extra duplication? Yeah.
So the OverlayBD project itself actually has an optimization to avoid the decompressed format always being stored on disk. That's related to what Yifan was sharing about the lower disk pressure: you don't store the full decompressed format on disk before you read it. So that's one of the optimizations done in this project. Thanks, I missed that. Okay, but on the P2P side, if we separate it from the streaming side, do you know of a P2P solution that, for example, can serve from the same storage as containerd? So on the P2P side as well, even for the... Can I just jump in? Okay. So Spegel actually serves from containerd storage. It does. Yeah. So if you have other questions, I can probably answer those as well. Awesome, thank you. Wow, thank you. And by the way, Spegel's really simple to use, so great job. Awesome. Thank you so much, and we hope you have a great rest of the weekend as well. And thanks again for coming to this talk.