Hello everyone, welcome to our session on cracking the FaaS cold start and scalability bottlenecks. My name is Casey Zhang, and with me is Rizang. We both work in the cloud-native area at Intel's central software engineering team.

Here is what we will cover in today's presentation. First, we will give a brief overview of FaaS and the major challenges faced by current FaaS providers. Then we will go through the steps involved in the FaaS function cold start process and analyze which steps contribute the most to the cold start latency. After that, we will dive into a new approach of creating a function instance from a snapshot. We will also talk a little about the auto-scaling bottleneck and our approach of scaling out new function instances inside an existing micro-VM. Last, we will show our test results and compare the cold start latency of the existing way of starting a function instance versus our snapshot-based way of starting a function instance.

Before I dive into our snapshot-based way of starting a function instance, I would like to first set up the context for the discussion. So what is FaaS? FaaS is a computing service based on an event-driven architecture which allocates micro-VMs or containers on demand to run the developer's function code in response to event requests. FaaS provides three key features. The first is automatic, on-demand instantiation of function instances to run the function code upon a trigger event. The second is on-demand auto scaling out and in, with no need for the user to plan for peak traffic. The third is utility-based billing: the user never pays for idle time.

FaaS is said to be the future of cloud computing, and it is gaining a lot of momentum in the industry. The diagrams on the right are sourced from Datadog. The top diagram shows the percentage of organizations using FaaS provided by AWS, Azure, and Google Cloud; at least one in five organizations is adopting FaaS. The bottom diagram shows the average daily invocations per Lambda function, indexed from 2019 to 2021; Lambda functions were invoked 3.5 times more often in 2021 than in 2019. To support multi-tenancy and security, functions are usually run inside a micro-VM-like sandbox.

FaaS provides a lot of benefits to the user. The users do not need to manage any infrastructure, since the FaaS platform takes care of when and how to create a function micro-VM. The FaaS platform manages auto scaling automatically to meet the peaks and valleys of event traffic. The users only pay per usage; the FaaS platform does not charge for the micro-VM resources after the user's function completes execution.

With all these benefits come some challenges. One of them is the cold start latency. Cold start latency refers to the time it takes the FaaS platform to create a new function micro-VM, set it up properly, load the function language runtime, and get the function manager ready to start the function execution. If your function has been scaled down to zero, it can take a few seconds to get a new function instance ready to serve. The other challenge is the speed of auto scaling in response to burst traffic. Scaling up many new function instances in seconds is really hard to do, for the obvious reason that it takes time to cold start all these new function instances and get them ready to serve traffic.

The diagram on this slide shows the steps involved in the function startup process.
In the container world, this process runs in the containerd context. Let's go through each of the steps. When the event trigger comes, containerd will pull the function image from a remote registry if the image is not on the node's local disk. Then containerd sets up the root file system and starts the runtime shim component. When the runtime shim starts to run, it sets up the CNI network for the micro-VM and then creates the micro-VM, which serves as the running environment for the function instance. Then the micro-VM kernel is booted up, and some user-space components of the micro-VM are initialized. After that, a container is created. Now the runc component takes over and starts loading the language runtime: if the function code is written in Java, the Java runtime and JVM will be loaded; if the function code is written in Python, the Python interpreter will be loaded, and so on. Each FaaS platform usually has a function manager component, which is the first program to run in a newly created container, and this function manager then invokes and executes the user's function.

A few mechanisms are used by current FaaS providers to reduce the cold start latency. For example, some FaaS platforms will not delete a function instance after it finishes its execution. Instead, they keep the function instance running in an idle state for an extended period of time, for example 15 minutes, hoping to catch the next trigger event. When the next trigger event comes, the function instance sitting in the idle state can serve the traffic without going through the cold start cycle. But this mechanism introduces idle cost, which is not good.

In this presentation, we will introduce a new snapshot-based mechanism to reduce the cold start latency. We can see that steps one through ten all contribute to the cold start latency. Among all these steps, we found that step one takes a lot of time, and steps five through ten also take a lot of time. Of course, if the function image is already on the local disk of the node due to a previous run of the same function, then steps one through three can be skipped. In our new mechanism, steps five through ten are replaced by a single step: instead of going through steps five through ten to create a new function instance, the new mechanism creates every new function instance from that function's snapshot. Different function code images will have different function snapshots. A function snapshot is taken at the time the user registers and uploads their function code to the FaaS platform's code registry, and the snapshot is saved together with the function code image in that registry.

You may ask: what is a function snapshot? What we mean by a function snapshot is a snapshot of the micro-VM's memory and its virtual hardware state, taken after the function micro-VM is created. Then how do we get a snapshot of a user function? We get a snapshot by test-running the function through a whole lifecycle. That is, we take a snapshot of the function micro-VM's memory and virtual hardware state when the function finishes execution and returns to the idle state, waiting for the next trigger event.
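To make that concrete, here is a minimal sketch of how such a snapshot could be taken through Firecracker's HTTP API, which is served over a Unix domain socket. The socket path and output locations are illustrative assumptions, and the exact request fields can vary between Firecracker versions; this is not the exact code used in our platform.

```python
import http.client
import json
import socket


class FirecrackerAPI(http.client.HTTPConnection):
    """Minimal HTTP client for Firecracker's API served on a Unix socket."""

    def __init__(self, socket_path):
        super().__init__("localhost")  # host is ignored for Unix sockets
        self.socket_path = socket_path

    def connect(self):
        sock = socket.socket(socket.AF_UNIX, socket.SOCK_STREAM)
        sock.connect(self.socket_path)
        self.sock = sock


def api_call(socket_path, method, path, body):
    conn = FirecrackerAPI(socket_path)
    conn.request(method, path, body=json.dumps(body),
                 headers={"Content-Type": "application/json"})
    resp = conn.getresponse()
    assert resp.status < 300, resp.read()


# Illustrative paths; a FaaS platform would pick per-function locations.
SOCK = "/run/firecracker/fn-python.sock"
SNAP_DIR = "/var/lib/faas/snapshots/fn-python"

# 1. Pause the micro-VM once the function manager has gone idle.
api_call(SOCK, "PATCH", "/vm", {"state": "Paused"})

# 2. Take a full snapshot of guest memory and virtual hardware state.
api_call(SOCK, "PUT", "/snapshot/create", {
    "snapshot_type": "Full",
    "snapshot_path": f"{SNAP_DIR}/vmstate",   # virtual device / vCPU state
    "mem_file_path": f"{SNAP_DIR}/memory",    # guest memory image
})
```

In this kind of flow, the platform would run the snapshot call against the test-run micro-VM and then store the resulting state and memory files in the code registry alongside the function image.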
We experienced some problems when creating a function instance from its snapshot. The first problem is that some function snapshots contain user-sensitive data such as passwords, secrets, et cetera. For security purposes, we would not like the snapshot to contain any user-sensitive data. The second problem is that some function snapshots may contain a few unique IDs associated with the specific function instance created during the test run, and these unique IDs are not portable to new function instances. The third problem is that a function snapshot contains some unique data associated with the specific micro-VM instance created during the test run, such as the IP address of that specific micro-VM, and this unique data is also not portable to the new function micro-VM. The fourth problem is that the function code package becomes larger due to the snapshot data added on top of the original function code image, and this larger code package incurs a longer loading time.

The solution to problems one and two is to take the snapshot at an earlier point in time. Instead of taking the snapshot after the function finishes execution and returns to the idle state, we can take the snapshot right after starting the FaaS platform's function manager, but before it invokes the function. Because no user function has been invoked yet, the snapshot will not contain any user-sensitive data or any unique IDs associated with a specific instance.

To address problem three, we developed a small piece of code which regenerates a set of new unique data and associates it with the new micro-VM instance. This piece of code is executed after step five but before step six.

To address problem four, we break up the original function code image into two parts. One part is the essential code blocks, which are needed to start and run the function in the majority of scenarios. The other part is the non-essential code blocks, which are either not needed to run the function or are only needed in very special scenarios. You may ask how to identify the essential code blocks. We identify them by running the function in several different test environments and marking those code blocks that are loaded into the function micro-VM's memory during those test runs. We then group those code blocks into an essential code block layer. It is now a well-known fact that only a small portion of a function's original code image is needed to run the function, and our test results also validate this fact. Now that the original function image has been broken up into essential and non-essential code blocks, our approach can further reduce the cold start latency by starting the function creation process at an earlier point in time, namely right after downloading the essential code blocks. The remaining non-essential code blocks can be downloaded in parallel in the background, or as needed.
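On the restore side, the sketch below shows roughly how a saved snapshot could be loaded into a freshly started Firecracker process, with the per-instance fix-up from problem three hooked in afterwards. It reuses the api_call helper and SNAP_DIR path from the earlier sketch; the regenerate_instance_identity helper, its arguments, and the way it reaches into the guest are assumptions for illustration, not something Firecracker or firecracker-containerd provides out of the box.

```python
# Reuses api_call and SNAP_DIR from the earlier snapshot-creation sketch.
NEW_SOCK = "/run/firecracker/fn-python-instance-42.sock"  # fresh Firecracker process

# 1. Restore guest memory and virtual hardware state from the saved snapshot.
#    (Older Firecracker versions take "mem_file_path" instead of "mem_backend".)
api_call(NEW_SOCK, "PUT", "/snapshot/load", {
    "snapshot_path": f"{SNAP_DIR}/vmstate",
    "mem_backend": {"backend_type": "File",
                    "backend_path": f"{SNAP_DIR}/memory"},
    "resume_vm": True,
})


def regenerate_instance_identity(instance_id: str, guest_ip: str) -> None:
    """Hypothetical fix-up step: re-assign the data that must be unique per
    micro-VM, e.g. the guest IP on the CNI interface and any instance IDs.
    How this reaches the guest (agent RPC, vsock, etc.) is an implementation
    choice of the FaaS platform, not part of Firecracker itself."""
    ...


# 2. Run the fix-up before the container inside the micro-VM starts serving.
regenerate_instance_identity("fn-python-instance-42", "10.0.8.42")
```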
So in summary, our enhanced snapshot-based approach addresses the four problems. In our approach, the snapshot contains no user-sensitive data, nor any unique data associated with a specific function instance. As for the unique data associated with a specific micro-VM instance, we have developed a small program which regenerates a new set of unique data and associates it with the new function micro-VM. By breaking the original function code image into essential code blocks and non-essential code blocks, our approach can start the function creation process at a much earlier point in time, which further reduces the cold start latency.

Now I'm going to give a brief introduction to our thoughts on supporting fast auto scaling in response to burst traffic. No matter how much we slim down the micro-VM, creating a micro-VM takes time. In our investigation, we found that the function startup process spends a lot of time booting the micro-VM kernel and initializing the user-space components of the micro-VM. So instead of creating a lot of new function micro-VMs, we can speed up the scale-out process by adjusting an existing running micro-VM's resource boundary and creating more function containers inside that running micro-VM, as long as the node that holds the micro-VM has enough resources. And now I'm going to hand over to Ray to talk about our test setup and our test results.

Okay, thanks, Casey. Okay, folks, as Casey mentioned, I will walk you through some data from our benchmark. But first of all, let's start with the testing environment we used. We used an Intel Cascade Lake server with 264 gigabytes of memory. The storage we used is an Intel SATA SSD. We installed Ubuntu 20.04 on this server, and we used firecracker-containerd and the Firecracker micro-VM for testing. Both firecracker-containerd and Firecracker are projects open-sourced by Amazon; you can get their source code from GitHub. We used the firecracker-containerd recommended CNI setup and their recommended kernel for testing. We implemented the snapshot-based way of creating new instances in firecracker-containerd.

Okay, now let's look at the numbers. This page shows the comparison of cold start latency between the existing way of creating a new function instance and the snapshot-based way of creating a new function instance. We tested three languages: Python, Node.js, and Java. For each language, we selected three functions with code complexity spanning from simple to intermediate to complex. The time we measured for both ways of creating new instances is from right after the runtime shim is created until right before executing the function code. The reason we chose to measure this period of time is that both approaches start from the same state and end at the same state, and it includes the parts where the two approaches differ the most. The bar chart shows the cold start latency for each of the selected functions under the two approaches. The numbers are an average of 20 runs of each function. The blue bars are the latency using the existing way, and the orange bars are the latency using the new approach. The numbers for the blue bars do not differ very much, and the same holds for the orange bars. This is consistent with our design, because the time we measured does not include the function execution. We can see the existing way costs about 1,700 milliseconds, and the snapshot-based way costs about 630 milliseconds. The snapshot-based way saves about 60% of the time. As to the function cold start breakdown, we measured it using a hello-world Node.js function as an example.
The upper diagram is the existing way of starting a new function instance. We can see that the longest step of the existing way is step seven, user-space initialization. In this step, the guest operating system boots into a multi-user environment and starts several services, including, most importantly, the Firecracker agent service. The Firecracker agent is the component inside the micro-VM that handles the lifecycle of the container. By loading from a snapshot, we shorten the time spent in steps five through ten into the two steps of the lower diagram: the 1,431 milliseconds of steps five through ten in the existing way become steps five and six in the lower diagram, whose combined time is 284 milliseconds. So the snapshot-based way is much shorter.

As Casey has introduced, one of the challenges that comes with the snapshot-based way is that we have to download the snapshot file before starting the container. This page shows the function image size, the essential code block size, and the snapshot size for each of the nine selected functions. We can see that the essential code blocks are a small portion of the overall image size, which means that only a small portion of the data is actually used during the function test run. We took the function snapshot from a Firecracker micro-VM with 128 megabytes of memory, and we further compressed the memory snapshots, so the snapshot sizes do not differ much between functions; they are all around 50 megabytes. This is also consistent with our design, since we took the snapshot before running the function code. We can see that even combining the essential code blocks and the snapshot file, the total is still much smaller than the original image size. This results in a shorter download time and a shorter cold start time.

Okay, that's all for our session, and we will begin our Q&A session now. Thank you.