All right, let's get started. Thanks for coming, folks. Today's talk, I hope, should be a fun, slightly alternative talk about serverless. It's not so much about using serverless functions as it is about the performance optimizations that happen behind the scenes in the implementations of functions platforms. So can I just ask for a quick show of hands: how many people have used any kind of serverless function? Lambda, anything on Kubernetes? Yes? OK, cool. This talk should tell you, hopefully, a little bit about why they're slow, when they're slow, what's happening behind the scenes when there's a cold start, and why it takes the amount of time that it takes. My name is Soam Vasani. I've worked for a while on fission.io, which is an open source functions platform on Kubernetes; you can go to fission.io and check it out. But this talk is almost completely agnostic of the actual functions platform, so I won't talk much about Fission. And even though Fission runs on Kubernetes, we'll talk just a little bit about Kubernetes, but mostly about the functions implementation itself. The central trade-off in serverless functions implementations is performance versus cost. The user of a serverless function system wants basically infinite scaling: you never want to think about scaling, and you want the system to take care of whatever throughput is thrown at that function. You want low latency, for some definition of low, no matter how many requests are coming in or what the shape of the workload is. And you also want the service exposed by that function to have high availability. Now, the user wants all of these things while also having low costs. A big selling point of serverless functions is that you don't pay when they're idle. A serverless function that's not running costs you very close to zero; you should pay only for the artifact that's being stored, and even that is pretty small. So there really should be zero cost when functions are idle, and no background fixed cost. If you look at the AWS Lambda pricing, you pay per invocation; you don't pay a subscription fee, a flat fee of something per month, when you start using Lambda. And more importantly, you want very granular billing: you want to be billed for the exact number of nanoseconds that your function is running. Now, neither of these ideals is reached by any real framework. And what I'm going to show in this talk is that anything that optimizes one end comes at the expense of the other. We can optimize cost at the expense of latency. We'll talk about optimizing scaling, and it'll come at the expense of cost, things like that. So the big question is: what are the trade-offs to make, and where on that set of trade-offs should we situate our system? To talk about this, we have to talk about the implementation of serverless, and any serverless discussion must start with the fact that a serverless system has a lot of servers. This talk is about those servers. So don't pay too much attention to that picture; the idea of this talk is to model what a functions platform implementation looks like, and then abstractly talk about the algorithms, the optimizations, and the trade-offs it has to make. Very broadly, you can divide a functions system into build stuff and runtime stuff. The build stuff goes from source code to an executable artifact.
We are not going to talk about the build side much, because it doesn't affect runtime performance much. Maybe the size of the artifact matters, but other than that, builds don't affect us. So we're going to talk mostly about the runtime. To start with, let's talk about the central piece of state in any functions platform, and that's the actual storage of functions. Most of these systems split that function store into not just a blob store of function artifacts but also a metadata store that's queryable and cacheable by the various components. The metadata store holds things like the configuration: how much memory do I need, what privileges should the function have, what environment does it run in? Is it Node.js, Python, et cetera? It contains versions and various related things, and it contains a pointer to the actual executable artifact of the function. In this case I've shown Python, but it could be anything, really. In different platforms that artifact is also different: in the public clouds, for example with Lambda, you have zip files, and now you have layers as well; and in many of the Kubernetes-based systems, that function store is actually a Docker container registry. So now let's try to model the actual runtime. This is somewhat of an abstract model, but we have good reason to believe that most functions platforms more or less satisfy it. We start from the bottom right of that picture. If a function is running, it exists as an instance somewhere on a cluster. We're going to talk about Kubernetes, but you can imagine any cluster scheduler instead. Given that we're talking about Kubernetes, that function instance must run in a container; as Kubernetes does, it's going to run inside a pod, which has a network identity, an IP address. And once a function is running somewhere on the network, you need to be able to route events to it. So there's a router, on the bottom left, and it translates events into function invocations over the network. Most function platforms support HTTP, some sort of events from queues, timers, and so on. The router forwards them over the network to the running function instance and returns a response. And there has to be a deployment system. Because we're talking about serverless, the function instance does not always exist; it comes up and goes down depending on usage. So there must be a deployment system, and the router has to know how to interact with it. So let's talk about the interesting part of serverless, which is cold starts. Let's say you've just created this function and it has never run before. The first event that's routed to this function is going to cause a cold start. Let's go through the steps the system must take to implement a cold start. A function execution event comes into the router. The router figures out that it's for so-and-so function. It knows there's no function instance, because it doesn't have anything in its internal cache. So it forwards a request to the deployment system, which must query the metadata store. Again, some of these things can be cached. It then must make a request to the underlying scheduler, which causes the function artifact to get downloaded onto some server somewhere, which then causes the function instance to come up. This is all the startup time of the container, the runtime inside it, and so on. And now the router can forward the request and proxy back the response.
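To make those steps concrete, here's a rough sketch of the cold start path in Python. This is not any real platform's code; the metadata fields and the scheduler and node methods (place, fetch, start) are names I've made up purely for illustration.

```python
import time
from dataclasses import dataclass

@dataclass
class FunctionMetadata:
    name: str
    runtime: str        # e.g. "python3.9" or "nodejs"
    memory_mb: int
    artifact_url: str   # pointer to the executable artifact in the function store

class Deployer:
    """Walks through the cold start steps; every step is on the first request's critical path."""

    def __init__(self, metadata_store, scheduler):
        self.metadata_store = metadata_store   # dict: function name -> FunctionMetadata
        self.scheduler = scheduler             # stand-in for Kubernetes or any cluster scheduler

    def cold_start(self, name):
        t0 = time.monotonic()
        meta = self.metadata_store[name]               # 1. query the metadata store
        node = self.scheduler.place(meta.memory_mb)    # 2. ask the scheduler for a node / pod
        artifact = node.fetch(meta.artifact_url)       # 3. download the artifact onto that node
        address = node.start(meta.runtime, artifact)   # 4. start the container and runtime, load the function
        print(f"cold start of {name} took {time.monotonic() - t0:.2f}s")
        return address                                 # the router can now proxy events to this address
```

Everything inside cold_start is what the first request pays for; the optimizations in the rest of the talk are about either skipping those numbered steps, caching their results, or doing them ahead of time.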
So there's a whole bunch of steps here, and all of them have to run within that cold start. In other words, the time taken by these steps is your cold start time. On the other hand, after the function has cold started, the rest is very fast. Any further events that go to that function are simply proxied to that instance, and the response is returned. So to recap, a cold start triggers the deployer, fetches metadata, deploys a pod, fetches the artifact, starts the function instance, and then the request is forwarded and the function finally runs. Once a cold start has occurred, most function executions are then warm and are simply routed to and executed on that instance. So a big part of a functions platform's performance optimization is, one, making sure that most executions are warm rather than cold, and two, making sure that when cold starts occur, they occur as fast as possible. More warm executions and faster cold starts. So how well do functions platforms actually do cold starts? These graphs are from a 2018 USENIX paper. I wouldn't pay too much attention to those numbers, especially since this talk is not about a comparison of clouds, and it's old enough that those numbers might have changed significantly. But the point here is that you definitely don't want every function execution to be a cold start, and depending on your workload, even the cold starts that do occur might be a significant problem. So we're going to talk concretely about four optimizations that a platform might do, all of them related to either speeding up cold starts or reducing them. The first one, and this is something that's implemented by pretty much every system, is reusing the function instance. The simple idea is that instead of having every execution trigger a new instance of the function, you reuse the function instance. But there are some subtleties here. For example, if you look at the execution model of Lambda versus that of Azure, it's really different. In Lambda, there's a concurrency of one for every function instance, but in Azure it can be up to 50, I think. And so this affects how many cold starts happen and the workload that each running instance gets. Broadly, there's a spectrum from every single request getting a new function instance to every request using the same instance. You don't want to be at either end. Obviously, the shared-instance end has scaling problems, so you definitely want to scale up, and simply having a burst of requests is going to cause a number of cold starts. But different platforms choose different points on this spectrum, and you'll see some variability in performance because of reuse as well. In a model where you have a concurrency of one, a single slow request cannot affect another request's execution, because the other request would just go to another instance. But in that model, you would have more cold starts. So there's a trade-off to be made here along the reuse versus isolation spectrum. But like I said, almost all of the systems choose some point along this trade-off. This changes the model a little bit: the router must keep a cache of what it got from the deployer, so it doesn't have to keep talking to the deployer. A request can come in, and the router can simply hit its cache and forward straight to the instance.
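Here's a rough sketch of what that router-side cache might look like, including a per-instance concurrency cap so you can see the reuse spectrum: max_concurrency of 1 is the Lambda-like end, a higher number is closer to the Azure-like end. The names are invented for illustration, and the network proxying is stubbed out.

```python
class ReusingRouter:
    """Caches running instances and reuses them, cold-starting only when needed."""

    def __init__(self, deployer, max_concurrency=1):
        self.deployer = deployer
        self.max_concurrency = max_concurrency   # 1 is roughly Lambda's model; ~50 is closer to Azure's
        self.instances = {}                      # function name -> list of instance slots

    def handle(self, name, event):
        slots = self.instances.setdefault(name, [])
        # Warm path: reuse a cached instance that still has spare concurrency.
        for slot in slots:
            if slot["in_flight"] < self.max_concurrency:
                return self._proxy(slot, event)
        # Cold path: no instance exists (or all are busy), so pay for a cold start and cache it.
        slot = {"address": self.deployer.cold_start(name), "in_flight": 0}
        slots.append(slot)
        return self._proxy(slot, event)

    def _proxy(self, slot, event):
        slot["in_flight"] += 1
        try:
            # Placeholder: in a real router this is an HTTP request to slot["address"].
            return {"handled_by": slot["address"], "event": event}
        finally:
            slot["in_flight"] -= 1
```

With a cap of one, a slow request can never delay another one on the same instance, but a burst of N requests means N cold starts; with a higher cap, you trade some isolation for fewer cold starts.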
Now there's a question: if you decide to reuse function instances, once the requests stop coming, should you immediately kill that instance, or should you keep it alive for a while? And if you keep it alive, for how long? The trade-off here, again, is resources versus performance. You can keep function instances alive for a very long time, and that will mean that more and more cold starts are eliminated. Or you can keep them alive for a very short time, which will mean that your actual resource usage follows real usage more closely, but you have more cold starts. One thing to note here is that the public clouds have very long keep-alives, on the order of six hours to a few days. It's really interesting, by the way, how you figure this out: you can store some state in memory inside a Lambda function and then keep querying it to see whether that state has disappeared. By doing this, you can figure out how long it took for Amazon to actually kill your function. People have done these experiments and found these numbers. Again, they may differ depending on the system, the workload, and so on, but this is what we know. The functions platforms on Kubernetes can't really do this. They can't keep functions alive for days and days, because a Kubernetes cluster is essentially a multipurpose cluster. It isn't just for functions; you may be running other microservices, and you can't really allocate every resource you have to your functions system. That's why, when load goes down, most of the Kubernetes-based FaaS platforms, including Fission, only keep the container around for a few minutes at most, because those resources have to be returned to the scheduler. So that's reuse. The next optimization is about using pooling to reduce the actual time that a cold start takes. This one isn't about reducing the number of cold starts but about speeding them up. The key insight here is that you can split the function instance into two parts: the function-specific stuff and everything else. The function-specific stuff is the actual function artifact and its dependencies, so the actual function, the module, whatever is in package.json, node_modules, et cetera, that zip file. But the actual server that wraps that function isn't specific to the function at all; it's specific to the event types, the language runtime, and so on. And that's why you can pre-provision those runtimes without the function. These are sometimes called stem cells or generic containers: the idea is that you create a pool of these runtimes without any function inside them, and then you have some sort of dynamic loading scheme where a function can be dropped into a running container. This eliminates things like the Node.js or the Java startup time from your function's cold start time. It eliminates the Kubernetes scheduling cost for the Kubernetes-based FaaS systems. It eliminates the actual cost of setting up VPCs or other privileges for systems that implement that sort of thing. So it takes away a lot of time from the cold start and makes your cold starts more proportional to your actual function artifact.
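Here's a hedged sketch of what a pool of those generic runtimes might look like. The start_generic_runtime callback and the runtime's load method are placeholders for "start a container for this language with no function in it" and "dynamically load this artifact into the running container"; they're not any platform's real API.

```python
import queue

class GenericPool:
    """A pool of pre-started, function-less runtimes ('stem cells') for one language."""

    def __init__(self, language, start_generic_runtime, target_size=3):
        self.language = language
        self.start_generic_runtime = start_generic_runtime  # (language) -> warm runtime with no function loaded
        self.target_size = target_size
        self.idle = queue.SimpleQueue()
        self.rebalance()

    def rebalance(self):
        # Keep target_size warm generic runtimes ready to absorb the next cold starts.
        while self.idle.qsize() < self.target_size:
            self.idle.put(self.start_generic_runtime(self.language))

    def specialize(self, artifact_url):
        # A cold start now skips scheduling and runtime startup: grab a generic runtime
        # and dynamically load the function artifact into it.
        try:
            runtime = self.idle.get_nowait()
        except queue.Empty:
            # Pool depleted: fall back to a full cold start for this request.
            runtime = self.start_generic_runtime(self.language)
        runtime.load(artifact_url)
        self.rebalance()   # a real system would refill asynchronously, off the request path
        return runtime
```

A cold start then costs roughly "take a runtime from the pool, load the artifact," which is why pooling makes the cold start time closer to being proportional to the artifact itself.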
The way this looks is usually you have a pool. Let's say in this example the pool is really small: you have three of these runtimes, and you get two function invocations. The system will pull two runtimes out of that pool and drop functions into them. The pool is now short two runtimes. So on the left is a picture of the pool before any cold starts; in the middle, two cold starts have occurred and the pool is depleted; and in the third picture, the pool has been rebalanced. The system says, OK, I need to prepare for the next cold starts, and adds runtimes back to the pool. When you do this, the deployment system and the cluster scheduler have an extra resource pool, so the model changes and becomes a little more complex. Once you have a pool, you have to ask how big that pool should be, whether it should scale dynamically, and when to create or destroy it. And again, there's no one clear answer. A bigger pool can absorb more cold start spikes; a smaller pool is cheaper. It's worth noting that a pool is basically a background fixed cost of your serverless system. Any system that implements serverless functions and optimizes cold starts has to do this. Things like AWS Lambda amortize that cost into your function invocations, but if you're running an open source system, you're going to see it as basically a fixed monthly cost. Another optimization, and this one is pretty simple, is the idea of prefetching functions and keeping them closer to the runtime. You'll notice that the cold start included transferring the actual function: the artifact has to go from the store to the runtime. And that time is, of course, proportional to the size of the function artifact. So if a function artifact is huge, it's going to be slow. If it's compressed, it has to be decompressed. The loader is usually language specific, and some loaders are faster than others, and so on. So prefetching is the idea of taking the actual artifact and moving it closer to where it needs to run. Within a fast data center cluster, this mostly matters only for huge functions. But in some of the newer systems, like Cloudflare and Lambda@Edge, this is probably important for latency even if the function artifact is really small. For what it's worth, I don't know anything about how those systems run, but I'd be willing to bet they do some kind of optimization here. Prefetching can happen at any level: you can prefetch into the actual runtime, or you can prefetch from remote storage into the cluster. And this is the trade-off you have to make. You can live at the top end, where you always keep your artifacts in remote storage; your storage costs are then lower, and you haven't replicated your function all over the place. At the bottom end, every runtime has very fast access to every single function in the system, so you can easily schedule functions everywhere with very low latency, but you've made so many copies that your storage costs start adding up. Prefetching, as usual, complicates scheduling as well. I won't go much into the implementation of FaaS scheduling, but if you don't prefetch everywhere, there are suddenly places that can run your function faster and places that run it slower. So the scheduler has to be aware of that and at least try to schedule functions onto the faster parts, while still making a holistic decision with other resources. That always gets a little complex. So yeah, this is basically about adding caching layers in front of the function store.
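One way to picture those caching layers is as a chain of lookups, from a per-node cache out to the remote store. This is a rough sketch with made-up names and a deliberately crude eviction policy:

```python
class ArtifactCache:
    """A caching layer in front of the function store; chain a node cache onto a cluster cache onto remote storage."""

    def __init__(self, backing_fetch, capacity=100):
        self.backing_fetch = backing_fetch   # the next layer down, e.g. a cluster cache or the remote blob store
        self.capacity = capacity
        self.entries = {}                    # artifact_url -> artifact bytes

    def fetch(self, artifact_url):
        if artifact_url in self.entries:
            return self.entries[artifact_url]        # fast: already copied close to where it runs
        data = self.backing_fetch(artifact_url)      # slow: proportional to artifact size and distance
        if len(self.entries) >= self.capacity:
            self.entries.pop(next(iter(self.entries)))   # crude eviction; a real cache would use LRU or similar
        self.entries[artifact_url] = data
        return data

    def prefetch(self, artifact_url):
        # Pull the artifact closer *before* any invocation needs it, off the cold start critical path.
        self.fetch(artifact_url)

# Chaining the layers (remote_store_get is whatever fetches from the blob store):
# cluster_cache = ArtifactCache(remote_store_get)
# node_cache = ArtifactCache(cluster_cache.fetch, capacity=20)
```

The capacity knob at each layer is exactly the storage-versus-latency trade-off from a moment ago: bigger caches mean more copies and more storage cost, smaller caches mean more slow fetches during cold starts.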
Now, this optimization, which we call prewarming, is basically about predicting a cold start and doing it before it actually occurs, so that when the function invocation comes, it's warm. This involves prediction, and as usual, the prediction can have lots of false positives or lots of false negatives, and it can try to be as accurate as possible. This kind of predictive prewarming exists in many other places: proactive autoscalers are an example, and predictive prefetching in caches is closely related. Quickly, to show a picture: the x-axis is time, the y-axis is the number of instances, and the red line is the number of instances you actually need. At times t1, t2, and t3, nothing is happening. At t4 you get one request for the function, and now you have to wait until t5 for the function to be deployed and so on. Then, when the requests go away, the deployment eventually goes back to zero. So it lags. That's the normal state of things for a serverless function system. But with prewarming, you hopefully predict, before t3, that an execution is going to come in, so by t3 the function is already ready, because you predicted it before the invocation came in. And when the invocation comes in at t4, you can simply route to that function. This involves a lot of hand-wavy magic. You somehow have to predict that the function is going to execute. You can do this by analyzing the workload of that function, the stream of events, and saying, OK, at so-and-so time there are usually so-and-so many events coming in. Some kind of rules can be inferred, you can recognize patterns, you can do machine learning, although that's more useful for predicting scalability. But one really interesting thing to exploit is that you can learn about upcoming invocations of that function from systems that you know and control; we'll talk a little bit about that in a second. Again, with prewarming you have the trade-off that you can optimize either for false positives or for false negatives, and that, again, is a cost versus performance trade-off: very low latency versus very low cost. One really common trick that people do with Lambda is to have some sort of system continuously prewarming your function. You exploit the fact that AWS keeps your function up for several hours, and you just ping your function every 10 minutes or so to keep it alive. This trick is quite common, and it works OK if your function load is not very high. But if your function gets a sudden burst of requests, you are guaranteed to have cold starts, simply because you need to scale out the function.
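Just to make that trick concrete, here's roughly what such a pinger looks like; the URL is a placeholder for your function's HTTP endpoint, and this is a sketch rather than production code.

```python
import threading
import urllib.request

def keep_warm(url, interval_seconds=600):
    """Ping the function every ~10 minutes so the provider keeps one instance alive."""
    def ping():
        try:
            urllib.request.urlopen(url, timeout=5).read()
        except OSError:
            pass   # a failed ping just means the next real invocation may be a cold start
        threading.Timer(interval_seconds, ping).start()
    ping()

# keep_warm("https://example.com/my-function?warmup=true")  # hypothetical endpoint
```

Note that this keeps exactly one instance warm, which is why a burst that needs to scale out still sees cold starts.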
So coming back to being able to predict function invocations, a really interesting area is workflow systems. If you've seen AWS Step Functions, or what I think is called Logic Apps in Azure, and similar things, there's this idea of defining a DAG, a directed acyclic graph of functions, basically a flow chart, where each node is a function execution and control flow is defined through that graph. We work on an open source system called Fission Workflows that does this, but if you're not familiar with it, you can think of Step Functions or really any kind of DAG-based workflow system. And the interesting thing here is that, given any particular node, you know what's coming next. I won't go into the example on this slide too much, but it takes an image and runs some kind of learning-based interpretation of it. One nice trick you can do with workflows is the notion of a horizon. For any given state of the workflow, there's a horizon of all of the invocations that are coming next. So you have some kind of directed acyclic graph like that, and nothing has started yet. You start with that left node, and now you can prewarm everything that comes after it. So the node on the left is still running, but you've already started the cold starts of everything to the right of it, and so on. That node finishes, those three run, and meanwhile the horizon is whatever comes next. So those run, and so on. There's a horizon of things that are running, and right in front of it is a horizon of things that are prewarmed. This is the idea of exploiting a particular system that has more knowledge of the future than the functions system, and you can imagine other use cases like that. So in this model, there's a workflow engine that's doing the prewarming.
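Here's a rough sketch of that horizon idea: given a DAG of functions and the set of nodes currently running, prewarm every node that could run next. The graph representation and the prewarm call are made up for illustration.

```python
def execution_horizon(dag, running):
    """dag: {node: [successor, ...]}; running: set of nodes currently executing.
    Returns the nodes that could start as soon as the running ones finish."""
    horizon = set()
    for node in running:
        horizon.update(dag.get(node, []))
    return horizon - running

def prewarm_horizon(dag, running, prewarm):
    # The workflow engine knows the future better than the functions platform does,
    # so it asks the platform to get the next functions warm before they're invoked.
    for function_name in execution_horizon(dag, running):
        prewarm(function_name)   # hypothetical "make this function warm" call on the platform

# Example flow-chart-style DAG: a fans out to b, c, d, which all feed into e.
# dag = {"a": ["b", "c", "d"], "b": ["e"], "c": ["e"], "d": ["e"], "e": []}
# prewarm_horizon(dag, running={"a"}, prewarm=print)   # would prewarm b, c and d
```

The engine can re-run this on every state transition, so the horizon of prewarmed functions stays one step ahead of the horizon of running ones.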
So, a quick recap of the four things we talked about. We showed a model for functions platforms. We talked about reusing function resources and the trade-offs there. We talked about pooling to improve cold start times. We talked about prefetching to improve the time it takes to transfer the function artifact. And we talked about various prewarming tricks, especially with workflows. So before I take questions, I have a small wish list, or maybe a prediction list, of what I hope functions platforms will do in the future. One thing I'd really love to see is more of the public cloud systems exposing some of these trade-offs to the user. All of them make cost versus performance trade-offs, but we, as users, simply consume those trade-offs; we don't really get to choose where our functions lie on them. So I'd love to see something like: when I create a function, I also get to set my desired goal latency and my desired goal throughput. And I'd also love to see some kind of prewarming hint API, where I get an API that says, hey, this function is going to run real soon. The platform doesn't have to do anything with that information, but statistically it should improve cold start times. I'd love to see something like this, and at least in the open source system that we're working on, we hope to implement something like it. So, some links if you want to explore Fission: it's fission.io, it's open source on GitHub, and the workflow system is also there. We're on Twitter at fissionio, and I'm on Twitter at soamv. Thanks. And I can take about five minutes of questions. Anyone? Going once? Yes? Yeah, no FaaS platform that I know of actually exposes all of that information to you. To repeat the question: can we, as users of the system, understand when the function needs another cold start and cause it ahead of time? Oh, that part is actually more like setting the goal; that's the future. Oh, sorry, I misunderstood. OK, so your question is: how do I identify the actual achievable throughput of one function instance? Well, you can benchmark it; you have to guess your workload and test against it. You can also measure it: there are a lot of serverless monitoring tools. IOpipe, for example, can wrap your function and send information to their system, and you get these nice dashboards of exactly how much throughput you're getting from a function instance. Oh, all of the functions platforms at least try to abstract away the auto-scaling from you, so you don't generally have to manually scale the function up and down. It's kind of the goal of serverless to make you not care about that. Yeah, that's a good point, and I think the AWS documentation does talk about it. Sorry, to repeat the question: is there a spectrum between cold and warm, or do we only have those two points? What are the points in between completely cold and completely warm? I think some things like prefetching are a kind of partial prewarming, taking some of the steps out of the cold start, and pooling is a way of having kind of half-warm things. But another thing I've seen in the AWS documentation is this idea of freezing a function, where it still holds the memory it's using but stops using CPU. The process is stopped and in some sort of not-ready state, and then it needs to be woken up by some other component. It's very close to completely warm and very cheap to restart, but it isn't consuming any CPU, so any background work the function might do is frozen. But there's probably more; I'm sure there are a ton of optimizations inside public cloud FaaS platforms that I, as an outsider, have no idea about. Any other questions? All right. Thanks, everyone.