Hi, folks. Good evening. My name is Manasi Bhattak, and I'm the founder and CEO of Verda. We are an ML infrastructure company providing a platform for the delivery, operations, and management of ML models. Today I'm going to focus on a topic that has come up time and time again with our clients and partners: when you're serving ML models, does it make sense to adopt a serverless infrastructure? So with that, let's get started.

A bit of background on myself as well as Verda. I founded Verda based on my PhD work at MIT on an open source model management and versioning system called ModelDB. Just as Git is the de facto version control system for source code, we found that models didn't have an equivalent way to be versioned or managed. So we built this open source system, which is now maintained by Verda, and we have expanded it significantly into an end-to-end MLOps platform. Just as DevOps and the tooling around it has enabled software teams to ship code more frequently and more reliably, the Verda MLOps platform helps ML teams ship models more frequently and in a reliable fashion.

The reason this talk and the work behind it came about is that we help a lot of clients and ML teams deploy their ML models, and serverless is a very appealing paradigm in this case: it provides the ability to scale, and it promises to reduce costs because you don't keep resources around that you don't need. So we decided to run a set of benchmarks comparing ML inference in a serverless setting to a non-serverless setting, so that we could give the best possible advice to our clients. Today I'm excited to share some of that work with you all. Without further ado, let me get started.

In the remainder of my talk I'll cover what serverless is and why it is interesting, the unique considerations that come into play for ML serving, then the benchmark and its results, and I'll wrap up with some key takeaways and how you can determine whether serverless is appropriate for you.

So, serverless: what is it and why is it interesting? If you think about the different ways you can run software applications today, you'll find there are largely four categories. First, you can start with a bare metal server, manage the hardware, and run applications on top of it. The next step up the chain is virtual machines: these abstract away the hardware, and you can have multiple guest machines running on your host machine. Further up the stack we have containers, which abstract away the operating system entirely, leaving only the application and its dependencies. At the very top we have serverless, where we're only thinking about the function we want to execute; we're not thinking about how to run it or how to scale it. That's the value added by a serverless platform. As you might imagine, going from bare metal to serverless means going up the abstraction chain: it becomes easier to use, but you get less control and less flexibility as a result. Today I'm going to focus on serverless and containers and compare these two as ways to run ML workloads.

Serverless can be a confusing term for many people, so let me throw out some definitions first. Here is an abstract one that I really like, by Martin Fowler: serverless is, at its most simple, an outsourcing solution.
What that means is that with serverless you don't need to provision or manage the servers. Your application also doesn't have a long-running server component: there's no loop sitting around waiting for requests. Instead it is a much more event-driven paradigm, where every time a serverless function is invoked, the underlying platform spins up one or multiple copies of the function, and those are used to service requests. As a developer you're only writing the code or business logic; you don't need to figure out how to run or deploy it, because the platform takes care of that. In addition, the platform takes care of scaling resources up and down. If you get a burst of traffic because your app is very, very popular, the serverless system will scale up to support it. In contrast, if it's a holiday and no one is using the workplace-productivity app you've built, it will scale to zero until there are requests again. Finally, serverless has many different flavors; what I'm talking about in this talk is function as a service. Some of the popular serverless systems are AWS Lambda, Google Cloud Run, Cloud Functions, and so on.

So that's what serverless is. Let's quickly look at why it is interesting: some of the pros and cons when you choose to adopt a serverless setting. First of all, serverless is easy to use: developers focus on business logic and don't need to worry about how the application is going to run. Second, it's able to scale: as you get more requests and your workload varies, more copies of the serverless function are created automatically, and it scales to zero when your application isn't receiving requests. Overall this means the maintenance overhead of the infrastructure is very low; you don't need to manage nodes, perform upgrades, tune resource requirements, and so on. As a result, the total cost of ownership, and in certain cases (depending on the workload) even the infrastructure cost, can be lower for a serverless system than for a non-serverless one. Some popular applications that are good fits for serverless include asynchronous message processing, IoT workloads, stream processing, and so on.

Those are cases where serverless is a good fit, but that doesn't mean it's a good fit everywhere. For instance, serverless applications are stateless, so if you have an application that does require state to live inside it, serverless may not be a good fit. Second, there are implementation restrictions: whether you're talking about Lambda or Cloud Run, there are limits on how long a function can run, the resources it can ask for, concurrency, and so on. In addition, you cannot choose or control the hardware, so if you're looking for a particular processor or accelerator, the serverless platform might not support it, whereas a Kubernetes-based platform likely will. And finally, because serverless applications scale to zero when they're not in use, whenever the first request comes in and there are no instances running, there can be a large latency resulting from the cold start: you're spinning up the first instance of the function and downloading dependencies, and that can take a while.

So given this background, when does serverless usually make sense? Well, it makes sense when the application can be made stateless. This is usually not a deal breaker.
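To make the function-as-a-service model and that statelessness point concrete, here is a minimal sketch of what a Lambda-style inference handler might look like. The bucket name, model key, and payload format are illustrative assumptions, not code from the benchmark; the point is simply that the model lives outside the function and the handler itself holds no state between invocations.

```python
# Minimal sketch of a Lambda-style inference handler (illustrative only).
# The bucket name, model key, and payload format are hypothetical assumptions.
import json

import boto3
import joblib  # assumes a scikit-learn-style serialized model

MODEL_BUCKET = "example-model-bucket"    # hypothetical
MODEL_KEY = "models/churn-model.joblib"  # hypothetical
LOCAL_PATH = "/tmp/model.joblib"         # /tmp is the writable scratch space on Lambda

_model = None  # cached at module level, so warm invocations of this instance reuse it


def _load_model():
    # Download and deserialize the model once per instance; this is the cold start cost.
    global _model
    if _model is None:
        boto3.client("s3").download_file(MODEL_BUCKET, MODEL_KEY, LOCAL_PATH)
        _model = joblib.load(LOCAL_PATH)
    return _model


def handler(event, context):
    # Event-driven entry point: no long-running server loop, no state kept in the function.
    features = json.loads(event["body"])["features"]
    prediction = _load_model().predict([features])[0]
    return {"statusCode": 200, "body": json.dumps({"prediction": float(prediction)})}
```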
You just store the state in a database, a file store, a blob store, and so on. The second condition is that resource requirements are modest, so they fit within what the serverless platform expects as typical resource requirements. The third is that performance requirements or SLAs are not super stringent, because, as mentioned earlier, cold start is a real issue; if there are tight performance requirements, around latency in particular, then serverless needs to be used with care. Serverless also makes sense when query workloads are not steady: if you have a very steady workload that uses your servers pretty effectively, serverless doesn't make as much sense, because it isn't going to save you resources and therefore costs. And finally, serverless really helps with reducing infrastructure burden, so if your team carries a lot of burden around managing infrastructure, serverless would make sense.

With that, we can look at what ML serving entails and how serverless plays with it: the unique considerations for ML serving. First of all, ML serving means making predictions against a trained model. There are different stages of the ML lifecycle: first you're cleaning and preparing the data, then you're training the model, and then you're making predictions against the trained model. We're talking about that third piece of the ML lifecycle. The first consideration that's unique to ML is that models can be large. What we mean by that is that a model can have millions and millions of parameters, and as a result the serialized version of the model can run from hundreds of megabytes to many gigabytes. DistilBERT, one of the smaller NLP models, is actually 256 MB. Compared to a vanilla Python function you might run, that's fairly large, and we'll see how we run into problems when running a DistilBERT model in a serverless setting. We run into similar challenges with ML libraries: they are varied and can have a lot of dependencies, and as a result, loading these libraries can hit some of the walls imposed by the resource restrictions of serverless platforms. Finally, ML models may need to be served on GPUs or TPUs, where they're more efficient, and serverless systems currently don't enable that.

So we now have a better sense of where serverless makes sense and how ML serving workloads compare to that situation. Next, let's look at the benchmark. Our goal with this benchmark was to identify when it makes sense to use serverless for ML serving, and we evaluated that question on a variety of systems. Today I'm going to focus on three systems. These are managed services, and we picked them because we found them to be the most mature among the offerings out there. The first is AWS Lambda, the state of the art in serverless. The next is serverless on Kubernetes, using the Knative project; we chose the Google Cloud Run platform to perform the benchmark. Finally, for a container-based platform, we used the Verda system: Verda packages and runs models as containers and does the scaling automatically, similar to a serverless setting.

Let's quickly look at how each of these systems works, because that will help inform the analysis we'll perform shortly. AWS Lambda is the managed serverless platform. You write code for the Lambda function.
You upload the packaged code to S3, along with any other dependencies. Then, every time the Lambda is triggered via an event or an HTTP request, the platform spins up a Lambda instance, executes it, and scales it up and down as required. When the request is done and there are no other requests, it scales to zero. That's generally how Lambda works on AWS.

If we look at Google Cloud Run, it's pretty similar, except now we're talking about containers as opposed to functions written in a particular language. In this case, the end user uploads a Docker container to the container registry and tells Google Cloud Run about the container and how it should be invoked. Then, every time there is an event or HTTP request, it triggers a Cloud Run execution, and as before, the platform manages all resources, scaling, and endpoints.

Finally, with Verda, as I mentioned before, the platform runs models as containers on Kubernetes. The model is uploaded to the Verda platform, you can optionally specify the resource and hardware requirements you want met, and then you deploy the model on Verda. Notice that this step was missing in the serverless workflows. That's because with serverless, instances are spun up on the fly, whereas in this case we will always have at least one replica running for the model. That reduces our cold start cost, but it does mean that even if there are no requests, we still have the model running at all times. Then, similar to the serverless platforms, Verda manages the endpoints and scaling of the models.

The last part of the benchmark spec is metrics. We measure warm start latency, cold start latency, auto-scaling latency, and also usability concerns: can you actually use serverless for this workload, or are there hard barriers that make it impossible? Finally, for the workloads, we tested these systems on a variety of models, including NLP, computer vision, and traditional ML models. Today I'll focus on just a couple of NLP models, because the trends are pretty similar across the board, and we have different workloads that vary QPS, or queries per second.

Before I go to the results, I do want to point out a few caveats. The first is that serverless for Kubernetes is still evolving, so the numbers I'm presenting here are based on the capabilities and software available today; if you rerun them in two months, they might be different. That's why this is a living benchmark: you can find the full set of results at that URL and also learn how you can run the benchmark yourself. The second caveat is that we're using managed services in this benchmark, and as with all managed services, there are knobs that cannot be seen or controlled by the end user, and there are likely optimizations performed under the hood, which makes it a bit challenging to perform an apples-to-apples comparison. The best we can do, and what we've done in this benchmark, is to use the best practices recommended by the cloud providers and the off-the-shelf settings for the managed services.
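As a rough illustration of how latency numbers like the ones that follow can be gathered from the client side, here is a back-of-the-envelope sketch of measuring prediction latency percentiles against an HTTP endpoint. The endpoint URL and payload are placeholders, and this is not the actual benchmark harness, which is available at the URL mentioned above.

```python
# Back-of-the-envelope latency measurement against a deployed prediction endpoint.
# The endpoint URL and payload are placeholders; this is not the benchmark harness itself.
import statistics
import time

import requests

ENDPOINT = "https://example-inference-endpoint/predict"  # placeholder
PAYLOAD = {"text": "serverless inference benchmark"}      # placeholder


def measure(n_requests: int = 200) -> None:
    latencies = []
    for _ in range(n_requests):
        start = time.perf_counter()
        requests.post(ENDPOINT, json=PAYLOAD, timeout=60)
        latencies.append(time.perf_counter() - start)
    latencies.sort()
    p50 = statistics.median(latencies)
    p95 = latencies[int(0.95 * len(latencies)) - 1]
    p99 = latencies[int(0.99 * len(latencies)) - 1]
    print(f"p50={p50 * 1000:.0f} ms  p95={p95 * 1000:.0f} ms  p99={p99 * 1000:.0f} ms")


# The first request after a period of no traffic approximates cold start latency;
# the remaining requests approximate warm start latency.
measure()
```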
So let's first get started with the results around usability, because if you're not able to even use the serverless platform for a particular model, the quantitative numbers don't matter. The biggest hurdle we found in using serverless for ML has to do with the restrictions placed on resources. Here I'm listing the restrictions that Lambda and Cloud Run place on memory, disk, and CPU. What this means is that if you require more than 3 GB of memory, Lambda is not an option for you: you cannot use that serverless platform. Similarly, if you need more than four virtual CPUs, it's a no-go on Google Cloud Run; you're just not going to be able to do that. Now, with ML, as we discussed before, models and libraries tend to be rather large, so it happens more often than not that your ML model doesn't fit within the constraints imposed by the serverless platform.

To give you two examples: the first has to do with DistilBERT. To run DistilBERT, you have to install Torch and the Transformers library from Hugging Face, and together they are larger than 500 MB. For Lambda, that falls outside the realm of what's doable, so we had to surgically remove pieces of the libraries and get creative with model loading so that we could actually use DistilBERT on a Lambda. If your model is larger than the resource constraints allow, you need to get very creative, and significant wrangling is involved. The second example is one we have seen with multiple clients: an embedding model for an entity, plus a nearest-neighbor lookup model that finds the most similar entities. Because this model includes an index that can be pretty large, the model ends up being larger than 20 GB. That's way outside the realm of any of the serverless platforms, so you cannot use a serverless platform to serve this particular model.

The last thing I want to highlight here is configuration options. ML libraries and low-level linear algebra libraries have optimizations that can be tuned via environment variables. In a serverless setting, however, these environment variables, or the hardware that determines what they should be, are not exposed to the user, so it's hard to set them correctly.

The results we've looked at so far have to do with usability: can you actually use serverless for this workload? Next we'll look at some numbers that assume you've gotten your model running on serverless: how effective or efficient is it then? The first thing we'll look at is warm start prediction latency. Warm start means there's already a serverless instance running, so there's no need to download the model or spin up an instance from scratch. We see here that AWS Lambda and Google Cloud Run are pretty comparable across the P50, P95, and P99 latencies. Interestingly, the Verda platform is actually faster by 2x. This can be attributed to more control over the environment, whether that's the processors we use or specific environment variable settings. Since Lambda and Cloud Run are closed source, it's hard to identify exactly why Verda runs 2x faster, but it likely has to do with the environment we're running in. Here I'm noting the configurations we used for all three systems; we have tried to keep the configurations comparable so that we're doing an apples-to-apples comparison.
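To give a flavor of the kind of environment tuning that is possible when you control the container, but generally not on a managed serverless platform, here is a small sketch. The specific thread counts are illustrative assumptions, not the settings used in the benchmark.

```python
# Sketch of thread tuning for CPU inference when you control the environment.
# The values are illustrative assumptions; on managed serverless platforms these knobs
# (and the hardware behind them) are generally not exposed.
import os

# Pin the BLAS/OpenMP thread pools to the vCPUs actually allocated to the container.
os.environ["OMP_NUM_THREADS"] = "4"
os.environ["MKL_NUM_THREADS"] = "4"

import torch  # imported after setting the variables so the thread pools pick them up

torch.set_num_threads(4)          # intra-op parallelism for operators like matmul
torch.set_num_interop_threads(1)  # inter-op parallelism between independent operators
```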
The next result has to do with cold start prediction latency. This is the case where no serverless functions are running already, so it's the time required to spin up a new instance of the serverless function and then service a request. Here we find that because Verda always has at least one replica running, it's extremely fast to make the first prediction. On the other hand, in the Lambda and Cloud Run cases, it takes several seconds to spin up the first instance of the serverless application, and we find that the latency is in the tens of seconds, as opposed to less than a second for Verda.

The previous results were about time to first prediction. The next set of results is for scaling latency. One of the biggest advantages of serverless is its ability to scale based on the workload, so here we're comparing the time required to reach steady state. In this case we were testing at 100 QPS, with one worker per query, and we say we've reached steady state when we start receiving 100 successful responses per second. Here we see that Google Cloud Run is quite fast, at 33 seconds. Lambda is twice as slow. Verda is significantly slower, and this is to be expected, because we're relying only on Kubernetes autoscaling: there are no optimizations being performed to always have a warm replica ready to go, and so on. The optimizations a serverless system might implement for autoscaling are not present in container-based scaling on vanilla Kubernetes.

The final result is what happens when we vary the model size. Here we're comparing DistilBERT against BERT; BERT is twice as large as DistilBERT. We find that this does impact latency: the P95 is twice as high. The time to first request, however, is not that much longer, although for AWS Lambda, Cloud Run, and Verda we do see some degradation in the time to service the first request.

So that's a quick overview of some of the key results comparing serverless and non-serverless for ML serving. Before I wrap up with the key takeaways, I do have a note on cost, because cost can be one of the key reasons teams decide to go the serverless route. This comes down to two things. First, if we're talking only about pure infrastructure costs, then depending on the workload and the amount of resources necessary, the pure infrastructure cost may be comparable or even higher in a serverless setting than in a non-serverless one. So if your query workload is steady and your queries utilize the servers well, it might be cheaper to just use a non-serverless solution. However, one place where serverless does shine is TCO, or total cost of ownership, which is a combination of infrastructure costs, maintenance costs, and development costs. If TCO is extremely important to you, then serverless would make sense; otherwise, it requires a more nuanced view of the infrastructure costs you're actually signing up for.
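As a back-of-the-envelope illustration of that trade-off, here is a sketch of the kind of arithmetic involved. Every number in it, including the request volume, the per-GB-second price, and the replica price, is a made-up assumption for illustration, not pricing or a result from the benchmark.

```python
# Back-of-the-envelope cost comparison. Every number here is a made-up assumption
# for illustration only; it is not pricing data or a result from the benchmark.
REQUESTS_PER_MONTH = 100_000_000   # a steady ~38 QPS workload (assumption)
SECONDS_PER_REQUEST = 0.2          # assumption
GB_PER_INSTANCE = 2                # memory configured per serverless instance (assumption)

# Hypothetical serverless pricing: billed per GB-second of execution time.
PRICE_PER_GB_SECOND = 0.0000166    # assumption
serverless_cost = (REQUESTS_PER_MONTH * SECONDS_PER_REQUEST
                   * GB_PER_INSTANCE * PRICE_PER_GB_SECOND)

# Hypothetical always-on containers: a few replicas running around the clock.
N_REPLICAS = 4                     # assumption
PRICE_PER_REPLICA_HOUR = 0.10      # assumption
container_cost = N_REPLICAS * PRICE_PER_REPLICA_HOUR * 24 * 30

print(f"serverless  ~ ${serverless_cost:,.0f} per month")
print(f"always-on   ~ ${container_cost:,.0f} per month")
# With a steady, well-utilized workload the always-on option can come out cheaper;
# with a bursty or mostly idle workload, scale-to-zero usually tips it the other way.
```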
Alright, so with that, let me quickly summarize what we learned from our benchmarking results; hopefully this helps you make your own decision on whether serverless is a good fit for your ML serving use case. First of all, serverless solutions have hard limits on resources, so if your model doesn't fit within those limits, serverless might not be a good fit for you. Second, the ability to configure the hardware can lead to better performance, and this is easier in non-serverless systems; this is the case where the Verda system was 2x faster, and it came down to the environment we were running in. Next, scaling with the serverless platforms is much faster than the vanilla autoscaling we could do in Kubernetes, even with custom metrics; these platforms perform special optimizations for autoscaling that aren't provided out of the box when you run containers on Kubernetes. And finally, the query pattern and workload impact cost, so to assess whether serverless is cheaper than non-serverless for prediction requests to ML models, you need to look closely at your query patterns and workloads.

Those were the key points I wanted to convey regarding the benchmark. Please check out the Verda.ai serverless inference benchmark for the full set of results and to learn how you can run the benchmark yourself. With that, thanks very much for your attention. I would love to continue the conversation around the benchmark, so please feel free to reach out to me at Verda with questions, and I'd be happy to field them. Thank you.