So, hola. Buenas tardes. Hello. Thank you for coming to our talk, Enhancing the Performance Testing Process for gRPC Model Inferencing at Scale. Yes, that is a mouthful, 20-plus syllables. I am Paul Van Eck, and this is my colleague Ted Chang. We're both from IBM's Silicon Valley Lab in California. Still a bit jet-lagged, very tired, but we're here to talk about what we use for production model deployment: KServe ModelMesh, which got a wonderful introduction in Alexa's talk earlier today. So there's a little bit of overlap, but it will be a good recap. Then Ted will talk through our performance testing and monitoring setup, and we'll finish with a brief demo.

First, let's talk about model deployment in general. It's an integral part of the MLOps pipeline and the AI lifecycle: you do your training, and then, obviously, you need to host the model somewhere for your applications to consume it. Deploying a model as a microservice is probably the most common model serving strategy, especially in production settings. In this paradigm, models are exposed via an API endpoint, whether REST or gRPC, and clients make inference requests with their payloads and receive inference responses with what they need.

And since we're talking about microservices, Kubernetes is of course the deployment target of choice. Kubernetes, as you probably all know, solves a whole host of problems for you: scaling your service up and down, spinning up new model server pods as request load goes up and down, zero-downtime model upgrades using rolling deployments, and, of course, resiliency, where failed model serving pods are restarted automatically. But even with all this, getting to production-grade model serving is tough. There are additional complexities to be wary of, complexities everywhere. There are quite a few things to consider, especially when you're dealing with a variety of frameworks: how to containerize them for Kubernetes, how to deploy them, how inference requests should be formatted, and, of course, making sure inference response times are acceptable, especially when deploying at scale. These are some of the things a user has to navigate when considering production-grade model serving on Kubernetes.

So, back to KServe. For those who were here earlier in the morning, Alexa gave a good introduction to KServe, so I'll keep it brief. KServe is an incubating project in the LF AI & Data Foundation; it was recently added there, having originally been under the Kubeflow umbrella. It enables highly scalable, standards-based model inference on Kubernetes for trusted AI. The key features are serverless inference workloads: you can serve your models in a serverless manner using Knative. You don't have to use Knative and Istio with KServe, though; you can also use what we call raw deployment mode, which just uses native Kubernetes resources like Deployments, Ingresses, and Services to deploy your model. Another good thing about KServe is that it provides a standardized inference protocol, what we call the V2 protocol, and this is something that's been collaborated on and developed in the community, especially with folks from Seldon and NVIDIA.
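To make the protocol concrete, here is a minimal sketch of what a V2-style REST inference request can look like. The host, port, model name, and tensor values are illustrative placeholders, not anything from the talk (Node 18+, run as an ES module):

```javascript
// Minimal sketch of a V2 (Open Inference Protocol) REST inference request.
// The endpoint, model name, and input tensor below are illustrative placeholders.
const body = {
  inputs: [
    {
      name: "input-0",           // tensor name the model expects
      shape: [1, 4],             // batch of one, four features
      datatype: "FP32",
      data: [6.8, 2.8, 4.8, 1.4]
    }
  ]
};

const resp = await fetch(
  "http://localhost:8008/v2/models/example-model/infer",
  {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify(body)
  }
);

// The response carries an "outputs" list with name/shape/datatype/data entries.
console.log(await resp.json());
```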
So, the good thing about this protocol is that your inference requests will work across different cloud providers and model serving runtimes like Triton or Seldon's MLServer; I think even PyTorch's TorchServe supports the V2 protocol. KServe also provides an easy-to-use YAML where you just supply a storage URI, which points to your model file. KServe handles the rest and determines what type of container or image the model should be loaded into. You apply the YAML, and KServe spins up the Kubernetes deployments necessary to serve the provided model for inference (there's a rough sketch of such a manifest a little further down).

As mentioned earlier in Alexa's talk, there are scalability concerns with this pod-per-model paradigm. Each inference service has a resource request, and given how Kubernetes does resource allocation, you'll quickly hit the limits of your node. The kubelet also has a maximum pod limit, 110 by default, and Kubernetes best practice suggests you shouldn't run more than about 100 pods per node. There's also the matter of IP address limitations: each pod, and each replica of a pod, has its own IP address, so if you're dealing with thousands of models, you might quickly exhaust your IP addresses. All of this leads to the notion that you need to be able to serve multiple models in a single pod, to reduce overhead and avoid hitting these limitations.

That leads us to ModelMesh. This is something I've worked very closely with; it was open sourced last year as part of the KServe organization, and it's intended to serve as the multi-model serving backend for KServe. ModelMesh has been used in production for several years, underpinning a lot of the Watson services, like Watson Assistant, Natural Language Understanding, and Discovery. For services like that, you may have hundreds of thousands of models, and not all of them are in active use: someone might create a model and not perform inference on it for several weeks, so you don't necessarily have to keep it available in memory. ModelMesh has the capability to page out dormant models that aren't receiving any traffic and load them just in time if they do receive traffic. This falls in line with the Knative serverless approach, but within a single container. ModelMesh handles the intelligent loading and unloading, trying to find a good trade-off between responsiveness to users and the computational footprint.

As for the overall architecture, I'm not sure if you can see this, but as multiple predictors or models are deployed, compatible serving runtime deployments are scaled up to host them, so you might see several serving runtime deployments; by default there are two replicas per serving runtime. In each pod there are three containers: the ModelMesh container; an adapter, or puller, container for retrieving models from the external object store, like S3 or Google Cloud Storage; and the model server itself, a third-party serving runtime or inference server like Triton or MLServer. All of these support loading multiple models into one container.
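Coming back to the manifest mentioned above: here is a rough sketch of what an InferenceService for a ModelMesh-managed model can look like. It's shown as a JavaScript object for consistency with the other examples in this write-up; in practice you'd write it as YAML and apply it with kubectl. The name, format, and storage URI are placeholders.

```javascript
// Rough shape of a KServe InferenceService that ModelMesh will manage.
// Normally written as YAML and applied with kubectl; shown here as a
// JavaScript object purely for illustration. Names and URIs are placeholders.
const inferenceService = {
  apiVersion: "serving.kserve.io/v1beta1",
  kind: "InferenceService",
  metadata: {
    name: "example-mnist-onnx",
    annotations: {
      // Tells KServe to place this model on ModelMesh instead of a
      // dedicated pod-per-model deployment.
      "serving.kserve.io/deploymentMode": "ModelMesh",
    },
  },
  spec: {
    predictor: {
      model: {
        modelFormat: { name: "onnx" },
        // The storage URI points at the model file in object storage;
        // ModelMesh pulls it and decides which runtime pod loads it.
        storageUri: "s3://example-bucket/onnx/mnist.onnx",
      },
    },
  },
};

// Emit it as JSON, which kubectl also accepts (kubectl apply -f -).
console.log(JSON.stringify(inferenceService, null, 2));
```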
The ModelMesh sidecar containers are all connected together and help route traffic, which comes in through a single Kubernetes Service at the top. External inference requests are made via that service, and the ingress ModelMesh pod, which could be any of them, routes the request on to other pods as needed; that's what creates the "mesh" in ModelMesh. To coordinate operations and persist model and instance state, an etcd cluster, or instance, is used.

Briefly, some of ModelMesh's key features. First, cache management: ModelMesh treats the set of pre-provisioned serving runtime pods on Kubernetes as an LRU cache and decides which models are loaded or unloaded based on usage recency and current request volumes. Next, model placement is done in a way that balances both the cache age across pods and the request load; that just means more commonly used models are placed on pods that perhaps aren't getting as much traffic, and models that aren't getting much traffic, or whose traffic is sporadic, are placed on less utilized pods. Then there's resiliency: if, for some reason, the object store is unavailable, it'll keep retrying. And there's operational simplicity: rolling updates are handled, so traffic will continue to be routed to the older model as you're deploying a newer one, and once the newer model is loaded in the cache and available to use, traffic switches over to it.

So ModelMesh does have quite a lot of moving pieces, especially when you're running it in production for thousands and thousands of models, and there are several aspects that can really impact the effectiveness of the platform: varying configurations, environments, runtimes, and the different types of models you might load. This is why performance testing automation and monitoring are critical. As we develop ModelMesh and as pieces change, performance testing automation becomes very important. This is a platform that serves hundreds of thousands of models, so we need to be able to validate performance with different configs and code bases, different dependencies, models, and runtimes. We also need to ensure that new code doesn't introduce performance regressions, and we need to keep doing this periodically throughout the development cycle; we don't want to test at the very end of the release and be caught by surprise by some performance degradation. By creating automation for this, adding performance testing to our CI/CD process, and building tooling for effectively monitoring ModelMesh performance, we find costly defects sooner, and that saves a lot of time and money.

With that, I'll pass it off to Ted, who will show our performance testing setup and what we use for performance monitoring. So, as Paul mentioned, there are many critical reasons for running performance tests. Now I will talk about how we actually run them. For us, it boils down to three things. The first is that the automation framework should be able to run repeatable tests for every code release.
The second is the load driver: it should be flexible in how workloads are written, the same workload should be able to run tests over both HTTP and gRPC, and it should be fast with low overhead. We're currently using k6; we've tried other load drivers before, and k6 is what we've settled on. The third is metrics monitoring: it should be able to monitor CPU and memory utilization from the worker nodes and the containers inside the Kubernetes cluster. Right now we're using the Prometheus operator.

This is our performance test environment. We have three nodes. The first node runs the automation; there we use KFP Tekton, which is Kubeflow Pipelines, with k6 inside the pipeline as one of the tasks. On the second node we have the Prometheus operator, which brings Grafana and Prometheus, and also InfluxDB. The third node runs ModelMesh; it's deployed on only one worker node because I'm using a very small deployment here.

Let's talk about the Kubeflow pipeline. Kubeflow, as you may know, is an umbrella project for many other smaller projects, and Kubeflow Pipelines is one of them. This particular Kubeflow Pipelines installation runs on top of the Tekton backend.

Next, the k6 tool. k6 is an open source load testing tool. It supports sending payloads over the HTTP and gRPC protocols, it's written in Go, so it's pretty quick, and you define your workload in JavaScript. It provides a CLI, so you can run a script with k6 run, and it provides threshold functionality so you can check whether your test has met a threshold and should pass or fail. It also provides an output integration for sending results to InfluxDB, so later on we can use a Grafana dashboard to visualize them.

Now, we've been talking about gRPC; gRPC is the main API for ModelMesh, which is why we keep mentioning it. gRPC plus protocol buffers is usually faster than a REST plus JSON interface, because gRPC is built on HTTP/2 and can send payloads in binary format.

The last thing I want to talk about is the Prometheus operator. This operator provides many things by default, such as the Prometheus database and the node exporter, which collects Linux metrics from your Kubernetes cluster, as well as Grafana, which can visualize the metrics so you can view them in your browser.

So, now I'll show you a demo. Let's switch over... let's take a look first. We have three nodes and then... this is because... why is it so slow? Okay. So, yes, we have three nodes: one node for the automation, and this particular node is for the ModelMesh deployment. Our ModelMesh deployment is running on a relatively small worker node; it has two Triton inference pods, only Triton, and the node itself has eight CPUs and 32 GB of memory. All the ModelMesh components are running on just one worker node. And this is the Grafana dashboard. It's set up to monitor the ModelMesh metrics, as well as cluster-level system metrics, such as CPU and memory utilization of the containers, and the k6 metrics from the load driver.

At this point, let's start a test. This is the Kubeflow Pipelines UI, and I'm going to start my test from it. This pipeline has two components: first, it deploys 20 more models, and then it runs a load test with the k6 driver (a sketch of such a workload script is shown below).
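For reference, a minimal k6 workload script along the lines described above might look like the following. This is a sketch, not the actual test script from the talk: the service address, proto path, model name, and tensor contents are assumptions.

```javascript
// Sketch of a k6 gRPC workload against a ModelMesh V2 inference endpoint.
// Address, proto file, model name, and tensor shape are illustrative placeholders.
import grpc from 'k6/net/grpc';
import { check } from 'k6';

export const options = {
  stages: [
    { duration: '30s', target: 40 },  // ramp up to 40 virtual users
    { duration: '60s', target: 40 },  // steady state
    { duration: '10s', target: 0 },   // ramp down
  ],
  thresholds: {
    grpc_req_duration: ['p(95)<100'], // fail the run if p95 latency reaches 100ms
  },
};

const client = new grpc.Client();
// KServe V2 gRPC definitions; the local path and file name are assumptions.
client.load(['./protos'], 'grpc_predict_v2.proto');

export default function () {
  client.connect('modelmesh-serving.modelmesh-serving:8033', { plaintext: true });

  const request = {
    model_name: 'example-mnist-onnx',   // placeholder model name
    inputs: [{
      name: 'input-0',
      shape: [1, 784],
      datatype: 'FP32',
      contents: { fp32_contents: new Array(784).fill(0.5) },
    }],
  };

  const res = client.invoke('inference.GRPCInferenceService/ModelInfer', request);
  check(res, { 'status is OK': (r) => r && r.status === grpc.StatusOK });

  client.close();
  // No sleep() here: each virtual user sends as fast as it can,
  // since the goal is to find the deployment's maximum capacity.
}

// Run with, e.g.:  k6 run --out influxdb=http://<influxdb-host>:8086/k6 script.js
```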
So, let's start the test real quick. Okay. While it's running, let me show you the number of models on my system. I have 10,000 models already deployed: 5,000 MNIST ONNX models, as we can see, and another 5,000 CIFAR PyTorch models, all of them very small. While the test is running, you can see the log here; you can see that 20 more MNIST ONNX models have already been deployed, and now it's running the performance test with k6. While k6 runs, it sends metrics to InfluxDB, which we can graph on the Grafana dashboard later.

Okay, let's switch back to the dashboard. These two numbers are the same because they're saying I have 10,000 models registered and 10,000 models loaded in the cache. These two numbers here are the number of models on each Triton runtime pod, roughly 5,600 and 5,400. You may be wondering why they don't add up to 10,000: it's because some models are replicated across the two pods. Right now, as the test is running, you can see that each Triton pod is taking about 1,000 requests per second, and the request latency is in the single digits, below 10 milliseconds or so. Here you can also see the total number of requests coming in on the ModelMesh server side; this is all server-side data. As the test runs, you can see CPU usage increase: the serving runtime containers have higher CPU, and so do the Triton containers. And this is the memory; you can see Triton uses the most memory.

Okay, I think our test has finished. Let me zoom in here so you can see a little better. These top three charts are from the k6 driver. The test I was running was roughly a 100-second load test; the first 30 seconds are ramp-up, so you see it gradually ramp up, then run for 60 seconds, then ramp down. On the client side, I'm using 40 virtual users to drive this test, and those 40 users are sending payloads as fast as they can. If you'd like to emulate real users, you may want to add some delay or timers for each user, but for this test I'm only interested in the maximum capacity of my ModelMesh deployment. As you can see, it can take about 2,000 requests per second for this particular model I'm testing, which is very similar to what the ModelMesh server metrics show as well. But the gRPC request duration is a bit higher on the client side: here it's reading about 15 to 20 milliseconds, while down here it's single digits. That's because the client side may have some network delay and may be spending CPU time processing the response data. So it's always good to look at the metrics from both sides. And the error chart is empty; we didn't have any errors.

Some of these charts show that when we initially loaded the 20 extra models, there was model loading and unloading activity here, and the loaded model size is here. We don't see any activity in the cache-miss chart because basically all the models are already in the cache. To exercise that corner case, I have some tests made for it as well, and I think we can run one just to show some cache-miss action. Okay, I'm going to run this pipeline; it only has one task in it, which runs a k6 stress test.
For this test, I'm going to run for five minutes, and I'm going to send payloads to 3,000 CIFAR PyTorch models and 3,000 MNIST ONNX models. The idea is that if I hit enough models, I'll hit some that haven't been used for a long time, and we should see some cache misses. So, let's switch over. While this is running, I guess we can take some questions, because we're almost at the end of our time.

We had a question on Slack: should I run ModelMesh if I just have a few models? Let me double-check. Yes: "Should I consider ModelMesh when I just have a few models? That is, before running into Kubernetes limits on pods, IPs, etc. What would you recommend?" Oh, it's my mic. Okay. So, if you just have a few models: the main use case for ModelMesh is the high-scale, high-density scenario, where you have thousands and thousands of models. If you only have a few models, I think just using KServe single-model serving fits the bill well enough; you can easily plug in transformers and explainers with KServe single-model serving. It's when you really need to load multiple models in and out of memory, to make room and utilize your cluster well, that you want to use ModelMesh. So if you're not already hitting those limitations, I think just using KServe is fine; it perfectly fits the bill.

Awesome. So, Ted, I wonder if you want to explain. Yeah, sure. If we don't have any more questions, let me show you: because we're sending requests to so many different models, we're now actually getting some cache misses. In real life, this may not be a good thing to do to your server, because it effectively breaks the LRU cache, but for testing, this is how you can test it. And because I'm sending payloads to two different kinds of models, I'm getting two different response times here, one for each model type, and I can toggle between the different models here, or show both.

So, that's it for our talk. Today we've shown you what ModelMesh is and how to run performance tests on it. We're actually using some of these tests, which are in the public ModelMesh GitHub repo, in our nightly build process. And we do have a few more talks coming up related to this co-located event, so feel free to attend those. Thank you very much. Thank you.

I think we have time for one or two more questions. Is there anyone in the audience with a question? There's a question in the back. "Thank you very much for the presentation. At the beginning of the presentation, you mentioned that you could pause models to save resources and money. I'm curious, how do you do this? How do you pause models you're not going to use, and then recover them?" So, one of the main components of ModelMesh is the third-party inference servers: we have NVIDIA's Triton, and it could be Intel's OpenVINO or Seldon's MLServer. All of these servers have load and unload APIs, where you can load models in and unload them. ModelMesh acts as a management layer on top of that, loading and unloading based on usage demand and usage recency. So ModelMesh handles the loading and unloading; it treats all of the serving runtime deployments and these servers as an LRU cache.
If all of the memory capacity allocated for these models is already in use, it will bump out the least recently used model, via the model server's unload API, and then load in, or scale up, the model being requested. That's the notion behind ModelMesh: making use of as much of the cluster capacity as possible. Do more with less.
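To make the LRU analogy concrete, here is a toy sketch, not actual ModelMesh code, of the load/unload policy just described: a fixed-capacity cache that unloads the least recently used model to make room and loads a requested model just in time. The loadModel/unloadModel callbacks stand in for a model server's load and unload APIs.

```javascript
// Toy illustration of the LRU load/unload idea described above (NOT ModelMesh code).
// Run as an ES module (e.g. node cache.mjs).
class ModelLRUCache {
  constructor(capacity, loadModel, unloadModel) {
    this.capacity = capacity;
    this.loadModel = loadModel;     // stand-in for a runtime's "load model" API
    this.unloadModel = unloadModel; // stand-in for a runtime's "unload model" API
    this.loaded = new Map();        // Map preserves insertion order: oldest first
  }

  async infer(modelId, payload) {
    if (this.loaded.has(modelId)) {
      // Cache hit: refresh recency by re-inserting at the end.
      this.loaded.delete(modelId);
    } else {
      // Cache miss: evict the least recently used model if full,
      // then load the requested model just in time.
      if (this.loaded.size >= this.capacity) {
        const lru = this.loaded.keys().next().value;
        this.loaded.delete(lru);
        await this.unloadModel(lru);
      }
      await this.loadModel(modelId);
    }
    this.loaded.set(modelId, Date.now());
    return `inference on ${modelId} with ${JSON.stringify(payload)}`;
  }
}

// Tiny usage example with stubbed load/unload calls.
const cache = new ModelLRUCache(
  2,
  async (id) => console.log(`load ${id}`),
  async (id) => console.log(`unload ${id}`),
);
await cache.infer('model-a', [1, 2]);
await cache.infer('model-b', [3, 4]);
await cache.infer('model-c', [5, 6]); // evicts model-a, the least recently used
```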