Hello, everyone. My name is Paul. Thank you for coming back to this session after a good lunch. Welcome to the Day of Edge. "Model Serving at the Edge Made Easier" is the name of my talk. I'm Paul Van Neck, an open source software engineer at IBM, based at the Silicon Valley Lab in California. My co-speaker, Animesh Singh, could not be here today, so it'll just be me, but he will be with us in spirit, or more likely asleep in California.

To give an outline for the talk: we're first going to go over edge computing and a model serving overview, then the notion of using Kubernetes for model serving at the edge, and then something called ModelMesh, which I will introduce as a way to easily manage higher-density model deployments on Kubernetes at the edge. We'll go through an example deployment, some challenges that were encountered, and some lessons learned.

As you probably already know, we are in the midst of a new generation of computing: edge computing. That's why we're here. As with each generation of computing before it, edge will impact every industry and force IT departments to adapt to new architectures, new deployment models, and new business models. The focus here is on bringing computing to offices, distribution centers, and manufacturing sites. This creates a decentralized approach to application design and brings with it new challenges of workload management across thousands or millions of edge nodes.

It helps to visualize edge computing as a continuum of physical infrastructure, from centralized data centers to devices. On the far right of the diagram are the centralized data centers, representing cloud-based compute. Cloud resources are practically unlimited, whereas device resources are inherently constrained. Moving along the continuum from the centralized data centers towards devices, the first main edge is the service provider edge, which is distributed and brings edge computing resources much closer to end users. Moving further along towards the devices, we have the user edge, which represents a highly diverse mixture of resources. As a general rule, the closer edge compute resources get to the physical world, the more constrained and specialized they become. That user edge space is essentially the focus for today.

As we look at workloads that benefit from running at the edge, we see things like business logic applications and network modernization. But the one I want to focus on today is the notion of stretching AI and analytics to the edge. In this scenario, a user's pre-trained models can be deployed to the edge. Additionally, customers can train models at the edge in real time, capturing data and retraining the models as they operate at the edge.

As we look at the machine learning lifecycle, we see that ML models are constantly being updated: you do your training, your retraining, and your deployment. Model deployment is a key aspect, because without deployment, how else will you consume the model and bring AI to your systems? And as it turns out, doing production-grade model deployment and inference is plagued with complexities; there are quite a number of things to consider. Here are some of the questions a user might have to navigate when considering model serving approaches.
But again, the focus for today is model serving at the edge, on resource-constrained devices: single-board computers or system-on-modules. Sometimes it's a necessity to use on-premise, distributed compute resources that are closer to end users. User edge deployment in this sense comes with many benefits. Data locality is the first one: we perform inference where the data is. That ties into quicker inference response times and bandwidth savings, since you no longer have to send your data, your inference payload, to the cloud or across some expansive network. It also has the benefits of increased security and data privacy.

But remember, we still have those complexities we want to tackle, the ones we saw on the previous slide. So how do we handle these complexities at the edge? One of the ways is, of course, Kubernetes at the edge, and that's the topic for today: we want to leverage the orchestration capabilities of Kubernetes at the edge.

As you may know, there are several ways to use Kubernetes at the edge. You can deploy a whole lightweight cluster on the edge devices using things like K3s, MicroK8s, or something called MicroShift, a relatively new project that offers a small-form-factor OpenShift designed for field-deployed, low-resource devices. Multiple edge clusters can be deployed, and you can use cluster management tools such as Open Cluster Management to manage each individual cluster. Then there's the option of having the control plane on some cloud somewhere as your main Kubernetes cluster and adding edge nodes managed by something like KubeEdge, so those edge worker nodes are available for deployment. For us, I'm going to go with edge-deployed Kubernetes clusters to keep everything at the edge.

One typical approach for deploying apps or models on the edge is to just containerize the model server and create a Kubernetes Deployment. You might mount a volume which contains your model files, your assets, and eventually you do a kubectl apply and deploy the actual containers; a rough sketch of that pattern is below. This is essentially what most Kubernetes-based model serving platforms inevitably do. However, it can be quite cumbersome, especially when you're dealing with a lot of model frameworks: TensorFlow and PyTorch each have their own images and different arguments you have to worry about. It's kind of a rabbit hole you have to jump into, and it might be a little daunting. You might also lose out on certain aspects, like the feature richness of scale-to-zero, or you might have varying inference request formats or protocols across different model servers, which can be annoying.
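Here's that sketch: a minimal single-model-server Deployment along the lines just described, assuming a TensorFlow Serving image and a PersistentVolumeClaim holding the model files. All names, images, and values are illustrative, not from the talk.

```yaml
# Minimal sketch: one model server per Deployment, model files mounted from a volume.
# Image, model name, and claim name are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: flowers-tf-serving
spec:
  replicas: 1
  selector:
    matchLabels:
      app: flowers-tf-serving
  template:
    metadata:
      labels:
        app: flowers-tf-serving
    spec:
      containers:
        - name: tensorflow-serving
          image: tensorflow/serving        # each framework needs its own image and flags
          env:
            - name: MODEL_NAME             # TF Serving loads ${MODEL_BASE_PATH}/${MODEL_NAME}
              value: flowers
          ports:
            - containerPort: 8500          # gRPC
            - containerPort: 8501          # REST
          volumeMounts:
            - name: model-storage
              mountPath: /models/flowers   # default MODEL_BASE_PATH is /models
      volumes:
        - name: model-storage
          persistentVolumeClaim:
            claimName: flowers-model-pvc
```

Multiply this by every model and every framework, each with its own image, arguments, and request format, and the operational burden described above becomes clear.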
So, can traditional model serving platforms for Kubernetes extend to edge devices? To get into that, let me first introduce KServe. KServe started off under the Kubeflow umbrella as the model serving platform for Kubeflow, and it's now an incubating project in the LF AI & Data Foundation. It's a highly scalable, standards-based model inference platform on Kubernetes for trusted AI. Traditionally you deploy it on the cloud; underlying KServe are Knative and Istio, though that's an optional layer. You can either deploy models using Knative, or you can use a raw deployment mode where models are deployed using just standard Kubernetes resources like Deployments, Ingresses, and Services.

Some of the benefits of KServe: it provides an easy-to-use interface, a CRD called an InferenceService, for deploying models given a model format and a storage endpoint. Typically you just need to provide these two items to KServe, and KServe will know what to do with them: it will find the appropriate image and load the model into the container. Another advantage of KServe is that it revolves around a standardized inference protocol. This is something the community has helped shape, with people from other model servers: NVIDIA, I believe TorchServe supports this protocol, and Seldon's MLServer. By implementing this protocol, both inference clients and servers increase their utility and portability: you can seamlessly integrate with other platforms and perform standardized inference using this protocol. Some of the main servers implementing it are Triton Inference Server, TorchServe, and MLServer, and the protocol defines standardized REST and gRPC endpoints, which I'll sketch out in a moment.

But what KServe brings isn't necessarily great for the edge as-is. There's resource overhead, because sidecars might be injected into each pod, and having an independent model server per model, a pod per model, means a lot of resource consumption. It doesn't make the best use of the already constrained resources on these edge devices. We need a way to serve multiple models in a single pod or container, and that's where KServe's multi-model serving backend, ModelMesh, comes into play.

ModelMesh is a project that was open-sourced by IBM last year and has joined the KServe organization. As users were getting hit with scalability issues with traditional model serving on Knative, we decided we needed to consolidate on an approach for multi-model serving, that is, having multiple models inside a container. ModelMesh has been running in production inside IBM for quite a few years now, and it's the backbone for quite a few Watson services: Watson Assistant, Watson Natural Language Understanding, and Watson Discovery. ModelMesh allows for multiple models per container, and it allows models to be paged out if they're not being used, or loaded just in time when a request comes in that needs them. So it has some parallels to Knative and serverless, but scoped to a single container. It strikes an intelligent trade-off between responsiveness to users and computational footprint, so I would say it makes excellent use of the resources on your cluster.

The overall architecture looks something like this. When a user applies an InferenceService YAML containing the model details, the ModelMesh controller will select a suitable serving runtime pod in which to host the model. The pods for these serving runtimes, the green and orange ones in the diagram, are the important part. These pods typically contain three containers. You have the model server container, which is typically a third-party inference server like Triton or MLServer; these support loading multiple models and have load and unload endpoints, which we can use to dynamically alter which models are deployed. Then we have the ModelMesh sidecar container, which handles the model management and both control-plane and data-plane request routing. And then we have the puller, which handles pulling models from external object stores or endpoints, so it takes care of pulling the model files into the actual cluster.
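As a rough sketch of the standardized protocol mentioned above (the KServe v2, or Open Inference, protocol), these are the core REST endpoints you would hit on a compliant model server. The host, port, model name, and tensor details here are illustrative assumptions; the gRPC service mirrors the same operations.

```shell
# KServe v2 (Open Inference Protocol) REST endpoints; host, port, model name, and tensor are illustrative.
curl http://localhost:8000/v2/health/live            # is the server up?
curl http://localhost:8000/v2/health/ready           # is the server ready to serve?
curl http://localhost:8000/v2/models/my-model/ready  # is this particular model loaded and ready?
curl http://localhost:8000/v2/models/my-model        # model metadata: platform, input/output tensors

# Inference: named input tensors, each with shape, datatype, and data
curl -X POST http://localhost:8000/v2/models/my-model/infer \
  -H "Content-Type: application/json" \
  -d '{"inputs": [{"name": "input-0", "shape": [1, 3], "datatype": "FP32", "data": [0.1, 0.2, 0.3]}]}'
```

ModelMesh primarily exposes the gRPC flavor of this same protocol, which is what I use later in the demo.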
A single Kubernetes Service at the top points to all of the pods across all of the deployments, and external inference requests are made via that service. Whichever ModelMesh pod the request hits, ModelMesh will determine where the model is actually loaded, which model server the model is actually deployed in, and will route the request as needed. As a side note, etcd is used to coordinate operations and to persist model and instance state.

Like I said, the model servers are an integral part of ModelMesh. KServe and ModelMesh have ServingRuntimes: a CRD used for defining model serving environments, which container images should be used, and what the supported model formats are. Currently, out of the box, the two main ones are Triton Inference Server and MLServer, but you can use pretty much anything: we have OpenVINO, and I think TorchServe is in the works. As long as it supports dynamic loading and unloading, ModelMesh should be compatible with it. I'll show a trimmed-down ServingRuntime example in a moment.

As mentioned, ModelMesh has been running in production cloud environments for quite a while. So the question I was thinking about is: is it tenable to bring it to Kubernetes at the edge? Let's talk about some of the advantages it might provide.

First, through KServe's InferenceService interface, users are able to easily deploy multiple models into single serving runtime containers or pods, which drastically reduces the resource overhead compared to the model-per-pod pattern that is typical with traditional model serving. Each pod typically has resource requests, and it would be quite easy to hit allocation limits when dealing with even just a few models on resource-constrained devices.

Another key feature of ModelMesh is its cache management and, to some extent, high availability. ModelMesh treats the set of pre-provisioned pods on the Kubernetes cluster as an LRU cache, and it decides which models are loaded or unloaded based on usage recency and current request volumes. If you have multiple edge nodes, a model might be scaled out to have copies on each of the nodes if there are serving runtime pods available. So if a specific model is getting swamped with traffic, ModelMesh might decide, hey, let's load this into additional serving runtime pods to create more availability for that model. But if a loaded model hasn't received any traffic in a while, a new model comes in, and the cache happens to be full, the least recently used model will be evicted from the cache, meaning unloaded from the model server's memory, to make room for the new model to be loaded and used. With this cache management, ModelMesh works well for handling unevenly distributed inference request load: you might have 20 models registered and perhaps only five are commonly used, kind of like the 80-20 rule, where 20% of the models handle 80% of the traffic.
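Here is that trimmed-down ServingRuntime sketch, loosely based on the Triton runtime that ships with modelmesh-serving. The image tag, ports, adapter settings, and resource values are illustrative assumptions rather than values from the talk.

```yaml
# Trimmed-down ServingRuntime sketch for a ModelMesh-managed Triton runtime.
# Image tag, ports, adapter, and resource values are illustrative assumptions.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: triton-2.x
spec:
  supportedModelFormats:
    - name: onnx
      version: "1"
      autoSelect: true            # lets the controller pick this runtime for ONNX InferenceServices
    - name: tensorflow
      version: "2"
      autoSelect: true
  multiModel: true                # this runtime hosts many models per pod
  grpcDataEndpoint: "port:8001"   # where the model server accepts v2 gRPC inference
  grpcEndpoint: "port:8085"
  containers:
    - name: triton
      image: nvcr.io/nvidia/tritonserver:21.06.1-py3
      resources:
        requests:
          cpu: "1"
          memory: 1Gi
        limits:
          cpu: "2"
          memory: 2Gi
  builtInAdapter:
    serverType: triton            # adapter translates ModelMesh load/unload calls for Triton
    runtimeManagementPort: 8001
    memBufferBytes: 134217728
    modelLoadingTimeoutMillis: 90000
```

An InferenceService whose model format matches one of the supportedModelFormats entries (with autoSelect on) can then be placed by the ModelMesh controller into one of this runtime's pods.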
Some additional highlights of ModelMesh are its intelligent placement and loading. ModelMesh tries to balance both the cache age and the request load across the pods, so again, it's striking a balance and trying to keep things even. And there is priority loading of models: if there is a loading queue, a model with a request waiting on it will be bumped to the front of the line. This all ensures we balance resource usage across nodes or edge devices, as well as improve responsiveness.

After having deployed models, what about day-two operations? A key aspect here is the operational simplicity of ModelMesh. It supports rolling updates automatically: if you deploy a new version of a model, the old version will continue to receive all the traffic until the new one is loaded into memory and ready for inference, and at that point the traffic is shifted.

So this all sounds great, and I wanted to try ModelMesh at the edge, and that's what I did. I unfortunately could not bring my Jetson Nano to this event, but that's a picture of it on the right. On my Jetson Nano, which has a quad-core ARM64-based processor and 4 gigabytes of RAM, I deployed MicroShift, the small-form-factor OpenShift I mentioned. It's a project currently being developed by the Red Hat Emerging Technologies group, and for those who don't know what OpenShift is, it's Red Hat's enterprise Kubernetes platform. In any case, MicroShift works remarkably well, especially on this Jetson Nano: it runs as a single binary, uses less than one gigabyte of RAM, and generally less than a single CPU core. So with MicroShift, I had an all-in-one minimal installation of Kubernetes on my Jetson Nano.

On top of this, I deployed ModelMesh as a standalone installation. ModelMesh is part of KServe, but in this case I didn't care too much about single-model serving, so I opted not to install the KServe controller and just use the ModelMesh controller, which right now is its own standalone controller. That saves resources when you're trying to squeeze the most out of your device.

With ModelMesh, I was able to deploy a mix of TensorFlow and ONNX machine learning models, which by default ModelMesh automatically maps to the Triton serving runtime, because the Triton runtime supports quite a few model formats. Performing gRPC inference was pretty fast; gRPC is ModelMesh's primary API. If a model was already loaded in memory, you get sub-second inference response times, and if it has to unload and load the model first, you might have to wait a few seconds.

After that, I wanted to try higher-density model packing, so I applied more InferenceServices to the cluster to register more models with ModelMesh. In this scenario, I deployed around 17 or so DenseNet ONNX models. These are about 36 megabytes each, and they do image classification over roughly 1,000 categories. On the right there's a simple diagram showing what's deployed on the ModelMesh side of things. I exposed the service as a NodePort, so from anywhere on my network I can send inference requests and get back responses. Again, this uses the KServe v2 protocol, gRPC specifically. Each of those models is registered with an InferenceService along the lines of the sketch below.
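A rough sketch of what one of those DenseNet InferenceServices might look like, following modelmesh-serving conventions; the model name, storage key, and object path are illustrative assumptions rather than the exact values from the demo.

```yaml
# Sketch of a ModelMesh-managed InferenceService; names, key, and path are illustrative.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: densenet-onnx-01
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh   # handled by ModelMesh, not the KServe controller
spec:
  predictor:
    model:
      modelFormat:
        name: onnx                    # matched against a ServingRuntime's supportedModelFormats
      storage:
        key: localMinIO               # entry in the storage-config secret pointing at the object store
        path: onnx/densenet-onnx-01   # path to the model within the bucket
```

Applying seventeen of these only registers the models with ModelMesh; whether a given model is actually resident in a Triton pod's memory at any moment is decided by the cache management described earlier.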
When I do kubectl get inferenceservices, I get a list of the registered models, and all of them are listed as ready, even though not all of them are necessarily loaded in memory. For those familiar with gRPC, you usually have a protobuf definition that corresponds to the API you're using, so with that I was able to generate a Python client with which I could send gRPC inference requests with arbitrary images for classification (a rough sketch of such a call is at the very end). In the sample command, I send an inference request to one of the deployed ONNX models and receive my inference results. This one is based on my cat, Mabel, who volunteered, or got volunteered, to be part of my image classification. Even with all of these models deployed, response times are still sub-second. But if it's a cache miss, because again not all of these can be loaded in memory at once, there are a few seconds of latency involved, since ModelMesh has to unload a model to make room for the target model of the inference.

So even with just a single device, an all-in-one cluster installed with ModelMesh, I was pleasantly surprised to see I was able to deploy and host quite a few models, more than I expected, actually. My control plane and my worker node are the same device, but I was still able to get quite a lot out of this installation, especially on such a resource-constrained device. With multiple devices, for instance a multi-worker-node K3s cluster, you could definitely get larger cache sizes for model loading, the opportunity for high availability with multiple copies of models loaded on different nodes, and more dedicated compute power for the model inference servers themselves.

Overall, this was a very interesting challenge for me. Since I'm a ModelMesh developer, I am looking more into how we can bring ModelMesh to other areas. Generally, when you're porting things to edge devices, you run into a lack of ARM builds and support. For instance, KServe has dependencies on some Python packages that don't necessarily have ARM support, and a lot of container images are only built for one architecture: if you look for a KServe container image on Docker Hub right now, it'll be AMD64, x86, only. Right now a lot of the images are also huge, over a gigabyte, so there's a lot of trimming down that needs to be done. That's something I took into account when I was building my own ARM-based images. And of course, when you're dealing with edge devices, there's the whole need to create smaller models using techniques like pruning or quantization; that's also something to be considered.

But anyway, if you want to learn more, I do have some links here. Definitely try out any of these technologies. I do think ModelMesh, depending on your use case, can be a tenable solution for model serving at the edge. And with that, I thank you, and I'll take any questions if there are any.
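As a final sketch of the kind of client call described above: a gRPC inference request against the ModelMesh NodePort might look roughly like this, assuming the KServe grpc_predict_v2.proto has been compiled into Python stubs. The stub module names, address, model and tensor names, and shapes are illustrative assumptions, and image preprocessing is omitted.

```python
# Minimal sketch of a KServe v2 gRPC inference call against the ModelMesh NodePort.
# Assumes grpc_predict_v2.proto was compiled to grpc_predict_v2_pb2 / grpc_predict_v2_pb2_grpc;
# host, port, model name, tensor name, and shape are illustrative assumptions.
import grpc
import grpc_predict_v2_pb2 as pb
import grpc_predict_v2_pb2_grpc as pb_grpc

channel = grpc.insecure_channel("192.168.1.50:30033")   # device address and NodePort (illustrative)
stub = pb_grpc.GRPCInferenceServiceStub(channel)

# A 224x224 RGB image flattened into FP32 values (real preprocessing omitted here).
pixels = [0.0] * (1 * 3 * 224 * 224)

request = pb.ModelInferRequest(
    model_name="densenet-onnx-01",
    inputs=[
        pb.ModelInferRequest.InferInputTensor(
            name="data_0",
            datatype="FP32",
            shape=[1, 3, 224, 224],
            contents=pb.InferTensorContents(fp32_contents=pixels),
        )
    ],
)

# ModelMesh routes the request to whichever pod currently has the model loaded,
# loading it first if necessary (the cache-miss case with extra latency).
response = stub.ModelInfer(request)
print(response.outputs[0].name, len(response.raw_output_contents))
```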