Yeah, thanks everyone for attending. My name is Rafael, and I'm a software developer at IBM. I joined IBM about two years ago as a developer on a consulting team, and a couple of months ago I joined an open tech team. If you were here for Tommy and Yee Hong's presentations before mine, I work with them, and my focus is on the projects KServe and ModelMesh, which you probably also heard Tommy speak a bit about.

The title of my talk is fairly specific, but there are a few different things I want to touch on. I'll start by talking about model serving in general and some of the considerations that go into implementing a model serving platform. Then I'll introduce the projects KServe and ModelMesh and go into more detail about some of their features, specifically ModelMesh and the use case it takes care of. Then I'll talk about monitoring, more on the ops side: monitoring a ModelMesh deployment and its resources. And then I'll get into custom runtime creation, where you bring your own custom model with certain features and deploy it on ModelMesh; that will be more of a demo towards the end.

Model serving in general is a very important part of the MLOps pipeline. Obviously it's not enough to just train a bunch of models and hope people use them. You have to build some kind of interface to them and make them consumable by the end users or applications that want to incorporate the AI you've trained into their services. Usually this looks like creating some type of API that clients and applications can make requests to in order to get what they need. That strategy is commonly described as deploying the model as a microservice: building an interface to a model and exposing it to clients as an API endpoint, whether REST or otherwise. This strategy works because many developers are already used to doing this for existing software stacks. But when it comes to applying it to models, there are a few considerations and difficulties to overcome. For example, how do we containerize a model like an application? Is our inference response time acceptable, that is, are our users and applications getting what they need quickly enough? And then of course there's the complexity that comes with all of the different frameworks, model formats and model servers out there: TensorFlow, PyTorch, NVIDIA's Triton, Seldon's MLServer, and so on.

That's where a project like KServe comes in, because KServe essentially takes care of all of these things for you. KServe is a model inference platform built entirely on Kubernetes, and what it does best is take a bunch of projects and put them into a one-stop shop for you. One thing to emphasize is that KServe uses a standardized inference protocol across many different ML frameworks. As I mentioned, there are so many frameworks, and how do you interact with all of them? There's been a big drive to find a standard across these communities, and they've worked together to do that. KServe also provides serverless inference workloads with autoscaling and features like that.
And then, as Tommy mentioned just before me, KServe comes with a bunch of pluggable metrics and features when you deploy models, especially around explainability: things like drift detection and bias detection.

Now, while KServe handles the vast majority of model serving use cases, it works for the most part on a pod-per-model paradigm, which essentially means that when you deploy a model, a pod is deployed with it; deploy another model, and that means another pod. And there's a trend among organizations to deploy larger and larger numbers of models, for a couple of reasons. One is better accuracy. You can imagine language models trained for chatbots: a chatbot for a post office might be very different from a chatbot for a bank because of the different contexts and vocabularies involved, and that's just one example of why it might be more accurate to have separate language models for each. On top of that, you might also think about supporting different actual languages and translating between them. Another reason is data privacy, where by training models separately you get to isolate user data, or different types of data, between models.

So when you take this pod-per-model paradigm and start thinking about use cases with not just thousands of models but tens of thousands or hundreds of thousands of models, you can start to see how you hit hard limitations: compute resource limits, maximum pod counts, and maximum IP address counts, since each of those model pods is assigned an IP as well.

That's where a project like ModelMesh comes in. ModelMesh is used for this specific use case of running tens of thousands of models, because the most special thing it does is break the pod-per-model paradigm and deploy multiple models across single pods. ModelMesh has actually been used in production by IBM for several years, and it continues to be the backbone of many Watson cloud services like Watson Assistant, Watson NLU and Watson Discovery, among others. It was open sourced by IBM within the past couple of years. Again, it's designed for these high-scale, high-density use cases: think tens of thousands or hundreds of thousands of small-to-medium models that are used in a fairly random fashion. The special thing ModelMesh does is intelligently load and unload models to and from memory, trying to optimize between responsiveness, so that a user who needs a model now can access it, and computational footprint, so that a model that isn't used much has minimal impact on compute.

At a high level, here are some of the features ModelMesh brings with it, which might give you an idea of how it actually works. The first is cache management. The ModelMesh controller essentially treats the set of loaded models like an LRU cache, LRU meaning least recently used. The way it loads and unloads copies of models is based on two factors: usage recency, meaning how recently a model has been inferenced on or accessed, and current request volume, meaning how heavily used or popular a model is.
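To make that loading-and-unloading idea a bit more concrete, here is a toy Python sketch of an LRU-style eviction policy driven by usage recency and request volume. This is purely illustrative: the class and field names are made up, and ModelMesh's real implementation is far more sophisticated than this.

```python
import time
from dataclasses import dataclass, field


@dataclass
class ModelStats:
    size_mb: int
    last_used: float = field(default_factory=time.monotonic)
    recent_requests: int = 0  # rough proxy for current request volume


class ToyModelCache:
    """Toy LRU-style cache: evict the least busy, least recently used model
    copies until a newly requested model fits under the memory budget."""

    def __init__(self, capacity_mb: int):
        self.capacity_mb = capacity_mb
        self.loaded: dict[str, ModelStats] = {}

    def record_inference(self, model_id: str) -> None:
        # Inference traffic refreshes both recency and request volume.
        stats = self.loaded[model_id]
        stats.last_used = time.monotonic()
        stats.recent_requests += 1

    def load(self, model_id: str, size_mb: int) -> list[str]:
        """Load a model copy, evicting colder models if needed.
        Returns the IDs of any evicted models."""
        evicted = []
        used = sum(s.size_mb for s in self.loaded.values())
        # Least busy, then least recently used, models are evicted first.
        candidates = sorted(self.loaded.items(),
                            key=lambda kv: (kv[1].recent_requests, kv[1].last_used))
        for victim_id, stats in candidates:
            if used + size_mb <= self.capacity_mb:
                break
            used -= stats.size_mb
            del self.loaded[victim_id]
            evicted.append(victim_id)
        self.loaded[model_id] = ModelStats(size_mb=size_mb)
        return evicted
```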
From there we get into intelligent placement and loading. Placement refers to the idea that a model doesn't have to live on just one pod: depending on its cache age and its request load, a heavily used model can be scaled across multiple pods and therefore be more accessible. When it comes to loading, concurrent model loads are always constrained, or queued, in order to minimize the impact on runtime traffic. And then there are the two bigger features. Resiliency simply refers to automatic recovery: failed model loads are automatically retried on different pods. Operational simplicity is all about rolling updates, which are handled automatically and safely. There's a concept of a vmodel in ModelMesh where an endpoint to a model is only updated once the new version is ready: say you have an endpoint already pointing to a model and you want a new version of that model; the endpoint will only be repointed to the new version once that version has been successfully deployed. This obviously avoids situations where you're pointing to a model that hasn't loaded successfully yet and requests are failing.

This is ModelMesh's architecture at a glance. I'll point your attention to the left, where we have the runtime X deployment; the deployment is basically a set of pods, and within each pod there are three important containers. The first is the model server, which could be NVIDIA's Triton, TorchServe, and so on; this is what actually hosts the models. Then we have the puller container, which retrieves models from S3 or some other storage, in the format that the model server, whatever it may be, expects. And then there's the mesh container, which works across pods; that's where all the logic for model networking, placement, caching and so on lives. Something else ModelMesh does is bring its own etcd instance, which holds metadata about the models: for example, their sizes, how long they've been around, and things like that.

So that gives you a bit of background about ModelMesh, some of its features, and a bit about how it works. Now I'd like to get into monitoring. Maybe this is a little obvious, but there are good reasons for monitoring a model serving framework or platform, especially in a high-density model serving environment: looking at resources and usage, and at metrics like cluster utilization, model sizes, whether models are loading, inference request rates, and so on. By tracking these metrics, you can react and adjust resources or take other actions as needed. For example, you might increase cluster capacity if you're nearing it, or debug models that you can see are failing to load.

ModelMesh actually makes it fairly easy to monitor all of the pod metrics, because the pods expose their metrics via an HTTP endpoint. At a high level, these are the three steps you'd go through to set up a dashboard and get access to the various ModelMesh metrics. The first is to set up the Prometheus operator.
Prometheus is essentially a monitoring system, and the operator is a Kubernetes-native managed version of it; this will monitor the namespace in which ModelMesh is deployed. From there, you simply create a ServiceMonitor, a custom resource definition provided by Prometheus, which will discover all of the relevant pods and scrape the metrics they expose. And then you access Grafana, which is a dashboard visualization system, and import the JSON template dashboard that's provided in one of the repositories. I reference a few things that are linked at the very end of my presentation.

Here's an example of what that Grafana dashboard might look like, and this is just the first half. There's a lot of data here and I won't go through all of it, but this is actually a sub-component of something at IBM that's in production. First, you can see cluster capacity and utilization right at the start. On the left we can see model counts, and something interesting there is that in green there are 60,000 managed models, while in yellow there are only about 5,000 loaded models. The difference is that while there are 60,000 accessible models, the loaded models are the ones actually in cache, in memory, ready to be accessed or inferenced upon. On the right, it's a bit messy, but model counts per pod at the very least shows all of these different colors for different pods, and how dynamically the number of models on a pod changes over time; that's some of what ModelMesh is doing behind the scenes. Then on the bottom half we have metrics about API request rates, request sizes, response sizes and so on.

On the second half of the dashboard we have some more API metrics, and then cache miss rates: a cache miss is essentially when a model is inferenced upon but isn't yet loaded, and that's a useful stat for whoever is monitoring this. At the bottom left we can see model loads and unloads per five minutes; it's fairly symmetrical, but you can see many loads and evictions from pods over time. You'd also see loaded model sizes and model loading times. And Grafana, if any of you have experience with it, is incredibly versatile and customizable, so this is not exactly how your dashboard needs to look, nor are these the only metrics you can see or the only ways you can visualize them. So that is monitoring on ModelMesh.
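Once Prometheus is scraping the ModelMesh pods, you can also pull the same numbers programmatically rather than only through Grafana. Here's a small sketch that queries the Prometheus HTTP API; the port-forward command, service name and metric name are illustrative assumptions, so check the bundled dashboard JSON for the exact metric names your deployment exposes.

```python
import requests

# Assumes Prometheus is reachable locally, e.g. via:
#   kubectl port-forward svc/prometheus-operated 9090 -n <monitoring-namespace>
PROMETHEUS_URL = "http://localhost:9090"

# Illustrative query only: check the bundled Grafana dashboard JSON for the
# exact ModelMesh metric names exposed by your deployment.
QUERY = "sum(modelmesh_models_loaded_total)"

resp = requests.get(f"{PROMETHEUS_URL}/api/v1/query",
                    params={"query": QUERY}, timeout=10)
resp.raise_for_status()
for sample in resp.json()["data"]["result"]:
    # Each sample is {"metric": {...labels...}, "value": [timestamp, "value"]}
    print(sample["metric"], sample["value"])
```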
Now we'll get into serving runtimes. A serving runtime is essentially a template for a pod that can serve one or more model formats. Thinking back to the architecture, this is the pod with the model server, puller and mesh containers inside it, and it's what actually does the hosting and inferencing of a model. Out of the box we have integrations with the following model servers: NVIDIA's Triton Inference Server, Seldon's MLServer, OpenVINO Model Server and TorchServe. But there are situations where the out-of-the-box integrations might not be enough. We're working on bringing more in, of course, but there are a few reasons why you might want to create your own custom runtime, and ModelMesh makes it easy to do that. The first is if the framework isn't supported yet. You might also need extra custom functionality beyond the out-of-the-box integrations, say a custom model that does a bit more than the stock frameworks do, or you may have custom logic for inferencing. The serving runtime CRDs, or custom resource definitions, in ModelMesh allow for this extensibility, and you never have to modify any controller code; that's something I'll go into in more detail and show you exactly how to do. On top of that, if the desired custom runtime is Python-based or uses Python bindings, it's even easier to create your own custom serving runtime, because you can use MLServer to provide the interface, bring your own model code, and use ModelMesh as the glue that binds them together and deploys your model.

Here's a high-level, step-by-step view of how to build your own Python-based custom serving runtime. The first step is to prepare your model and interface it with MLServer, and I'll show an example of this. The second is to package your model class, or the model itself, and its dependencies into a container image, similar to an application; this is essentially the server that's going to be hosting your model. And lastly, you create the new ServingRuntime resource, which points to that custom container image and includes any environment variables you need.

First of all, is anybody here watching or following the World Cup? Japan has been... I mean, I think last night was a sad day if you know, but I made this demo a week or two ago with the assumption that Japan would not yet be out, so bear with me. Here I have a model, and it's just an example: my trivial model simply takes a JSON input, parses it, spits back out what's included in it, and then adds a server response; you'll see that at the end. Here we have an example of what the MLServer interface looks like: this implements MLServer's model class, and we can see two main functions. There's a load function, which actually grabs the model from somewhere, and a predict function, which goes through the inputs, does some work, and produces the outputs. Again, my model is trivial, it simply reads a JSON and spits it out, but we're going to pretend this model is predicting the winner of the World Cup, which is why I call it a soccer results model.
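For reference, a minimal MLServer model class along those lines might look roughly like this. It isn't the exact code from the demo: the class name, input handling and canned "prediction" string are placeholders, and the precise MLServer signatures can vary slightly between versions.

```python
import json

from mlserver import MLModel
from mlserver.types import InferenceRequest, InferenceResponse, ResponseOutput


class SoccerResultsModel(MLModel):
    """Trivial MLServer model: parse a JSON payload, echo it back, and append
    a canned 'prediction' string."""

    async def load(self) -> bool:
        # A real runtime would read model artifacts here, typically from the
        # local path that the ModelMesh puller downloaded them to.
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Assume a single BYTES input whose first element is a JSON string.
        raw = payload.inputs[0].data[0]
        parsed = json.loads(raw)
        result = {
            "echo_request": parsed,
            "server_response": "Japan will win the World Cup",
        }
        return InferenceResponse(
            model_name=self.name,
            outputs=[
                ResponseOutput(
                    name="output-0",
                    shape=[1],
                    datatype="BYTES",
                    data=[json.dumps(result)],
                )
            ],
        )
```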
Now this is the custom serving runtime definition, and there are a few important things to point out. First of all, its kind is ServingRuntime, and I've named it soccer-runtime; I'm pretending, essentially, that this runtime can only run models that are soccer related. Then we have a supported model format, which tells ModelMesh which model formats can be deployed using this serving runtime. I've made my own here, which I named soccer-models, since again I only want soccer-related models deployed here, but this could just as easily be something like PyTorch's MAR or scikit-learn, whatever model formats are expected. Further down we have the spec for my custom image, the image I created earlier that containerizes my model and all of its dependencies, and then we'd include environment variables if necessary, depending on the runtime or framework we're bringing in. And that's really how simple it is to create a custom serving runtime; this is all you need.

But we can obviously go further and actually deploy a model on it, and at a high level these are the steps you'd take. First, when deploying any model with any serving runtime on ModelMesh, the model has to reside on some shared storage; I mentioned in the architecture that the puller goes to storage, and you end up pointing to where in storage it needs to pull or retrieve the model from. Second, we need to create an InferenceService to serve the model. An InferenceService is the main interface, in both KServe and ModelMesh, for managing models; it's more or less interchangeable with the model itself, and it represents the logical endpoint for serving inferences. Third, once you've created that InferenceService, you can check its status. One extra thing here: even if you've created your serving runtime, if no models are deployed on it yet, pods won't actually be spun up until there's a need for them. So you have your serving runtime, you deploy your InferenceService, and only then will pods be spun up with a deployment of your model.

This is an example of that InferenceService. I've named it soccer-results-predictor, because it's essentially my model. In the InferenceService you explicitly write which format the model is in, and this is what ModelMesh uses to find which runtime to deploy this InferenceService with; here I write soccer-model, and as we saw before, my custom serving runtime supports soccer models. Then we have a spec for the runtime, where I explicitly tell ModelMesh that I want this InferenceService deployed using the soccer-runtime; this is optional, though, because ModelMesh will go through the runtimes and find one that supports the model format you've written. And lastly, we have the pointer to the model itself. In this case I'm using a local MinIO instance, because if you go through the documentation and use the quick-start version of the ModelMesh deployment, it comes with local MinIO storage loaded with a ton of example models, and I went into that storage and threw my model on there as well.
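The demo applies these resources as YAML manifests, but for completeness, here's a rough equivalent of that InferenceService built with the kserve Python SDK instead. The deployment-mode annotation is how ModelMesh is targeted; the namespace, resource names and storage URI below mirror the demo but are illustrative (the demo itself points at the local MinIO storage key rather than a storage URI).

```python
from kubernetes import client
from kserve import (
    KServeClient,
    V1beta1InferenceService,
    V1beta1InferenceServiceSpec,
    V1beta1PredictorSpec,
    V1beta1ModelSpec,
    V1beta1ModelFormat,
)

# Rough equivalent of the demo's InferenceService manifest.
isvc = V1beta1InferenceService(
    api_version="serving.kserve.io/v1beta1",
    kind="InferenceService",
    metadata=client.V1ObjectMeta(
        name="soccer-results-predictor",
        namespace="modelmesh-serving",
        annotations={"serving.kserve.io/deploymentMode": "ModelMesh"},
    ),
    spec=V1beta1InferenceServiceSpec(
        predictor=V1beta1PredictorSpec(
            model=V1beta1ModelSpec(
                # The format ModelMesh matches against supported runtimes.
                model_format=V1beta1ModelFormat(name="soccer-model"),
                # Optional: pin to a specific runtime instead of auto-matching.
                runtime="soccer-runtime",
                # Illustrative path; the demo used the quick-start MinIO storage.
                storage_uri="s3://modelmesh-example-models/soccer/model.json",
            )
        )
    ),
)

KServeClient().create(isvc)
```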
So at this point I can go to my terminal and show you what this looks like in practice. This is just a local environment, a local deployment of ModelMesh, and here we can see the basic pods that come with it: the etcd instance, the MinIO instance and the ModelMesh controller. Once we deploy our models we'll of course see more pods, but right now there are no models deployed. I have a custom soccer runtime spec here, very similar to what I showed before, if not exactly the same. Again, this is my custom serving runtime, I've named it soccer-runtime, I tell it that the only model formats it supports are these soccer models, and then there are some environment variables, because that's what you have to pass for MLServer to work with ModelMesh; there's more information about this in the documentation itself. And of course I'm pointing to the image where I've containerized my model. So we'll create this resource, and now when I get the serving runtimes on this instance, we see first of all that there was already a TorchServe runtime ready to go, supporting the PyTorch MAR model type, but now we can also see that I've deployed this runtime called soccer-runtime, the model types it takes, and the container it's built on, which reflects that it's using the MLServer interface.

Now we can look at the InferenceService, because our serving runtime is ready and we can attempt to deploy a model on it. The InferenceService is, again, very similar if not identical to what I showed before: it's a model I've called soccer-results-predictor, we have the model format made explicit, and the runtime made explicit to tell ModelMesh to only deploy this InferenceService on that runtime; otherwise it would look for a runtime that supports that format. I'll create this resource as well, and now when we get our InferenceServices, the first thing we see is that we have this InferenceService called soccer-results-predictor, but it's not yet ready. If we describe it, we can see a status of not ready and a message that's used for debugging purposes; in this case it's waiting for the runtime pod to become available. That's expected, because as I said before, even though we've created a serving runtime, it doesn't spin up pods to deploy models onto until we've deployed an InferenceService that requires them, and that's what's happening now. If I look at my pods now, we can actually see two soccer-runtime-related pods; the reason there are two is that ModelMesh defaults to deploying two pods per serving runtime, and now they're running. So if we look at our InferenceServices again, we can see that the InferenceService is now ready, and we can also see the endpoint given to it. One more time, if we describe the InferenceService, we now see different information, for example that the active model state is loaded and ready to go. So now we'll simply port-forward the endpoint, and it should be listening and ready for a request. In this other tab I have a Python script that I pre-made.
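The script I walk through next is roughly along these lines. This sketch assumes the KServe V2 REST inference protocol over a local port-forward of ModelMesh's REST proxy (port 8008); the actual demo may just as well have used the gRPC port (8033) with generated V2 stubs, and the payload fields simply mirror my trivial JSON-parsing model.

```python
import json

import requests

# Assumes the ModelMesh REST proxy is port-forwarded locally, e.g.:
#   kubectl port-forward svc/modelmesh-serving 8008 -n modelmesh-serving
BASE_URL = "http://localhost:8008"
MODEL_NAME = "soccer-results-predictor"

# KServe V2 inference protocol: one BYTES input carrying a JSON string.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1],
            "datatype": "BYTES",
            "data": [json.dumps({
                "name": "world cup winner",
                "message": "who will win the world cup",
            })],
        }
    ]
}

resp = requests.post(f"{BASE_URL}/v2/models/{MODEL_NAME}/infer",
                     json=payload, timeout=10)
resp.raise_for_status()
print(json.dumps(resp.json(), indent=2))
```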
The script essentially targets the model I created, the soccer-results-predictor, so I'm saying I want to use this model, and here are my inputs. Again, the model I've created is a trivial one that simply parses a JSON and spits out some information, and here's the JSON I want parsed: there's a name, "world cup winner", and a message, "who will win the world cup". Then we create the inference request, send it, and make the inference. If we run this, we can see our result. The first part of the output is just metadata about the input and the response, and then we see the contents, which is the most important part. First there's the echo request, which is my model, again very trivial, simply reading back what the incoming data was: it says the name was "world cup winner" and the message was "who will win the world cup", which we know is correct. And then there's the server response. Again, I made this demo before last night, but the server response is that Japan will win the World Cup, so clearly this model needs retraining, or it's biased in some way.

Hopefully this shows you a bit of an end-to-end flow. I've taken a custom model that I wanted, simply a JSON parser; ModelMesh didn't have a serving runtime I could deploy that model on, so I went and created this custom runtime for soccer models, then created the InferenceService for my model, deployed it, and now it's ready to make inferences on.

Included in the slides are some of the commands I ran. I'm nearing the end of this presentation, but I have some links that point to, for example, the repositories for the projects I've talked about. There's also a repository for monitoring and performance testing, if you're interested in the numbers on how ModelMesh actually scales when it's under load with hundreds of thousands of models and so on. If you're interested in communicating with us, feel free to join the Kubeflow Slack; we have a channel called kserve, which is also for ModelMesh-related questions. And then of course my contact information, and at the bottom is a colleague who couldn't be here today, but either of us are open to questions. Hopefully you learned a bit about ModelMesh and KServe, and I'll take any questions if there are any, but that's the end of my presentation. Awesome, okay, thank you everyone.