I'm Rafael. I'm a software developer at IBM, and I've been there for a couple of years. A bit of my background before IBM: I was a data scientist. I don't know if anyone here is from Canada, but I worked for a large grocery company — maybe they're under fire a lot in Canada — where I was a data scientist doing data analysis and modeling on customer behavior. They have their loyalty program, so you can see what products people are buying, and we did things like recommending products to certain customers in order to grow their baskets, and so on. Really, I think most of the time I was just looking at my own data and seeing how many bananas I was buying on a weekly basis and not getting through them all.

But something I learned there — and I bring it up because they were maybe early in their data transformation journey — is this: they hired a ton of data scientists like myself, and essentially we had a bunch of teams and everyone was working on different projects, but models were basically trained and operationalized locally. So if somebody wanted my model's results the next week, to feed into the recommender system and recommend a product to a customer, I would run it at the end of the week, hope the results were there on Monday, and then send that spreadsheet to someone. I bring that up because it's interesting that I ended up transitioning to IBM, where I was a software developer, and within the past few months I joined the open technology team, who I speak for now, and I've been dedicated to this project called ModelMesh — it's actually kserve/modelmesh — and you'll learn a bit more about it, because it's all about model serving. That's basically what I'll be talking about today.

A quick outline of what I'll touch on: I'll talk about model serving in general, give a bit of background and some of the considerations one has to keep in mind. Then I'll introduce the projects KServe and ModelMesh — I'm not sure, has anybody heard of KServe or ModelMesh? Okay, so there's some familiarity, that's good. So there'll be an introduction giving some background on those projects, how they're intertwined, and a bit of their history. Then I'll dive a bit deeper into ModelMesh itself and give an example of how to monitor a ModelMesh deployment using Prometheus and Grafana — you'll see a dashboard there too. And at the end we'll talk about bringing your own custom models into ModelMesh, extending some of the integrations we have there, and how you would deploy a model — there's a bit of a demo there too.

So model serving, as most of us probably know, is a very important part of the machine learning life cycle, or of the pipeline, and there's a lot more involved in it than I originally thought a couple of years ago when I worked as a data scientist, thinking that if I trained my model and sent my results to someone, I was technically serving my model and that was great. There's a lot more to it. It's the idea of deploying trained models in production, similar to an application or other software, and making them available to consumers or to the application, in order to actually incorporate that trained AI into the service I'm deploying. And one of the most common model serving strategies out there right now is deploying models as microservices, and there are a few reasons for it.
One of the biggest reasons is probably just the familiarity many software developers have with deploying their existing software stacks in this style. Basically, we can think of a model as an application: we put some kind of interface around the model and expose it via an API, to make it easy to send requests to and inference upon. But there are tons of considerations that come with this serving strategy, some of which are listed here, like containerizing the model — how do we do that? There are also questions about cost and resource efficiency: are our models over-scaled or under-scaled, are we allocating resources to each of those models efficiently? How do we handle rollouts? In machine learning there's a lot of experimentation, so this happens often: if I want to update a model version, how do I point the service to the new model, and how do I revert it if I realize that model isn't working? Can I put a model out there without directing live traffic to it? And then there's the question of handling different frameworks and model formats — there's TensorFlow and PyTorch, there's Triton Inference Server and MLServer — do we have to rewrite the interface for every model we write if we want to deploy it using some other runtime, and so on?

That's where the project KServe comes in. KServe aims to answer all of these questions; it's meant to be a one-stop-shop platform that takes care of all of those responsibilities for you. It's built entirely on Kubernetes, making it serverless, without vendor lock-in, and so on. There's also a big emphasis in KServe on standardization, meaning that the runtimes or servers — MLServer, Triton, TorchServe — all implement the standardized inference protocol that KServe provides. So if I want to deploy a model on a different server, that server has already implemented the same inference protocol, and there's no need to rewrite the interface or anything like that. KServe does a lot more than that too: there's pluggable support for things like pre- and post-processing scripts, monitoring not just at the ops level but also at the model level — things like drift detection or bias detection — and of course model explainability as well. I don't know if anybody's familiar with Kubeflow, but KServe used to live under the Kubeflow umbrella, where it was called KFServing; a couple of years ago it was pulled out of Kubeflow, made into its own independent project and renamed KServe. So that's a bit of history about the project as well.

One thing about the way KServe is built is that it follows a paradigm known as pod-per-model. If I want to deploy a model, that means KServe deploys a pod, and inside that pod is the runtime as well; maybe there are other pods associated with that one model too. And you can start to imagine that if I want to deploy multiple models, that means more resources and more pods being deployed, and so on. At the same time, there's a trend in the field toward deploying larger numbers of models, for reasons like better accuracy and even to protect privacy.
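Just to make the KServe side concrete before we get to that many-models story: deploying a single model is roughly a one-resource affair. This is only a sketch — the model format and storage URI here are illustrative, not taken from the talk — but the point is that targeting a different runtime is mostly a matter of changing the declared model format rather than rewriting any serving interface:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-demo                      # illustrative name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                          # e.g. onnx, pytorch, xgboost would select a different runtime
      storageUri: s3://my-models/sklearn/iris  # wherever the trained model artifact lives
```

KServe matches the declared format against the runtimes that support it and exposes the model through the standardized inference protocol.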
On the accuracy side of that many-models trend, think of something like a news classification service: you might want a model per news topic, to narrow the scope, as opposed to one big model that tries to do it all. Protecting privacy has to do with separating data — being able to train models on isolated data. Say, where I used to work at the grocery giant, they wanted models at the user level; in order to avoid data leakage and to protect privacy between users' data, you isolate it and train models on just that particular slice of data.

But with this pod-per-model paradigm, you start to run into limitations that are fairly straightforward: compute resource limits, the maximum-pods limit — you run up against Kubernetes best practices when you start to think about thousands of models, which means thousands of pods — and the same with IP address limits, because each of those pods needs an address assigned to it.

The answer to these limitations is where ModelMesh comes in. ModelMesh is interesting because it has actually been in production at IBM for maybe the last eight or nine years now, and it continues to serve as the backbone for a lot of Watson products, such as Watson Assistant, Watson Speech to Text, and so on. It was created because, in the case of something like Watson Assistant — if anyone's familiar with it — you can essentially create a chat bot for your store or whatever it is, a speaking agent. Say I had a flower shop and I wanted a little chat bot to automate calls asking what kinds of flowers I sell: I can quickly go into Watson Assistant, create that speech pattern, type in what kinds of flowers I sell, and Watson Assistant will do the rest. That's just one small model for my small flower shop, but now imagine thousands of people who all want to make their own flower-shop language models. Instead of one very large language model, you have lots of maybe kilobyte-sized language models, each specific to one flower shop. So how would you actually orchestrate all of these models? That's the idea behind ModelMesh: it's designed for this specific use case of high-density, frequently changing models — high density meaning thousands, tens of thousands, hundreds of thousands of models, and frequently changing meaning that not all of those models are popular or in use at any given moment; there might be certain times of the year when my flower-shop language model is requested often, and other times not so much.

So ModelMesh intelligently loads and unloads models to and from memory, and that's really the main thing it does: it can handle tens of thousands or hundreds of thousands of models, but not all of them are in memory. ModelMesh is the brain of the operation, optimizing between responsiveness to users and its computational footprint, and I'll go into a bit more detail about how it does that here. The main idea is cache management: pods are managed as a distributed LRU cache — least recently used — where models are loaded and unloaded based on two main variables, usage recency and current request volume.
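Just to give a rough flavor of that idea, here's a toy sketch in Python — not ModelMesh's actual logic or code, purely an illustration — of a bounded cache that loads a model on first use and, when it needs room, evicts the loaded model with the least recent traffic:

```python
import time
from dataclasses import dataclass, field


@dataclass
class ModelEntry:
    model_id: str
    last_used: float = field(default_factory=time.monotonic)
    recent_requests: int = 0  # rough stand-in for current request volume


class ToyModelCache:
    """Toy LRU-style cache: at most `capacity` models are "in memory" at once."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.loaded = {}  # model_id -> ModelEntry

    def predict(self, model_id, payload):
        entry = self.loaded.get(model_id)
        if entry is None:                  # cache miss: make room, then "load" the model
            self._make_room()
            entry = ModelEntry(model_id)   # stand-in for actually loading model weights
            self.loaded[model_id] = entry
        entry.last_used = time.monotonic()
        entry.recent_requests += 1
        return f"prediction from {model_id} for {payload!r}"

    def _make_room(self):
        if len(self.loaded) < self.capacity:
            return
        # Evict the model with the least recent traffic, breaking ties by recency.
        victim = min(self.loaded.values(),
                     key=lambda e: (e.recent_requests, e.last_used))
        del self.loaded[victim.model_id]   # stand-in for unloading the model


cache = ToyModelCache(capacity=2)
print(cache.predict("flower-shop-en", {"text": "what roses do you sell?"}))
print(cache.predict("flower-shop-fr", {"text": "quelles roses vendez-vous ?"}))
print(cache.predict("bakery-en", {"text": "do you have rye bread?"}))  # forces an eviction
```

The real system also has to worry about model sizes, loading times and copies across pods, which is what comes next.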
So you can think of it this way: a model which has not been used recently — in other words, not inferenced upon recently — and has no request volume whatsoever would simply be taking up space in memory, so we can unload that model; then, as the request volume for that model rises again, we can load it back in and it can stay there. Then we get into intelligent placement and loading, which is about ModelMesh being able to scale models out across pods as well: if a model has a large request volume, ModelMesh can place several copies of that model to handle those requests. And there are other things baked in, like resiliency and simplicity: failed model loads are automatically retried in different pods, and rolling updates — the rollouts I mentioned earlier — are handled automatically. There's this idea of an InferenceService in KServe and ModelMesh, which is essentially the mapping from a model to the pod; that InferenceService is mapped to the model, or to something called the VModel, and if you want to update that model or create a new version, ModelMesh makes sure the new version is completely loaded before the InferenceService directs any new traffic to it.

This is an overview of ModelMesh's architecture, and there are a few things to look at here. For every type of model I deploy — really, for every serving runtime — there is a runtime deployment, which is what you see in light gray: imagine that the runtime X deployment is a TorchServe deployment and runtime Y is Triton. Each of those deployments has a couple of pods, which is just the default number it comes with, and the letters inside the model server container are the models. You can see there are copies across pods — maybe B and C are more heavily used than A, D and E on the left there. Then there are a couple of other sidecar containers: for example the puller, which is specific to the type of runtime or the type of model you're deploying, and contains the logic for pulling the model from storage into the pod. And then we have the mesh sidecar, which is the brains of the operation, doing the routing, the caching and the placement of models. You can also see on the right that ModelMesh brings in its own etcd instance, which contains metadata about the models — that's where it gets information like model sizes or model IDs and so on.

So that's a bit about KServe and ModelMesh, and now I'll show an example of monitoring a ModelMesh instance and what that might look like. Here, in general, is some of the reasoning for why we want to monitor the state of resources and usage, especially in a high-density model serving environment. This isn't necessarily monitoring the model itself — its accuracy, its drift or anything like that — but more the ops side: things like cluster utilization, model sizes and model loading times. And of course, by monitoring these metrics, you can react and make changes to the resources if necessary. Especially when you're talking about hundreds of thousands of models, it's usually not just one model that's failing: if you see models failing, it's probably an entire type of model, and multiple of them, and that would need debugging and so on.
ModelMesh comes with metrics that are exposed via an HTTP endpoint, and here are some high-level instructions on how you would set up a Grafana dashboard to take advantage of the metrics ModelMesh emits. You would use Prometheus, the open source monitoring framework, to monitor the namespace in which ModelMesh is deployed. You create a ServiceMonitor, which is a resource that Prometheus provides, and it simply scrapes the metrics at the endpoint ModelMesh exposes. Then you access Grafana, the visualization tool that's usually used hand in hand with Prometheus, and ModelMesh's repository actually ships an out-of-the-box JSON template for a dashboard.

Here I have an example of what that might look like — I think it's probably visible. This is just the first half, and there are actually more updates coming to this dashboard; I think they'll be released within the next few weeks, and I'll talk a bit about that. At a glance you can see things like cluster capacity and utilization at the top, and the number of pods. On the left side, second from the top, you see model counts, and this is interesting because you can see in green that 60,000 models are managed, which just means there are 60,000 models that ModelMesh is aware of and has registered, but the number of models actually loaded into memory is maybe closer to 5,000 or so. And that's the idea: there are 60,000 models ready to be inferenced upon, and as soon as I want to, ModelMesh will load a model in if it hasn't already been loaded into memory. On the right, that messy graph is just the dynamic nature of models being loaded in and out of pods over a few days. The bottom half of this dashboard is essentially inference request rates, response sizes and so on.

In the second half of the dashboard we see some more inference API request and response information, but in the middle we see the cache miss rate. This one is interesting because it captures the delay when a request hits a model that isn't loaded into memory — one of those managed-but-not-loaded models. If I make an inference request against one of those, that counts as a cache miss, and we can see that the slowest one took close to 1.5 seconds to actually be loaded into memory, inferenced upon, and have the response sent back. On the bottom left we see the model loads and unloads over time, and we see sizes and loading times. In the next few weeks there will be an addition to this dashboard — at least to the out-of-the-box one, because Grafana is completely customizable, so you can make it look however you want — where we added partitioning by serving runtime. So if you want to focus on just your TorchServe resources, say — however much resource you've dedicated to TorchServe, or MLServer, or Triton — you'll be able to see all of these metrics partitioned by serving runtime.

Which brings me to serving runtimes in general. A serving runtime is essentially the template for pods that serve one or more model formats, and right now ModelMesh supports these four out of the box: Triton, MLServer, OVMS and TorchServe.
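Going back to the monitoring setup for a second, a minimal ServiceMonitor along those lines might look roughly like this. It's a sketch: the namespace, label selector and port name are assumptions you would check against the Service that ModelMesh creates in your cluster.

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: modelmesh-metrics-monitor            # illustrative name
  namespace: modelmesh-serving               # the namespace ModelMesh is deployed in
spec:
  selector:
    matchLabels:
      modelmesh-service: modelmesh-serving   # assumed label on the ModelMesh Service
  endpoints:
    - port: prometheus                       # assumed name of the metrics port on that Service
      scheme: https
      path: /metrics
      interval: 30s
      tlsConfig:
        insecureSkipVerify: true             # fine for a quick experiment; use proper TLS in production
```

Once Prometheus is scraping that endpoint, you just import the dashboard JSON from the ModelMesh repository into Grafana.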
But there always exist cases where the out-of-the-box integrations aren't enough. For example, the framework might not yet be supported; maybe you have a custom model that isn't an easy scikit-learn or XGBoost model that you can deploy on one of those integrations; maybe you have some custom logic for inferencing. That's the nature of KServe and ModelMesh: the CRDs allow for extensibility, and you never have to touch the ModelMesh controller code or anything like that. And it's especially easy if the desired custom runtime is written in Python, because MLServer can be used to provide the serving interface: you take your custom model and wrap it in MLServer's interface, and that becomes the glue between your custom model and ModelMesh. That's exactly the example I'll go into in a bit more detail here.

Again, some high-level instructions: you prepare your model and interface it with MLServer, which I'll show you in a second; you containerize that into your own custom image, which includes the model and all of the dependencies it needs to run; and then you create that new custom ServingRuntime resource. Serving runtimes, by the way — at least in the latest release of ModelMesh — can be either cluster-scoped or namespace-scoped. A cluster-scoped serving runtime is available to the entire cluster, so if you have multiple ModelMesh deployments they all have access to it, while a namespace-scoped one is scoped only to a specific namespace.

Here's an example of what I'd consider an implementation of a custom Python model. I have a model called the tourism activities model, and here I'm implementing MLServer's MLModel class — since this is a Python-based model, that's the easiest way to get a custom model into ModelMesh. All you have to do is implement the load and predict functions: the load function loads my model from wherever it's saved, and the predict function does the inferencing and sends out the response however it needs to (I'll show a rough sketch of what that looks like below). I will say that my model is really no model at all — it's simply a JSON parser. If I send it any JSON, it spits back what it parsed, but it also throws some hard-coded information into the response.

Then we have our custom runtime definition. You can see the custom runtime's name is the tourism runtime, and I call it that simply because I only want this custom runtime of mine to run tourism models. It could be an MLServer runtime or a TorchServe runtime, but "tourism runtime" is how I'm naming mine. Then we have the supported model formats. This is important, because this is how ModelMesh knows which runtimes to use when deploying a model: my tourism runtime supports only the tourism-model format, so it only takes tourism models. This could also be scikit-learn or something like that. Then of course we point it to the custom container image I created earlier, which holds my model and all of its dependencies, and there's room for environment variables, depending on the runtime, if it requires them. And then, when it comes to deploying the model, the model of course has to be on some shared storage, and we create something called an InferenceService to serve the model.
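Here's a rough sketch of what that tourism model class can look like as an MLServer custom model. It's not the exact code from the demo — the codec helpers and input handling here are assumptions that can vary between MLServer versions — but it shows the two methods you implement:

```python
import json
from typing import List

from mlserver import MLModel
from mlserver.codecs import StringCodec
from mlserver.types import InferenceRequest, InferenceResponse


class TourismActivitiesModel(MLModel):
    """A stand-in "model": it parses the incoming JSON and answers with a
    hard-coded list of activities, but the structure is what matters."""

    async def load(self) -> bool:
        # A real model would be deserialized here, e.g. from the artifacts the
        # ModelMesh puller placed on local disk for this pod.
        self._activities = {"Vancouver": ["Stanley Park", "Vancouver Aquarium"]}
        self.ready = True
        return self.ready

    async def predict(self, payload: InferenceRequest) -> InferenceResponse:
        # Decode the first input as a string and treat it as a JSON document.
        raw: List[str] = StringCodec.decode_input(payload.inputs[0])
        request = json.loads(raw[0])
        city = request.get("city", "Vancouver")
        reply = {"request": request, "activities": self._activities.get(city, [])}

        return InferenceResponse(
            model_name=self.name,
            outputs=[StringCodec.encode_output("activities", [json.dumps(reply)])],
        )
```

MLServer then serves this class over the standard inference protocol, which is what lets ModelMesh treat it like any other runtime.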
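And here's roughly the shape of the two resources that go with it — the custom ServingRuntime and the InferenceService. This is trimmed down, and the image, storage key and path are illustrative; the real runtime definition has a few more fields (ports, adapter and resource settings), and the modelmesh-serving docs have a complete example:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: tourism-runtime
spec:
  supportedModelFormats:
    - name: tourism-model        # how ModelMesh matches models to this runtime
      autoSelect: true
  multiModel: true               # the runtime is managed by ModelMesh and serves many models
  containers:
    - name: mlserver
      image: quay.io/example/tourism-mlserver:latest   # your custom MLServer-based image
---
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: tourism-activities-recommender
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: tourism-model      # lets ModelMesh pick a runtime that supports this format...
      runtime: tourism-runtime   # ...or pin the runtime explicitly (optional)
      storage:
        key: localMinIO          # key into the storage-config secret (assumption)
        path: tourism/model.json # illustrative path to the model in that bucket
```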
I mentioned the InferenceService a bit before, but the InferenceService is basically the mapping from where the model actually resides to the pod itself, and, as it says here, it represents the model's logical endpoint for serving inferences. Then we would of course check its status, and I'll show a bit of a demo of what that might look like.

And here's our InferenceService. We have the model name — I named it the tourism activities recommender — and we have the model format, the tourism model. Again, this is important because this is how ModelMesh knows: "this format is a tourism model, so I have to go look at the available runtimes and see which ones support that." In this case, though, we also have the option in the spec of me explicitly telling ModelMesh that I only want this InferenceService deployed on the tourism runtime. That's optional, because ModelMesh is smart enough to look for which runtimes support this model format. And of course we're pointing to where the model is actually located.

So with that, I can pull up a terminal and do a lot of this live, just to see what it looks like. Okay, here I have a ModelMesh Serving instance already deployed, and we can see our three main pods: the controller; a local MinIO, which is where my models are sitting; and the etcd that I mentioned before, holding metadata about any of the models being managed. If I look at my serving runtimes, we can see I actually have four available to me. We have the OVMS runtime, and we have the model types — this is explicitly telling you which model formats they support, so we can see that our Triton runtime supports these Keras model types — and then which containers they're built upon, so we can see that my tourism runtime, as expected, is built on this MLServer container.

Now, if we take our InferenceService — and I'll quickly show you that this InferenceService is exactly as I showed in the slides — and deploy it, we'll see that two pods have spun up for this tourism runtime. They're running pretty quickly here, but sometimes it of course takes more time. And when we look at our InferenceService, in this case we can see that it's actually ready and it has its endpoint. In another case, where the pods take more time to spin up, we would see the ready condition set to false, and if you were to describe the InferenceService it would tell you the reason — for example, waiting for pods to spin up, or it might say the model failed to load, maybe because it couldn't be found, and so on. But in this case our pods spun up very quickly — again, two is the default — and we have this InferenceService pointing to my model in storage, ready to be inferenced upon. So I'll just port-forward it, and then here I have a script which is simply making a gRPC request, sending some JSON payload to the model, getting its response and printing it out. Here's some metadata about the request itself, and this is what this — again, trivial — model does: it spits out the payload it received, which here is the name "Vancouver activities" and the message "where should I visit in Vancouver", which maybe could have been phrased a bit better.
Before, it said "what should I visit in Vancouver", so I really wasn't thinking. And then there's the server response, which just tells me a few places I can visit in Vancouver, like Stanley Park and the Vancouver Aquarium. So again, this model of mine is simply a JSON parser outputting a hard-coded response, but I just wanted to show how to take a Python-based custom model — one that isn't trained using some library already supported by one of my runtimes — merge it into MLServer, create my own container for it, and ultimately deploy it on ModelMesh. And just like that, it's ready to serve.

So, let's see. Okay. So yeah, that's just what the script looks like, and then here are some links and contact information. ModelMesh is always looking for contributors — obviously it's a growing project, and KServe as well; there's lots to do on both sides. We have a biweekly call, which you can find more information on in the kserve channel — it's still under the Kubeflow Slack workspace, I guess, but that kserve channel is where you can talk about KServe and ModelMesh. And we have this monitoring and performance testing repository, which is interesting if you're after scripts that will actually stress test ModelMesh and its capabilities and easily deploy tens of thousands of small models — that would be the place to check out. And of course there are links to more about the custom runtimes and the Grafana dashboard, plus my contact information. I also put up my colleague Christian, who works on ModelMesh with me at IBM, in case you don't like me and you'd rather reach out to him — it doesn't matter. And yeah, that kind of concludes everything I wanted to talk about, so thanks for listening.

Okay — if you have a question, please speak clearly so everyone can hear.

Thanks for the talk. I had a question about the sidecars: I was wondering why the puller is separate from the mesh sidecar. Is that to make the puller pluggable, or is there some other architectural reason?

Yeah, that's definitely an architectural design question that predates me, but I know the puller is specific to the type of runtime, and I guess your question is why there are maybe multiple of them in the pods, or why it's not just embedded into the ModelMesh container itself. The idea is probably to keep them separate: the puller has code that is specific to the model server itself — TorchServe, for example, expects to get a model in a specific way and to make that model available in a specific way — while the mesh is kind of a separate entity that tries not to be merged with that type of code. I can connect you to the architect himself — were you from Red Hat, or no? Yes? Okay, well, we probably have each other on Slack, so if you're more interested I can connect you with the person who designed this at IBM. But yeah, that's all I've got.

No, thank you. Hi, thanks for the talk. I'm wondering what's the biggest use case you use ModelMesh for — how much concurrency for the workloads, how many models and what total model size, and also how many nodes are involved?

I definitely don't have all of those specific numbers, but this is just a subcomponent of some Watson service in production, and just here they have 60,000 models.
It was built for hundreds of thousands of models. If you're really interested in specific numbers, the ModelMesh performance repository has some statistics on just how far it can scale. If you want, we can even open it up together afterwards and I can point you to where that information is. But to give you an idea, this is a very small component of Watson and it's already managing 60,000 models, so you can imagine there's a lot more.

Okay, thanks. And there's another question about model versioning: basically, if you frequently update the model, how can ModelMesh make sure the applications load the same version of the model?

Sorry, what was the last part? — How can they make sure that all the applications load the same version of the model? Like, if I update the model today, how can I make sure other applications use version A instead of version B?

Well, I guess that's this idea of the VModel: the inference service is mapped to a version of the model, and it won't actually point to the new model until that new version is completely loaded and ready to go.

So, a model serving cluster might have dozens of nodes that are already loading the model — you mean that on the server side they already know they need to update the model at a certain time, rather than updating the old model at the same time as they're loading the new one? So I'm wondering whether there would be a model versioning mechanism.

I'm not sure I understand completely, but basically, at least from what I understand, the idea is that the inference service won't switch models until the new model is actually loaded. So traffic is only pointed to the already-deployed, current model before moving to the new one.

So basically it's decided on the business side? — Yeah, I guess so, and I don't know, maybe that's something ModelMesh doesn't have yet and KServe might, because I think you're touching on ideas like the multi-armed bandit approach, or canary rollouts — those more advanced kinds of features. I know KServe has them, and we're working to bring them into ModelMesh. But yeah. Thank you.