Hi everyone, thanks for joining us today. My name is Raphael and this is Christian. We're both part of the OpenTech team at IBM, where we primarily work as open-source software developers on ModelMesh and, more recently, Caikit and adjacent projects. So today we're going to talk about a few things. Here's an outline: we're going to talk about model serving, what it is and why we need it, and some of the things we do to actually get models into production. Then I'll introduce ModelMesh and KServe, which are two open-source projects, and I'll talk a bit about those. Then we'll move to the model inferencing side, where Christian will introduce Caikit and run through a bit of a demo as well.

As we probably all know by now, model serving, or I guess MLOps and the machine learning lifecycle, involves way more than just training a model and putting it out there. All of the models we interact with online, LLMs, anything, are deployed somewhere, and that concept is model serving. Model serving is basically deploying trained models in production and making them available to consumers, so they can send inference requests and actually receive outputs. Ultimately, this is about incorporating the AI we've created, these models we've built, and bringing that value into an application for an end user. One of the ways we do this is to treat a model as a microservice and expose it to clients as an API endpoint, whether that's REST or gRPC. This mirrors the strategies used for existing software stacks today, which may be why it's popular: developers already have a certain familiarity with this process.

But when it comes to deploying models like this, there are a bunch of considerations and a lot of obstacles along the way. For example, how do we actually containerize a model? Are inference response times acceptable? How do we handle different frameworks and different model formats? There are so many out there: TensorFlow, PyTorch, MLServer and so on. There are also other questions, like whether a model is under-scaled when we're deploying many models and some are used more heavily than others. Are we actually using resources efficiently and giving the models the resources they actually need? There might even be questions of how to handle rollouts of new versions of models: are we just updating models in place, or are we deploying new models, which are new versions, while holding onto old models that are no longer going to be used? So there are a lot of questions, and that's where an open-source project like KServe comes in.

KServe is a standard, cloud-agnostic model inference platform based entirely on Kubernetes, and it's built for these kinds of highly scalable use cases. It has a few high-level features. For example, it provides a standardized inference protocol, which is important because many of these frameworks, such as PyTorch or TensorFlow, have accepted it and even participate in implementing it. It also supports modern serverless inference workloads with request-based auto-scaling, including scale-to-zero on both CPU and GPU. And it has many other features, like pluggable components and support for things like pre- and post-processing, so manipulating the data before it actually gets to the model; monitoring, which is not just on the ops side, but also monitoring how the model is performing, things like bias or drift; and explainability as well.
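To make that standardized inference protocol a bit more concrete, here is a minimal sketch of what a REST request against an endpoint implementing the Open Inference Protocol (v2) can look like. The host, model name, input name, and tensor values are illustrative assumptions, not taken from the talk; a real deployment would use the URL of your own InferenceService.

```python
import requests

# Hypothetical endpoint of a deployed model that speaks the
# Open Inference Protocol (v2) over REST; host and model name are examples.
url = "http://models.example.com/v2/models/my-example-model/infer"

# The v2 protocol describes inputs as named tensors with a shape,
# a datatype, and the flattened data values.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [6.8, 2.8, 4.8, 1.4],
        }
    ]
}

resp = requests.post(url, json=payload, timeout=10)
resp.raise_for_status()
print(resp.json()["outputs"])  # predictions come back as named output tensors
```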
So with machine learning becoming more widely adopted in organizations, and especially in this era of large language models, it can be advantageous to deploy a large number of models. For example, a news classification service may train custom models for each news category, and there are a lot of positives to that. Another important reason organizations might train a lot of models is to protect data privacy, because it can be safer to isolate users' data and train models separately. But while you get the benefit of better inference accuracy and data privacy by building models for each use case like this, it becomes more challenging once you get into the hundreds, thousands, tens of thousands, or hundreds of thousands of models and you're deploying all of those on a Kubernetes cluster. This is where you hit hard limits: obviously resource limits, but also stronger limits like the maximum number of pods in Kubernetes and, additionally, the maximum number of IP addresses.

That's where another project connected to KServe comes in: the multi-model backend option in KServe, which is ModelMesh. ModelMesh is a multi-model serving framework that was open-sourced by IBM a couple of years ago. It had been used in production for maybe eight or nine years before that, and it continues to be used today, underpinning a number of IBM Cloud and Watson products such as watsonx Assistant, Watson Natural Language Understanding and so on. These are the kinds of products where, at least I think, ModelMesh came to fruition: cases where IBM was letting users come in and build their own models, which might have been small language models at the time, and a lot of them were left over and not used very frequently. You also have situations where models are used at certain times of the year and not at others. So you get into this situation where you need to scale models up and down, and ModelMesh is designed for exactly that: high-scale, high-density, frequently changing model use cases. The main thing it does is load and unload models to and from memory, to strike a balance between responsiveness to users and their computational footprint, or resource efficiency. As a result, the number of models that can be deployed in a cluster is no longer limited by the maximum pod limitation. Another thing I failed to mention is that KServe primarily works in a pod-per-model paradigm, which means that for every model you deploy, one pod spins up, and maybe there are extra pods like a transformer or something adjacent to it. So if you have a handful of pods for every model, scaling that up is pretty difficult once you're going into thousands of models. ModelMesh, by contrast, packs multiple models into a pod, which is why those maximum pod or IP address limitations are no longer relevant.

So again, at a high level, here are the ModelMesh features at a glance. There's cache management: the pods are managed as a distributed least-recently-used cache, which means models are loaded and unloaded based on usage recency and current request volume. So if a model hasn't been touched in a long time, it will probably be unloaded by this logic.
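To illustrate the least-recently-used idea behind that cache management, here is a small, self-contained sketch, not ModelMesh's actual implementation, of a model cache that unloads the model that hasn't been used for the longest time once it runs out of room. The capacity, the load_model callable, and the model names are all made up for illustration.

```python
from collections import OrderedDict

class LRUModelCache:
    """Toy least-recently-used cache for loaded models (illustration only)."""

    def __init__(self, capacity, load_model):
        self.capacity = capacity          # max number of models kept in memory
        self.load_model = load_model      # callable that actually loads a model by id
        self.loaded = OrderedDict()       # model_id -> loaded model, oldest first

    def get(self, model_id):
        if model_id in self.loaded:
            # Cache hit: mark the model as most recently used.
            self.loaded.move_to_end(model_id)
            return self.loaded[model_id]
        # Cache miss: unload the least recently used model if we're at capacity.
        if len(self.loaded) >= self.capacity:
            evicted_id, _ = self.loaded.popitem(last=False)
            print(f"unloading {evicted_id} to make room")
        self.loaded[model_id] = self.load_model(model_id)
        return self.loaded[model_id]

# Illustrative usage: "loading" a model here is just creating a placeholder object.
cache = LRUModelCache(capacity=2, load_model=lambda mid: f"<model {mid}>")
cache.get("news-sports")
cache.get("news-politics")
cache.get("news-weather")   # evicts "news-sports", the least recently used model
```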
A heavily used model, on the other hand, will continue to be loaded and stay loaded. Then there's intelligent placement and loading: model placement is balanced by both the cache age, how long a model has been in the cache across pods, and the request load. For example, if a particular model is under heavy load, it will be scaled across more pods. And then we have resiliency: failed model loads are automatically retried in different pods, and rolling updates are also handled automatically.

So that's an overview of ModelMesh and KServe, the model serving side, deploying models into production. But even when a model is deployed and ready to be consumed, one actually needs to know how to inference it. And developers who write applications that consume AI models are not necessarily AI experts who understand the intricate details of the models they use. Some considerations here: not all runtimes support all models; most of them, if not all, are proprietary, and each looks and feels a bit different, so there's no real standardization between them. There's dealing with different data structures: what kind of data do I actually have to pass to this model, and what kind of data structure do I expect to receive from it? Interfaces usually depend on the runtime and the model, so this is something you have to consider when you're thinking about which model inference API to use. And then there's overhead: serving models from other runtimes incurs overhead when a model is trained in one runtime and you try to port it over to another for deployment.

So ideally, and I'll explain what the ideal scenario for a developer looks like, we can treat models as a black-box function for their users. It's similar to cloud computing, where you might deploy an application to the cloud without having detailed knowledge of the cloud infrastructure underneath. And this comes with some benefits. For example, configurable backends allow models to be created in different runtimes and be served in a runtime-agnostic way, which is ideal. There's also a trend to specify APIs around a problem space as opposed to a specific model. For example, you would write an interface for text summarization and maybe a different interface for code generation. Ultimately, not all users are interested in the inner workings of a model or how it's deployed, and this abstraction between the author and the user allows us to focus on things like load balancing across different model instances and different versions, without having to change any of the model's code. And to that end, I'll introduce my colleague Christian, who's going to introduce the Caikit project, which does a lot of those things.

Thanks, Rafa. So Caikit is really an AI toolkit, and the idea behind Caikit is that when working with AI models, we have different roles involved in the process. There are MLOps engineers, there are application developers, and there are of course the data scientists who work on those models, and they usually all work in their own realms. One of the aims of Caikit is to bring all of those user roles together, so to have the model authors and the model users using the same framework.
So with Caikit, ideally you should be able to train your models and use them, with a bunch of simplified APIs and very clearly defined data structures that help with the process of creating AI models, using them, and integrating them into your applications. That hasn't been easy so far, and most application developers, including myself, are not AI experts or data scientists, so having a toolkit that makes it easy to integrate AI models to solve a task, to solve a problem, is really useful. You can see in this little architecture chart that we have those two roles of model authors and model users, or operators, and they would typically want to interact with simplified APIs. If they work with Caikit, they can request a model to do some inferencing or submit a training job, and this is all done via REST APIs that I will show in a bit. Caikit itself provides a gRPC server, and that process can run locally on your laptop while you're doing your development work; in production you would typically deploy it somewhere on cloud infrastructure, based on Kubernetes or Red Hat OpenShift. I'm going to explain in the next couple of slides how Caikit actually does that.

The conceptual view of Caikit is that it's an abstraction layer that allows application developers and model creators to consume AI and work with AI models seamlessly. Here are the two user profiles we're touching on. The AI model author is typically the data scientist who creates the models, who takes care of getting good data, knows how machine learning models and different algorithms work, and produces the model. The model operator is typically more like an application developer, somebody who wants to use the model and doesn't necessarily have to understand its internal workings and intricate details. As application developers we're used to using toolkits, right? If you work with Python or Go, there are libraries to solve almost any kind of challenge, and even though we like to program ourselves, we're very hesitant to reinvent the wheel. If there's a library out there that does something better than we could, we like to use it, and we've gotten used to that over the past decade or so. AI doesn't work like that as of now. With the emergence of Hugging Face, though, AI models have become widely available to application developers, but it's still a challenge for somebody who doesn't have the background.

All right, so how does Caikit actually do that? Caikit has two main concepts: modules and runtimes. As you can imagine, the runtime is the part of the program that actually runs your model; you need to deploy your model somewhere, and there are different runtimes for training and different runtimes for inference. We use Ray for training, Ray Serve for serving, and there's the text generation inference server (TGIS) or Triton to serve models for inferencing. On the usage side, we like to think of an AI model in terms of the task it solves. There are text summarization models, where you want to give the application a piece of text and get the summary back, or you want to analyze sentiment in chats, where you just want to give the actual chat text as input and get the sentiment back, not tensors. So Caikit designs the user interfaces around a task to be solved, and of course there are natural language processing tasks, vision tasks, and so on. I'll go into more details in a bit.
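To make that task-oriented idea concrete, here is a minimal sketch, not Caikit's actual API, of what an interface designed around a sentiment task might look like: the caller passes plain text and gets a small, clearly defined result object back instead of tensors, and any backend that fulfills the interface can serve the task. The class and field names below are illustrative assumptions.

```python
from dataclasses import dataclass
from typing import Protocol

@dataclass
class SentimentResult:
    label: str        # e.g. "POSITIVE" or "NEGATIVE"
    score: float      # model confidence between 0 and 1

class SentimentTask(Protocol):
    """Problem-space interface: any backend that can do sentiment analysis fits here."""
    def run(self, text: str) -> SentimentResult: ...

def classify_chat_message(model: SentimentTask, message: str) -> str:
    # The application only deals with text in and a labeled result out;
    # which runtime or framework serves the model is hidden behind the interface.
    result = model.run(message)
    return f"{result.label} ({result.score:.2f})"

class KeywordSentiment:
    """Trivial stand-in backend; a real one might call a serving runtime or a model library."""
    def run(self, text: str) -> SentimentResult:
        positive = any(w in text.lower() for w in ("great", "good", "love"))
        return SentimentResult(label="POSITIVE" if positive else "NEGATIVE", score=0.75)

print(classify_chat_message(KeywordSentiment(), "This demo is great"))
```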
Here's an overview chart of how the application works. The top blue corner is your application, and via Caikit you talk to different runtimes. In this picture it's a distributed example: you have several Caikit runtimes, probably running somewhere on cloud infrastructure. You can see that an application developer requests a model, and I think in this diagram you see one model for task A and another model for task B. The task is what you want to do: text summarization, natural language understanding, and so on. The bigger orange box is a Caikit runtime, and there are actually multiple instances here for load balancing. Inside that Caikit runtime, all of the models that the runtime makes available are being served, and you can see there is a backend adapter. Caikit itself doesn't have to be the runtime; it uses inference servers like Triton or TGIS here to actually do the heavy lifting. And then of course the pink box is the router: depending on which application, model, or task you're working with, the router decides where to serve that request, wherever the model is deployed.

And that leads me to my demo. For that I have to ask for a moment of patience, because I have to swap screens. All right, I'm going to start with our GitHub repository, or really the GitHub organization, for Caikit. If you go to GitHub and find caikit, you'll see a few repositories. There's the main caikit repository, which has the actual code for the project, and then there are some projects that will help you make use of it. There is the caikit-template, which I'll show later; it will help you get started, letting you create your own GitHub repository with all of the Caikit infrastructure, so to speak, laid out for you, where you can fill in your individual tasks fairly easily. And for developers who just want to get started quickly, most likely using Hugging Face models, Mark, who's in the room here too, has created this awesome Hugging Face demo, which I'm going to showcase.

So I'm going to click on this real quickly. When you open this repository, it tells you what it is: simple configurations for various Hugging Face models, plus some setup steps that I've already gone through. Once you have it locally, you have a choice of running a few example tasks: there's sentiment analysis, text summarization, even a text generation task, and I'm going to show object detection. When you clone this, and you've all done this before, you copy the URL and open it up in one of your favorite IDEs, whatever works for you. I'm using PyCharm here, and you can see I cloned this repository earlier; you can see the project structure here, and I'll close this, it was just preparation. Typically you land in the README, and the README nicely tells you all the steps you have to go through to get this up and running. It shows you the prerequisites: you obviously need Python, and you want to work with virtual environments, which I've already set up, and then you install all of your requirements, and that's all you need to do. Then in this repository, in the caikit-huggingface-demo folder, you'll find the examples I mentioned a minute ago: image classification, object detection, sentiment analysis, and text summarization, and you can choose which of those you want to try first.
Then you copy those into a models directory, which by default is where Caikit expects your models; of course that can be configured. I did that just before the demo, and I'm going to start by running it. Running it is fairly straightforward: you just start the app, or if you have a really cool IDE you can just click on the little run button here. I'm going to go the old-fashioned way on my terminal. When you start the app, it brings up the Caikit process, spins up the gRPC server, and tells you which models were loaded. Then you have this URL that you can click on, and you're presented with this really cool Gradio UI in your web browser. Here you can see there are three tabs; those are the three models I selected: sentiment analysis, object detection, and image segmentation. I'm going to try this out real quick. Okay, here I'm entering my text, and it's already giving me a sentiment: "this is good" is positive; "this demo" is negative; that's also going negative; well, that's positive, I think that's positive. Then let's try object detection. This wants me to upload an image; I found a couple of cat images on the internet just to showcase this. I'm going to upload this here, and very quickly this Hugging Face model that we're wrapping found that there's a couch with a cat, and it even found a potted plant right there in the back. So that works.

Now, in the repository I showed you earlier, none of these things are magic. You can easily see that all of these models have a configuration file, and the configuration file is fairly slim; in this case it's just an ID, and we're mapping those IDs to actual modules. If I look at the sentiment analysis here, I'll find the sentiment analysis task and the actual code. This is a Hugging Face module; we're extending from the module base, and every module typically has a load method that tells Caikit how to instantiate that module. Here you'll find this is a Hugging Face transformers pipeline, which you've probably used before, and the module is just a simple wrapper around it: it loads the config file and then loads that pipeline. And then of course you have a run method; that's the actual interface, so to speak, the API that a user or developer would want to interact with. This is fairly straightforward: each of our inferences gives us a class info, that's our sentiment, the label, positive or negative, and the score with the confidence. That class info object is part of our data model, which you find under the data models, and there it's fairly straightforward too: you have the prediction with its list of classes, two in our case, and each has a confidence and the actual class name.

Now, that is fun to play around with, but how would you go about doing it yourself? For this we want to go back to the Caikit GitHub repository, where you'll find the caikit-template. The caikit-template is just a GitHub template, which you might have used before, that lays out the typical project structure for a Caikit project. You can see it has your server code, your client code, and then of course the actual Caikit modules and the data model, all laid out for you. You typically use this by clicking on the "Use this template" button and creating your own repository, caikit-demo; I already did that, so the name isn't available anymore.
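If you're filling in your own module in such a template, here is a rough, simplified sketch in the spirit of the demo's sentiment module: a load step that builds a Hugging Face transformers pipeline and a run method that returns a small result object with a label and a confidence score. In the real demo this extends Caikit's module base class and uses Caikit's class info and prediction data model; the plain classes and names below are simplified stand-ins, not Caikit's actual API.

```python
from dataclasses import dataclass
from transformers import pipeline

@dataclass
class ClassInfo:
    class_name: str    # e.g. "POSITIVE" or "NEGATIVE"
    confidence: float  # model score between 0 and 1

class HuggingFaceSentimentModule:
    """Thin wrapper around a Hugging Face sentiment-analysis pipeline (sketch)."""

    def __init__(self, hf_pipeline):
        self._pipeline = hf_pipeline

    @classmethod
    def load(cls):
        # In the demo, load() is driven by the module's config file;
        # here we just build the transformers pipeline directly and let it
        # pick its default sentiment model.
        return cls(pipeline("sentiment-analysis"))

    def run(self, text: str) -> ClassInfo:
        # The pipeline returns a list of dicts like {"label": ..., "score": ...};
        # we map that onto our small result object.
        result = self._pipeline(text)[0]
        return ClassInfo(class_name=result["label"], confidence=result["score"])

# Illustrative usage (downloads the model on first run):
# module = HuggingFaceSentimentModule.load()
# print(module.run("Today is a great day for a demo"))
```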
I did that just before the demo, cloned it, and brought it up in my PyCharm IDE. Just as you saw in the structure on the GitHub repo for the template, you have your template code, which you would probably refactor and rename to whatever you're trying to achieve. In that folder you have your configuration files, which tell the Caikit runtime which models you have, what the interfaces are, and what the data model is. You have your actual module code, and the data model describes the inputs and outputs of your model. Now, in the template itself you have these hello world examples, and they're really straightforward, just text in, text out. Then I added a text sentiment task, and I used the code that you can also find on the Caikit website. I'm going to jump back, forgive me, for a second. When you go to the caikit repository, there's a link to the Caikit website; I'm going to go here real quick, there it is, and it has all the information you would need. That website describes what Caikit is and how it works, and it has this "try it out" tab that mirrors the same project structure I just showed in the template, gives you a few setup steps, and even the code you can use to wrap your Hugging Face module. So this is basically what I did: I followed these steps and filled this into my caikit-template repository; this is the code you see here in the text sentiment task and the classification, excuse me. And if you follow the README that's in that repository too, you can easily find out how to actually run it. So the demo has a server where you start your runtime; I'm going to do that right here. You'll probably see a few deprecation warnings because I'm using an older Python version, and then it shows you the URL that you can open up in your browser to see the FastAPI docs. FastAPI uses OpenAPI, which used to be called Swagger, and it's a very easy way to document your APIs and actually try them out. So here you see we have this hello world task that comes with the template repository, and I added this sentiment analysis task. We can click "Try it out", and I think the model ID is text_sentiment. Oops, let's see if that works. I got the ID right, and it does something; yeah, it has a positive sentiment for... what was my input here? There it is, text. Let's say "today is a great day for a demo". We run this, and indeed it works: positive sentiment, very confidently. And let's try the opposite: "today is a bad day". That should be negative, obviously, and indeed it's negative. So that's really all I wanted to show: how easy it is to use Caikit, where to find it, and how to get started using it yourself, looking at all the examples. That should help you wrap all kinds of Hugging Face models, or even just use the module wrappers we have and swap out the Hugging Face model for whatever model you find.

Do we have time for questions? We might have three minutes for questions. Hold on, hold on, yes, you're right, if I can find it, where's the presenter view... there you go. So these are just some links and some contact information: links to our GitHub repositories for the projects we talked about, and specifically when it comes to ModelMesh and KServe, there's also a Slack channel under the Kubeflow organization that you can join if you want to get involved or ask questions as well.

Thanks. I was wondering, what's next for Caikit and/or for ModelMesh?
You want to go first? You go first? Okay. So for ModelMesh, maybe: ModelMesh really aims to help with this high-scalability use case where there are a lot of small models that get frequently shoveled in and out of memory, or wherever you deploy them. But more recently, as you know, with ChatGPT and large language models, that paradigm isn't really helping, so for large language models the focus goes back to KServe and to trying to integrate runtimes that are well suited for large language models. But along with large language models there's also fine-tuning, or prompt tuning, so there is still a case for having a bunch of small models around one or two big models. This is really an area where we're just exploring ourselves. Caikit, however, plays a role there too. For text generation inference we're actually using Caikit in a stack, for example on Red Hat OpenShift Data Science, where Caikit is used with ModelMesh and KServe and TGIS to really serve large language models and the use cases around them. So it's all one big bucket, but it's in development, so I'm sure that within the next couple of months, on Red Hat, specifically Red Hat OpenShift Data Science, you will actually be able to go out there and explore, and maybe try Caikit with a large language model on Red Hat infrastructure. I think that's it.

I think on the ModelMesh side it's also about exploring more use cases with LLMs. Maybe there's that use case of, like I mentioned, experimentation: it's not just "I have a large language model, let's deploy that"; I might have multiple versions of it, and then what do you do with the older versions that aren't used as often? This idea of loading in and out of memory and trying to be efficient about resources with those types of models, I think there are some use cases there that will probably be explored too. Definitely. Thanks for the question. And we've wrapped it up on time. You did, okay. Thanks, everyone.