Hi, everyone. How are you? My name is Alexa Griffith, and I, like you said, am on the Data Science Platform team at Bloomberg. I've been there now for a few months; I started in January. What the Data Science Platform team does is support multiple machine learning features, from training to serving models. I work specifically on the infrastructure team that helps with inference services, and I'm lucky enough to work with Dan Sun, who is a co-founder of KServe. I've been a software engineer now for a little over two and a half years, almost three years now, so a lot of things are new to me, which is great. I find that I enjoy learning new things by writing or discussing those topics in blogs or podcasts, and sometimes being a beginner can hopefully bring some new insight or a new perspective to a topic. And I like to have fun with it, so there are going to be some silly stick-figure drawings in the mix.

When I joined Bloomberg, I was very unfamiliar with KServe, and I find that abstractions on top of Kubernetes can be a little confusing, or a little interesting, to talk about. So in this talk, I'll present both the basic and advanced features of KServe from that perspective. As you can tell, this is my first PR that was just merged for KServe. I'm a really big contributor, so I'm obviously very qualified to give this talk. No, this is a joke, but it is the first of, hopefully, many contributions.

Yeah, so let's explore KServe. Does anyone know who lives on Planet KServe? Who? Oh, you don't know? Okay. Well, it's the K-natives. Apparently there's also a best-joke competition today that I did not know about. I'm not saying that joke could win, but it might, if you vote for it.

Anyways, okay, given that this is KubeCon, I'm not really going to go over exactly what a pod or a container is, but I do want to focus on the basic Kubernetes controller pattern, because I think that's relevant to my talk today. Kubernetes implements a controller pattern, and the control plane is inside the dotted line. It has several components, and each component is in charge of handling a different part of the system. The API server is like the front end, handling requests; the scheduler handles assigning pods to nodes, evictions, et cetera. Similarly, there are components on the nodes themselves that make sure everything's running smoothly, like the network proxy that handles routing traffic to your pods, and there are other add-ons you can plug in, like DNS.

Kubernetes also allows you to create these things called custom resource definitions, which I'm sure most of you are familiar with, and custom controllers. When you combine these two, you get a declarative API, a desired-state system. So this is what KServe does: it extends the Kubernetes API.

Now, I'm not a machine learning engineer, but I work pretty closely with them, and I know there are a few parts of the machine learning lifecycle, or workflow. Of course, you have the data pipeline and its development, and there's also a training component, where you need a certain data set to train the model. And then, once you've trained the model, there's an inference component: you use the model live to produce some output. So today, I'm going to focus on the inference component. That's what KServe focuses on.
To be even more specific, KServe focuses on what I'm going to call core inference, and on the deploying and monitoring of that core inference part of an inference service. The goal of KServe is to make both deploying and monitoring inference services super easy. So let's discuss the components that make up this part of the machine learning workflow and some commonalities between them, just so we all know what we're talking about before we get into what KServe really does.

A lot of inference workflows have a predictor in common. So what exactly is a predictor? A predictor consists of a model. Maybe it's something like a TensorFlow model that runs some regression and does some calculations. And it runs on a serving runtime, like TorchServe, for example. All a model server does is wrap the model inside a REST API so clients can talk to the model. But this is a very simple use case. Usually inference services also have some more components, like what we're going to call a transformer. A transformer is optional but very commonly used, and it calls the predictor. It handles pre- and post-processing of the data: say you have raw data and you want to turn it into vectors before you hit your model. This is a very simple view of a workflow as well; you can plug in a lot of other components, like something we call an explainer, or detectors that watch how the model is doing. We're going to talk about those a little later, but at its simplest, this is the transformer-plus-predictor picture.

And you can't just make this and go to production, right? There are a few things you still need to do to have a production-ready inference service. For example, you might not be able to use any of the out-of-the-box model servers, and then you might need to create your own, which takes some work, some code. You're probably going to need an ingress to receive requests from external services. You need metrics and monitoring, and you need to set those up yourself. You need load balancing, distributed tracing, and scaling, with GPUs and/or CPUs. And you might need a team that has extensive knowledge of the tools that allow you to do this, tools like Knative and Istio that you might want to use.

So the question is, how can this be made easier? Well, in comes KServe. KServe was formerly known as KFServing. It was part of the Kubeflow project, and it has now been pulled out into its own thing, called KServe. This is the first sentence KServe has on its GitHub, and I'll just read it real quick: KServe provides a Kubernetes custom resource definition for serving machine learning models on arbitrary frameworks. It aims to solve production model serving use cases by providing performant, high-abstraction interfaces for common ML frameworks like TensorFlow, XGBoost, scikit-learn, PyTorch, and ONNX.

So what does that really mean? Basically, KServe is a custom resource definition that is built on top of, and abstracts away, its underlying layers, like Knative and Istio, for example. So it makes it really easy to use: you have a very simple YAML, and everything is created underneath for you. KServe supports many common machine learning frameworks that data scientists use, and it also allows you to build your own custom one if you need to.
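As a rough sketch of what one of those simple YAMLs can look like, here is a minimal InferenceService in the older framework-keyed predictor style from the KServe examples (a newer model-format style comes up later in the talk). The name and storageUri are placeholders, so treat the exact values as assumptions:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                # placeholder name
spec:
  predictor:
    sklearn:
      # Placeholder model location; KServe's storage initializer downloads
      # the model artifact from here before the model server starts.
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
```

Applying something like this with kubectl is roughly all a data scientist has to do; the controller creates the underlying resources described in the rest of the talk.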
So KServe is made for machine learning workflows, and it allows data scientists to easily plug in their model and deploy their workflow without needing to manually deploy all of the infrastructure that makes that workflow work efficiently.

So let's look at the main components that KServe is built upon. Of course, on the lowest layer you have your compute cluster, your GPUs and CPUs. On top of that we have Kubernetes, which we're running on. And then there's this optional layer of Knative and Istio. Knative, if you're not familiar with it, is an open source community project that adds components for deploying, running, and managing serverless applications on Kubernetes. The reason I'm going to talk about it a lot today, even though it's optional, is that I feel it actually makes the use cases for KServe a lot more interesting, in terms of the things you can do. Some of the features it enables are revision control, traffic management, and rate limiting. So if you see a little alien pop up later, just know that that is a Knative component I'm introducing. It is optional, but nice to have. Istio is an open source project that focuses on the security, connection, and monitoring of microservices. And at the top we have KServe. KServe, like I said, is integrated with a lot of common machine learning frameworks, and we'll talk more about those later.

Okay, so here are some of the main features of KServe. It allows you to scale up and down, to and from zero. You can have request-based autoscaling on your CPUs or GPUs. It also offers request batching, if that's what you want, and you can do request and response logging as well. It offers security with authentication and authorization, traffic management, distributed tracing, and out-of-the-box metrics. It also offers rollout strategies like canary, which I'll do a little demo of later. So those are just a few of the main features of KServe, and next let's talk about how it's actually able to achieve that.

Okay, so let's talk about the predictor. We talked about a predictor before in general terms, but let's walk through how KServe enhances the single-model use case of a predictor. First, we have something called the storage initializer, which is an init container. It comes up, reaches out to your model storage, wherever your model is stored (there are multiple options in KServe for that), downloads your model, and then goes away. The next two things that are up and running are the queue-proxy and the kserve-container. The queue-proxy is a Knative component, so it is optional, but it routes requests to the kserve-container with a configured maximum number of concurrent requests, which gives you some rate limiting. And the main part of the inference service is the model and the model server, like we talked about; KServe allows you to easily plug in some things out of the box, as well as create your own custom model server.

Next, let's talk about the transformer. Transformers are commonly used for pre- and post-processing, as I discussed earlier. Another great thing about this is that you can easily plug in certain things for feature augmentation, like the Feast feature store, in KServe, which is really nice. And you can also create your own custom transformer if you like.
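As a rough sketch of how a custom transformer sits next to the predictor in the spec, here is the general shape from the KServe custom-transformer examples. The container image and names are hypothetical, and the exact fields are worth checking against the current docs:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: model-with-transformer            # placeholder name
spec:
  transformer:
    containers:
      - name: kserve-container
        # Hypothetical image that does pre/post-processing and then calls
        # the predictor for the actual inference.
        image: registry.example.com/my-custom-transformer:latest
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # placeholder
```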
And now that we have a better idea of how KServe enhances the core inference components, let's discuss the control plane. If you remember from the control plane slide, the control plane handles the state of the system. So the inference service defines a state, and the KServe controller is in charge of reconciling it, making sure we get to the desired state. It creates all the resources the inference service needs below it. Next we have the Knative Service, which manages revisions and network routing, then the Knative Revision, which handles those revisions directly, and then the Deployment. If you don't have Knative, for example, you wouldn't have the two resources on the left; you'd just have the Deployment directly, but you wouldn't get a few of those features that Knative adds. Moving on from the Deployment, traffic flows down to the replicas. You can also plug in the Knative Pod Autoscaler or the Horizontal Pod Autoscaler, which handles the replicas and scaling up and down, and then you'll have something like your transformer pods and your predictor pods.

So this is just a higher-level overview of what we just went through. Again, you'll see the model storage there, like I mentioned. There are multiple types of model storage supported by KServe, like S3, HTTP(S) URIs, Azure, and persistent volume claims, with more being added in the next version as well. So, more options there.

This is a wider view of what I just showed, and the reason I like it is because I used this command-line tool called kubectl tree, which I think is really helpful just to see everything. I'm not going to go into each box, but the pink resources are Istio's, the dark blue APIs are KServe's, and the gray APIs are Knative's. I think this is helpful, especially as someone who's learning about this, to get a view of all the resources that are created, and then you can kubectl describe, or look into, each of those resources and see what their settings are. So just for understanding how KServe works, I thought this was super helpful.

Okay, so next, now that we've talked about the control plane, let's talk a little bit about the data plane. Let's follow our request path. You have the AI application; it hits the transformer service, then the queue-proxy, then the transformer pod, where it maybe does some pre-processing. Then it reaches out to the predictor service, hits the predictor pod and that queue-proxy, and then it runs the model; it runs your inference for you.

Okay, so that was the data plane, and now let's talk about something interesting that was just added: the inference graph. Inference services can be chained into a pipeline with KServe, which is super nice and useful, and this new feature allows for chaining them in different ways. The KServe contributors have been working very hard on it. So here are some of the different node types you can use. You can use a Sequence node, which allows you to connect inference services one after another, which is like how I have it on the slide. You can use a Switch node, which allows users to select an inference service to handle a request based on a condition, kind of like, which one do you want to go to? A Splitter node allows users to split traffic between multiple inference services using a weighted distribution, and you can also use something called an Ensemble node, which allows users to route requests to multiple models in parallel. So those are some of the new pieces being added. Yeah, I think that's exciting.
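To give a feel for what that looks like as a resource, here is a rough InferenceGraph sketch chaining two inference services in a sequence. The service names are placeholders, and since this API was brand new at the time, the exact field names (routerType, steps, data) should be verified against the current KServe docs:

```yaml
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: example-graph                  # placeholder name
spec:
  nodes:
    root:
      routerType: Sequence             # other router types: Switch, Splitter, Ensemble
      steps:
        - serviceName: preprocess-isvc # placeholder InferenceService
          name: preprocess
        - serviceName: classifier-isvc # placeholder InferenceService
          name: classify
          data: $request               # pass the original request on to this step
```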
So another important feature of KServe is that it uses a standardized inference protocol across all of the serving runtimes, so that people can easily switch from, say, TensorFlow to PyTorch without needing to change the way they talk to the inference service. This also makes testing and benchmarking really easy for the user. It's just something that's really nice for users, to be able to switch between the models they need. So I have a cute little animation here. He's like, hi, do you use KServe's standard inference protocol? And she's like, yeah, I have standard endpoints for health, server and model metadata, and inference. And he's like, great, how can I talk to you? HTTP or gRPC, because KServe supports both.

Okay, so here are some of the data plane plugins. Like I said, we have this core inference part, which is the transformer and the predictor. But sometimes we need to add some more components and plugins to it as well. One is called the explainer. So instead of hitting the predict endpoint, in this case you would hit the explain endpoint, which then calls predict. And there are multiple explainers you can use. Basically, the way I understand it, an explainer answers: why did my model make this prediction for a given instance? So if you have an image and the model classifies that image as a dog, then the explainer will tell you which part of the image was most relevant to that result. So that's one thing you can add. The batcher is a sidecar that sits between the queue-proxy and the kserve-container, and you can set things like max latency and max batch size. And then there's also a monitor-and-logger plugin that you can add. It does asynchronous payload logging and monitors the distribution of incoming requests with detectors, and we'll talk about a few of those detectors on the next slide. You can also hook up alerting. Also, I see people taking pictures, which is totally fine, but just to let you know, I do have these slides uploaded as well if you want to see them later, if you'd like to reference them. I have them on the site.

So yeah, let's talk a little bit about how the monitor and logger works. The broker routes events to the consumers, and you can have something called a trigger, which forwards events to a message-dumper service. So you can set up asynchronous payload logging using things like CloudEvents and Knative eventing. And you can monitor incoming requests with detectors in order to trust the models. A drift detector checks whether the distribution of incoming requests diverges from a reference distribution, like the training data, for example. An outlier detector flags single instances that fall outside the distribution. And you can have something like a bias detector, or an adversarial detector for adversarially modified inputs. And with all of this, you can also hook up alerting, so you can have good observability into your inference service. Yeah.
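Here is a rough sketch of how the batcher and logger plugins attach to a predictor spec. The numbers and the logger URL are placeholders, and my understanding from the KServe docs is that maxLatency is in milliseconds, so double-check the values for your setup:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-batched             # placeholder name
spec:
  predictor:
    batcher:
      maxBatchSize: 32                   # placeholder: requests per batch
      maxLatency: 5000                   # placeholder: max wait before flushing a batch (ms, I believe)
    logger:
      mode: all                          # log both requests and responses
      url: http://message-dumper.default/   # placeholder sink, e.g. a Knative eventing broker or service
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # placeholder
```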
Okay, so we've touched on model serving runtimes a little bit, and there are quite a few of them, and KServe keeps adding support for more out of the box. For example, you can use the Triton Inference Server, TorchServe, the SKLearn server, or the XGBoost server, just to name a few, and with those you can serve TensorFlow, TorchScript, PyTorch, ONNX, and XGBoost models. So there are a lot of options, and there are a lot more coming out of the box, which makes it really easy for data scientists to just plug in their serving runtime and get going with their inference service.

So how does this actually look? You can define the serving runtime if you want, or you can just define the model format in the YAML. This is a very basic inference service; this is what the YAML would look like. And you can define that you're using, say, an sklearn model. And from here, in the new format, the v2 format, KServe is able to just infer which model serving runtime you're running. You don't even have to define it, but if you want to, you can. (There's a rough sketch of this newer format a little further down.) So what does this do? It creates a custom resource called a ServingRuntime. On the right, you'll see the custom resource that it creates, and this way you, as a user, don't have to set this up yourself. It's just generated for you, and you can configure it as you like.

Okay, so what we've discussed so far is single model serving. Basically, what that means is that for each pod, we have one model, right? And you probably have at least one extra replica of each pod running, just to be safe. So what happens if you have, I don't know, let's say five models? You have at least ten pods running. But you also probably have a transformer, and a replica of each of those too, so for each model you probably have at least four pods running. As you can tell, this is okay for maybe five models, but what happens when you get to a hundred or a thousand models? Obviously this isn't very scalable, and it's very costly. Kubernetes also has a max pod limit, where the best practice is around 110 pods per node, and there's also an IP address limitation. So with, say, at most 4,096 IP addresses, only about 1,024 models can be served at a time with this setup.

So there's a way we can do better, and that's called multi-model serving. Multi-model serving is designed to address the three types of limitations that KServe runs into that we just discussed: the compute resource limitation, since there's overhead to running just one model per pod, the maximum pod limitation, and the maximum IP address limitation. But there's still a limitation even to this model, which allows you to run multiple models on one single serving runtime. Here, each replica is required to hold the same set of models, so on the slide you always see the same defined sets of models together, and each pod can only hold so many models, and those have to be defined. So in this case, you'll still run into some limitations if you start running at scale and you have a lot of models. Don't worry, though, because the KServe contributors have been working very hard on an even better solution for the high-volume, high-density use case, and we call this multi-model serving with ModelMesh.

Okay, so imagine that you have a large number of models deployed, which isn't uncommon. An example of this use case would be a news classification service. You may have a model for every type of news category, and for privacy and security reasons, you might need to train those at the user level. So, a lot of small running models.
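Before going deeper into ModelMesh, here is a rough sketch of the newer model-format style YAML mentioned above, where KServe picks the serving runtime from the declared model format. The name and storageUri are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris-v2                  # placeholder name
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                    # KServe matches this against a ServingRuntime that supports sklearn
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # placeholder
```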
So using many models in this use case has the benefit of better and more accurate results, but you're going to run into a lot of the issues we just discussed with inference services at this scale running on a Kubernetes cluster. Another use case: neural-network-based models are best suited to GPUs, but GPUs are usually expensive, and running many of them is costly, so using something like what we're going to describe here with ModelMesh is usually a good option.

So multi-model serving with ModelMesh is a solution to this problem, and it allows inference services to scale. It decreases the average resource overhead per model, so it becomes more cost-efficient, and the number of models that can be deployed in a cluster is no longer limited by the maximum pod limitation and the IP address limitation. So what does this do? Basically, you have the mesh, which is routing the requests, and you have the puller, which replaces the storage initializer, in a way. With the storage initializer, we had an init container that came up, pulled the model, and then went away. Now we have a puller, which is a long-running sidecar that is constantly pulling models as needed. So that's the very simple picture. Of course, you're probably going to have at least one replica again, and this just shows that you can have multiple models on different pods, or on different replicas of the same serving runtime. So model server X in the left pod can have model A, B, or C, and then the right pod can have any other number of models that run on model server X.

Okay, so now I have a small, very short demo. This is just going to be setting up an inference service; this is one of our main examples of how to do that. So yeah, I made a namespace and set it in my kubectl context, and then I applied one of the sklearn YAMLs for an inference service, which is very similar to what I just showed you. I already had it applied, because I didn't want any issues and didn't want to have to wait for it to come up, and you can see it's already running. And here, I'm just looking at different resources. This is the inference service resource; you can see it's Ready: true, and "latest" means 100% of the traffic is going to the latest revision. I'm setting up the ingress gateway because I want to port-forward and send a request to the predictor to get some prediction results. Okay, so I port-forward there and set my ingress port. Here's my input; it's just a very simple input that I'm going to submit to the inference service. And now I curl it, and you can see that under predictions, we have an answer.

Okay, so what I'm going to show next is a quick example of a canary rollout, which is one of the features of KServe. So here, I added a canaryTrafficPercent of 10 to the YAML, and I'm going to apply it. So now it's configured, and what we should see is that 10% of the traffic routes to the new version and 90% stays on the old version. So here, I'm checking, and you can see it's ready, with 90% on the previous version and 10% of the traffic on the latest version. And then, if I remove that, I can promote this version to 100% of the traffic. And because of time... oh yeah, okay, now it's good. Yay. Okay, that was the little demo.
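For reference, here is roughly what that canary change looks like. The talk describes it as an annotation, but the KServe canary rollout docs I've seen put canaryTrafficPercent as a field on the predictor spec, so verify the exact placement for your version; the model locations are placeholders:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                     # placeholder name
spec:
  predictor:
    canaryTrafficPercent: 10             # send 10% of traffic to the new revision, 90% stays on the old one
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model-v2   # placeholder new model version
```

Removing canaryTrafficPercent (or raising it to 100) promotes the new revision to take all of the traffic, which is what the last step of the demo shows.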
And really quickly, there are a lot of cool new features coming out. I'm just going to leave this slide up, because we don't have a lot of time, but be sure to go look at the website, where you can see all of these features, and they're listed on the KServe repo as well. Congrats to everyone who has worked on this. I know a lot of people have put a lot of hard work into it, and thanks to the KServe contributors; it's a great community to be a part of. I'm so glad that I joined it, and I'm having a great time. Also, here are some other talks that Bloomberg is going to be involved in, and here's a careers page link if you're interested. So thank you for coming on this journey with me to learn about KServe.

Now, do we have any questions for Alexa? I want to hear. Let's do this. Take the microphone over.

What do you use AI for at Bloomberg? Is it for predicting prices of assets, analyzing news maybe, or something hot like this?

Yeah. So the question was, what do we use AI for at Bloomberg? Bloomberg is a financial services company, and we use it for a lot of things, like news. News is one use case for it. Yeah, and a lot of financial models.

Can you please talk about the differences between Seldon Core and using KFServing, or KServe as it's now named?

Can you repeat that?

Seldon Core. You know Seldon Core? It's also a possibility to run deployments and AI models using the Kubernetes API.

Okay. Yeah. So the question was, what's the difference between Seldon Core and KServe? That's a good question. Why don't we take it offline, so I can make sure I give you a good answer.

Hi. Thanks for the talk, by the way. I have a question about the multi-model use case that you talked about, the mesh. So in this case, whenever a request comes in, is the model pulled at runtime, at the moment the request comes in? Or how does it work? Because I see that the initializer is replaced with something called a puller or something like that. Does that happen when the pod comes up, or when the API request for a prediction comes in?

Yeah. So the question was about the difference between the storage initializer and the puller. In ModelMesh, the storage initializer is replaced by the puller, which is a long-running sidecar, so it's able to pull models as needed as you're hitting those models. Yeah.

Hello. So you talked about the transformers, and I was wondering: something I've been recently exploring is adding the pre-processing steps to the graph itself. Is that something possible with KServe?

I'm sorry, could you add the pre-processing steps to what?

Adding the pre-processing steps to the model itself, something like TensorFlow Transform. Is that possible with KServe? Could you shed a little light on that?

Yeah. So KServe allows you to do pre- and post-processing in the transformer. You can use some of the things that come out of the box, or you can create your own custom one. So there are some options there as to what you can use for the pre-processor.

Thank you for the talk. I see you talked about scale to zero. Do you have to use that? Like, what if you have an application that you want to have always up, or it's mission-critical? Is that configurable, or do you have to use it?

Yeah, definitely. So scale to zero is something you can configure. Sometimes you want it, to save resources, but for mission-critical things, or things that can't handle the latency of a pod starting back up after it's been scaled down, you wouldn't want to use scale to zero; you'd want the service up all the time.
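As a rough sketch of that knob: the InferenceService spec has minReplicas and maxReplicas fields on each component, and my understanding is that keeping minReplicas at 1 or higher prevents scale to zero, while 0 allows it. The rest of this YAML is placeholder:

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: always-on-model                  # placeholder name
spec:
  predictor:
    minReplicas: 1                       # keep at least one replica running; 0 would allow scale to zero
    maxReplicas: 5                       # placeholder upper bound for autoscaling
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model   # placeholder
```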
Who do we give the microphone to? I'll have that back. I'm afraid we don't have any more time for questions. Thank you very much, Alexa. Thank you. Give her a big round of applause. Thank you for coming to my talk.