Happy Friday, everyone. Hope you've all been enjoying KubeCon so far. Today we're going to focus on the topic of model serving on Kubernetes. My name is Dan; I run the inference platform at Bloomberg. And I'm Theofilos; I work for AWS.

Before I get started, I want to get a sense of the room. Today we're really talking about the last mile of the machine learning lifecycle: deploying a model to production. So, how many people here have been able to get an AI model service up and running on your laptop? Anyone who has done that before? How long did it take, minutes, hours? How many of you have been able to get your model to production? Cool, a fair bit. And how many of you have been able to get your model running on Kubernetes? Oh, wow, more than I expected. How many of you have deployed a large language model? OK, one, two, three. Cool. So let's talk about that, and about how KServe can help you deploy large language models, for example something like ChatGPT, which is all the hype right now. We'll cover that later.

As you might expect, launching an AI application pilot is relatively easy, deceptively easy. You can load up your model on your laptop, put it behind a FastAPI server endpoint, send it inference requests, and pretty easily get responses back. But when you think about moving this application to production, it gets a lot more complex, and there are lots of things you need to consider. It's not just the model. Oftentimes you'll need to do some feature processing, to process your image or tokenize your text input, and you may need to deploy that featurization code as a microservice and look up a feature store. You may have your model in cloud storage and wonder how to get it downloaded onto your server process or container, and how the feature transformer and the model service should communicate. In production you often need to handle a large amount of traffic simultaneously, so it's critical to be able to scale up your inference service in a cost-effective way. Another top priority is that your model may deal with sensitive data, so you need to secure your service endpoints, and make sure the communication between the transformer and the predictor is secured and the data is encrypted. You also need to make sure your model deployment process is repeatable and reproducible, so that if there is an outage you can replicate your deployment within a few minutes, and so that it's easy to roll back to previous deployments. Last but not least, after your model is deployed to production, you need all the available data, like traces, metrics, and logs, to analyze your model's behavior, make sure it meets your production SLAs, and troubleshoot problems quickly with all that observability data.

Kubernetes is actually a great platform for this. It's a container orchestration platform specifically designed for deploying microservices, with all the capabilities for resource management and load balancing, and it provides a really nice cloud-native way to deploy your models with declarative YAML specs. This really makes sure your deployment is repeatable and reproducible.
It can run either in the cloud or on-prem, in any environment. It also provides horizontal scaling, which works on both CPU and GPU devices. And Kubernetes is fault-tolerant: it can detect container failures, it's resilient to outages, and it minimizes downtime.

So Kubernetes is great, but it's designed more for developers, and AI engineers usually aren't equipped with the necessary Kubernetes knowledge. To bridge this gap, KServe is designed as a highly scalable, standards-based, cloud-native ML inference platform on Kubernetes, which encapsulates all the complexity of Kubernetes deployment to simplify getting to production. KServe can currently be deployed standalone, so you can have the serving component running in an isolated production environment (sometimes you may not want to mix training and serving in the same cluster), or just for experimentation and testing purposes. You can also deploy it as an add-on component within Kubeflow, which is itself currently moving to CNCF, in both cloud and on-premises environments.

A little history of the KServe project. It was originally introduced back at KubeCon 2019 in San Diego; not sure if anyone here happened to be at that talk, but it has been a while. In 2020, NVIDIA contributed the inference protocol to the project. In 2021, IBM contributed ModelMesh. The project started as a subproject under the Kubeflow umbrella, and as its scope continued to expand, in late 2022 it became an independent project, which is currently hosted under the LF AI & Data Foundation. As you can see, there are now 193 contributors and 20 core maintainers, and 26 known adopter companies listed, though there are likely more we don't know about. According to the survey, 63% of Kubeflow users are using KServe, and you can see the logos of companies currently using it.

Next I'm going to talk about the state of the project. As of the KServe 0.10 release, the features we currently support fall into three pillars: core inference, advanced inference, and model explainability and monitoring. On the core inference side, it offers feature transformation and inference with a set of pre-built serving runtimes that ship with the standard KServe installation. It also allows you to build your own serving runtimes with the Python SDK, and those custom runtimes can follow the same standardized inference protocol. On the advanced inference side, we currently support ModelMesh, which allows you to deploy thousands of models at scale, and the inference graph, which is for use cases where you want to build more sophisticated ML pipelines with multiple models chained together. For model explainability, KServe offers text, image, and tabular explainers, and for monitoring it supports bias, adversarial, outlier, and drift detectors, to make sure your model produces reliable predictions in production.

Next, let's take a look at the core concepts in KServe: the ServingRuntime and the InferenceService. An InferenceService lets you specify the model format and version of a given trained model, as in the sketch below.
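As a reference point, here is a minimal sketch of that split between user-facing and admin-facing resources, assuming a scikit-learn model. The bucket path, image, and names are hypothetical, and the field layout follows the KServe v1beta1/v1alpha1 APIs as we understand them rather than being an exact excerpt from the talk.

```yaml
# Minimal InferenceService sketch: the user only declares the model format and
# where the trained model lives; KServe selects a matching serving runtime.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn                                        # used for runtime auto-selection
      storageUri: gs://example-bucket/models/sklearn/iris    # hypothetical model location
---
# Admin-side ServingRuntime sketch: it declares which formats it can serve and
# the container template (image, args) used to launch the model server pods.
apiVersion: serving.kserve.io/v1alpha1
kind: ServingRuntime
metadata:
  name: example-sklearnserver
spec:
  supportedModelFormats:
    - name: sklearn
      version: "1"
      autoSelect: true                                       # eligible for auto-selection
  containers:
    - name: kserve-container
      image: example.registry/sklearnserver:latest           # hypothetical image
      args:
        - --model_name={{.Name}}
        - --model_dir=/mnt/models
```

Applying just the InferenceService with kubectl is enough, in this sketch, for KServe to match the declared model format against a runtime like the one above and launch the serving pods.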
The ServingRuntime defines the pod template used to serve particular model formats. The ServingRuntime is really aimed at KServe admins: you need a little Kubernetes knowledge to define the pod spec for serving your models, and to specify the container image and all the arguments and environment variables necessary to launch the inference service. On the user side, it's a lot simpler: all you need to specify is the model format and the version of the model. KServe then automatically selects the right serving runtime, based on the model format you specified in the InferenceService, to launch your model. As you can see from the matrix, KServe supports a wide range of serving runtimes covering various ML frameworks, like scikit-learn, XGBoost, TensorFlow, PyTorch, and ONNX, and you can also bring your own custom model format.

Since KServe supports a variety of serving runtimes, it's really important to give users a cohesive inference experience. So KServe developed an inference protocol that enables a standardized, high-performance data plane. We're seeing a lot more adoption of this protocol, which is already implemented by Seldon MLServer, Triton Inference Server, TorchServe, OpenVINO, and the AMD Inference Server. We recently renamed it the Open Inference Protocol. The idea is that we have the core inference protocol plus a forum where we encourage people to keep improving it, so that it becomes more of an industry-standard inference protocol for everyone. This really enables interoperability between all these serving runtimes. More importantly, it makes it possible to build client and benchmarking tools that work with all of these runtimes, so you don't have to change your client or tooling when you want to experiment with different serving runtimes or models. The repo just went up as of yesterday, so check it out.

Here is an example: with the Triton Inference Server runtime and KServe, you can cover most of the popular ML frameworks. You can send the same input whether the model is TensorFlow, PyTorch, XGBoost, or scikit-learn, and you get the same output format back. So when you're experimenting, you can easily switch between different serving runtimes or even different model formats.

The Open Inference Protocol mainly defines REST and gRPC APIs, which is a trade-off between ease of use and high performance. It defines model server health, model metadata, and the core inference endpoints. Let's take a look at the most important endpoint, inference. It defines the inputs and outputs using a tensor schema, where you specify the shape and data type of each tensor along with the data. The data usually needs to be flattened into a one-dimensional array when you send it over the wire. Data scientists may be more familiar with common data types like a NumPy array or a pandas DataFrame. A NumPy array translates easily into a tensor input with the corresponding data type, since the shape carries over directly. But a pandas DataFrame may have mixed column types, with integers, floats, or strings, so in that case you can define multiple inputs, as in the sketch below.
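To illustrate, here is a hedged sketch of what a v2 (Open Inference Protocol) request body can look like, shown in YAML for readability even though it travels as JSON POSTed to /v2/models/<model-name>/infer. The tensor names, values, and the DataFrame-style columns are all made up.

```yaml
# A 2x4 float tensor (e.g. two feature rows) flattened row by row into a single
# one-dimensional data array:
inputs:
  - name: input-0
    shape: [2, 4]
    datatype: FP32
    data: [6.8, 2.8, 4.8, 1.4, 6.0, 3.4, 4.5, 1.6]
---
# DataFrame-like data with mixed column types: each column becomes its own
# named input with its own datatype (the column names here are invented).
inputs:
  - name: age
    shape: [2]
    datatype: INT64
    data: [34, 51]
  - name: income
    shape: [2]
    datatype: FP32
    data: [52000.0, 61500.5]
  - name: country
    shape: [2]
    datatype: BYTES
    data: ["US", "GR"]
```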
Each input can have its own shape and data type. We also provide codecs that let you easily translate between NumPy arrays or pandas DataFrames and the input tensors that get serialized on the wire.

OK, next I'm going to hand over to Theofilos to talk about more exciting stuff in KServe. Thanks, Dan. Can you hear me OK? So here on the left-hand side we have the YAML definition where we can see the transformer and the predictor. The transformer, for example, connects to a feature store to fetch the feature vector and sends the request to the predictor, which does what it is supposed to do: the matrix multiplication between the features and the weights we have loaded in the model server. So what is the role of KServe here? The KServe controller looks at the InferenceService and reconciles it into Knative Serving services, which is the core component we use to deploy our containers, our pods. On the left-hand side we have the pods for the transformer, and on the right-hand side the pods for the predictor service. When the predictor is initializing, it uses what we call a storage initializer, which downloads from model storage the model file that was saved after the training process. The model server then loads this artifact into CPU memory and transfers it into GPU memory to do the matrix multiplication. At the top we see the AI application that sends the inference request to the transformer service. The transformer, in turn, transforms it into a NumPy array of integers or floats to invoke the predictor. That's the flow: the AI application calls the transformer service, the predictor gives the prediction back to the transformer, and the transformer post-processes it, adding some extra things to the response.

Now let's look at the model storage pattern. We have the storage initializer on the left: if we have a directory structure in our cloud storage, we can save different versions of models, or different types of models, in that structure, and the storage initializer loads those files into the pod's storage. Once the files are in the pod's storage, the other container in the pod, the predictor, loads them into CPU memory and then GPU memory. The same applies to the model puller container; this is the pattern used in ModelMesh, which we'll see later.

There are cases where customers, companies, or users have a lot of models, and it doesn't make sense to use one model server per model. Some model files can be one megabyte in size; why should I dedicate a whole pod to a one-megabyte model? So we have the concept of the multi-model server, where we can host many models in the same model server. Think of use cases where data scientists build models per user, or per customer, if that's the pattern. And if we take the multi-model server and put it on steroids, we have ModelMesh, which abstracts away the serving infrastructure: many model servers run in parallel and receive models of different versions, different architectures, and different formats to host. In the same way we use a service mesh, we can think of ModelMesh as advanced multi-model serving.
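As a rough sketch of how that looks in practice, an InferenceService can opt into ModelMesh through a deployment-mode annotation, so the model is placed onto a shared multi-model runtime instead of getting its own pods. The model name and bucket below are hypothetical.

```yaml
# Same InferenceService shape as before, but asking for ModelMesh: the model is
# registered onto a shared multi-model serving runtime rather than its own pods.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: example-mnist-svm
  annotations:
    serving.kserve.io/deploymentMode: ModelMesh
spec:
  predictor:
    model:
      modelFormat:
        name: sklearn
      storageUri: s3://example-bucket/models/mnist-svm       # hypothetical location
```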
Last year we were excited to launch the inference graph. It's very popular among data scientists who don't want just one model to do the inference, but a series of models performing a sequence of tasks. As an example, say we have a classifier for animals. The first model checks whether the image it receives is a cat or a dog. Then, if the prediction is that it's a dog, maybe we have a downstream model that classifies which dog breed it is. So conditionally, based on the output of the first model, we can route the traffic to the next, more specialized model to do its classification. That's the inference graph. It's very helpful; think of it like a DAG, where you define the routes or the conditions it can navigate, in a sequence or in an ensemble pattern. There's a sketch of such a graph below.

KServe plays nicely with the rest of the CNCF projects. We already mentioned Knative, which is the core project we rely on, but we also use Istio for the ingress gateway, and many other projects that help us with security, tracing, logging, metrics, and fast transport of the inference requests. KServe is not part of CNCF; it's part of LF AI & Data. But we would like to see KServe become part of CNCF, because it plays nicely with the rest of that ecosystem.

Let's see how we use Knative. After the inference request comes in to the ingress gateway, the gateway routes the traffic to the Knative service. The Knative service, in turn, sends the request to the queue-proxy, the sidecar running in the pod next to the Envoy sidecar and the transformer, which does the prediction and gives back the response. There are cases where you have many models, in different versions or different formats, that you'd like to try, and you might want serverless capabilities for model serving. Knative empowers us with such capabilities because it supports scaling down to zero, which is not something you get out of the box from Kubernetes with the Horizontal Pod Autoscaler. The Knative autoscaler is a powerful tool that we love using, because we can have our models deployed in the cluster but turned off. When a request comes in, Knative will receive it and hold it, instantiate the model version that needs to serve this particular request, send the request through the transformer to the predictor, and give back the response. This is a powerful, very important capability when you are developing and testing different versions, or when you want to do A/B testing between the next version and the previous versions you had.

The service mesh capability we inherit from Istio is also powerful. If you are using it, you have probably seen the benefits: with the Envoy sidecar deployed in the pods of our transformers and predictors, we can observe the traffic with the monitoring capabilities Istio gives us out of the box, with Jaeger for example, to see how the inference request traverses the transformer and the predictor and comes back.
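Here is a hedged sketch of the cat-or-dog pipeline described earlier, expressed as a KServe InferenceGraph. The two service names are hypothetical, and the condition expression is only illustrative of the gjson-style conditions the resource accepts.

```yaml
# Sequence graph: the second step only runs when the first step predicted "dog",
# and it receives the original request forwarded to it.
apiVersion: serving.kserve.io/v1alpha1
kind: InferenceGraph
metadata:
  name: dog-breed-pipeline
spec:
  nodes:
    root:
      routerType: Sequence
      steps:
        - name: cat_dog_classifier
          serviceName: cat-dog-classifier                    # first InferenceService
        - name: dog_breed_classifier
          serviceName: dog-breed-classifier                  # specialized downstream model
          data: $request                                     # forward the original request
          condition: "[@this].#(predictions.0==\"dog\")"     # illustrative condition syntax
```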
This is powerful for troubleshooting, and not only for tracing: also for logging, with the ability to send the requests that reach the Envoy proxy to a third-party logging provider, and for metrics, where Istio gives us Prometheus metrics out of the box that can be stored in another backend. The Istio ingress gateway also handles the authentication mechanism: when the JWT token comes from the SSO, or whatever mechanism provides authentication and authorization, it routes the request to the appropriate services, and Envoy also makes sure that the communication between the transformer and the predictor is authenticated with mutual TLS. Going further on security, with the use of SPIFFE and its implementation SPIRE, we have a secure identity framework: the identity provider issues the token required by an application, which might run on a VM, to perform the inference request, and through the attestation process the credentials are issued for that particular application.

Finally, on the observability stack, we use all the tools already mentioned with Istio and Knative, which give us tracing with OpenTelemetry, and we use CloudEvents so we have a cloud-agnostic way to store the logs and transfer them as events, maybe to a bus or a broker, where they can be further consumed by downstream components like an explainer, a bias detector, or some simple model monitoring technique. CloudEvents is the core component here that helps us transfer these events: the inference requests and especially the responses, which are the new treasure you are generating by serving models. We want to make sure we store them properly and also do more with them, like detecting class imbalance: do I need to retrain my model because I see that over time one class gets more predictions than the other, et cetera. With Jaeger and Grafana, of course, these tools give us out of the box the capability to observe the information stored through these mechanisms. Buildpacks also help: they let developers forget about the Dockerfile. They don't have to build it, and the security departments of organizations can apply their policies, letting developers focus on their actual work, writing the Python code for the transformer, rather than having to define the libraries and requirements for the container.

These are the things that are already available; now we need to start looking at the future. The first thing to share with you is the roadmap toward version 1.0: we want to graduate all the capabilities we currently have into a 1.0 release. This is the goal for this year for the project community, and hopefully it will be realized at some point.

Now, let's take a step back. In 2019, when we launched KServe, we had the BERT model. You can see in this history how the number of parameters in models has grown on a logarithmic scale. We have more and more large models, and KServe, when it was built, was there to host models of, say, 300 million parameters, which is roughly a one-gigabyte file. Fair enough: you can load that into CPU memory, you can load it onto the GPUs. But now we have a trillion parameters with GPT-4. How do we even host that on a server, when it requires 400 gigabytes of parameter values?
How do you serve it, and how do you make sure it's fast enough? We believe the future of the project is to support these large language models, so let's discuss some of the capabilities, some of the challenges, and some of the tools we have to help you get there.

First of all, with hundreds of billions of parameters we have a very large model size, and that means large transfer costs and a long time to complete the transfer, but also the challenge of how to break the model down across different GPUs. NVIDIA FasterTransformer gives us a highly optimized transformer engine and supports many GPUs and nodes in a distributed fashion. Hugging Face Accelerate is a mechanism that splits the file, say the 400-gigabyte file of a GPT-style model, into smaller chunks that can be loaded onto the GPUs and computed in a distributed fashion. Tensor parallelism allows us to do that across multiple GPUs in the same node, and pipeline parallelism allows us to do it across multiple GPUs on multiple nodes. This way we can stack many nodes one after the other, and their combined memory holds the weights in the chunks we mentioned.

Some of the challenges we see in deploying such models are the following. Since the cost of the transfer is large, we need to make sure we are doing some form of caching. We also need to take latency into account: how do we make sure that an inference request doing the matrix multiplication across all GPUs of all nodes is fast enough that the user doesn't wait? When you use GPT, you don't wait long for the response; it comes back within a couple of seconds. An example model file is the BLOOM model, a large language model available through Hugging Face. It's almost 400 gigabytes in size and has 176 billion parameters. With the help of Accelerate, this file can be split into 72 chunks of about five gigabytes each, and then the normal GPUs we have in our normal nodes can load it, with the distribution happening on the inference request. Other challenges are the transfer cost, or the transfer time, from the cloud provider storage you have, like S3, to the pod of the container running on the node; that takes time, and we have seen implementations and contributions where people cache this model on NVMe storage, for example. And then the Triton image with the FasterTransformer backend is itself 32 gigabytes in size. That's another challenge: how do you even pull such a container? We normally struggle to keep containers down to a few megabytes, not 32 gigabytes.

So maybe a demo here, if we have time? Okay. This is an inference request to send; if it's working, let's try. We have an InferenceService here with a predictor in the spec that defines the model format as Triton, and the storage URI where we have saved our GPT model. The transformer defines the path to the tokenizer that will transform the text into integers, the tokens that will be the input of the model. This is a custom container, and you can find the code in the samples directory of KServe. If I send the inference request... I already got the response back.
The inference request is a POST to an endpoint behind the Istio ingress. The input says "Kubernetes is the best platform to serve your models because", and I ask for 15 tokens, roughly 15 words, back as the answer. The model does its reasoning and tells me it's because it's a cloud-based service that is built, et cetera, et cetera. Of course, we can also look at the log files on the pod if we want, to see that the request has been received. These are the pods running with the 32-gigabyte image, and if we look at the log of the kserve-container, for example, we can see the challenge of loading the model file: it takes a matter of a few minutes, from :31 to :33 last night. Let me go back to the slides.

On these challenges, I don't know if this is visible, but I have a sample here of a three-gigabyte file downloaded from Hugging Face. Through the use of FasterTransformer we split this file into smaller chunks, and each dense layer is saved in a separate 16-megabyte file that is placed on the appropriate GPU-specific shard by our model server. Some other numbers: the storage initializer takes, say, one minute to load a two-gigabyte file, and the predictor takes about 10 seconds to copy the model file from the pod storage into CPU and GPU memory. The sharding technique is visualized like that. So Dan, would you like to continue?

Now, a use case for serving a GPT-style model. As some of you know, Bloomberg recently published BloombergGPT, a purpose-built, 50-billion-parameter large language model trained from scratch for the finance domain. It outperforms general models on financial tasks without sacrificing accuracy on general tasks. The model was trained for approximately 53 days on 64 servers, each with eight A100 GPUs, on Amazon SageMaker. However, the model is served internally on-prem, because of data privacy concerns as well as cost. The model is deployed on NVIDIA Triton with the FasterTransformer backend, pretty similar to what Theo just demoed. The model is served on two A100 GPUs for each replica, each GPU with 80 GB of memory. The model is converted to FasterTransformer format from the Hugging Face checkpoints, and we set the target precision to BF16, which gives us better accuracy. Tensor parallelism is set to two and pipeline parallelism to one, which means the model is sharded across the two A100 GPUs. You can check out more details about BloombergGPT in the paper linked here.

This is what the deployment looks like; I think Theo already showed similar YAML. We have the pre-processing with the tokenizer, which takes the text input and converts it to input tokens, sends them over gRPC to the Triton FasterTransformer predictor, and gets the generated tokens back. Another thing we are working on is streaming the tokens back, because it can take a long time to generate a large number of tokens, so it's more interactive for the user to receive tokens while the inference is still running. As you can see, the model is sharded across the two A100 GPUs; a rough sketch of such a spec follows below.
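For orientation, here is a rough, hypothetical sketch of what an InferenceService along these lines could look like, with a custom tokenizer transformer in front of a Triton FasterTransformer predictor on two GPUs. The images, arguments, storage URI, resource values, and even the model format name are assumptions for illustration, not the actual demo or Bloomberg spec.

```yaml
# Hypothetical LLM InferenceService: a custom tokenizer transformer in front of
# a Triton FasterTransformer predictor holding the model sharded across 2 GPUs.
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: gpt-fastertransformer
spec:
  transformer:
    containers:
      - name: kserve-container
        image: example.registry/llm-tokenizer:latest         # hypothetical tokenizer image
        args:
          - --tokenizer_path=/mnt/models/tokenizer           # text -> token IDs
          - --protocol=grpc-v2                               # call the predictor over gRPC
  predictor:
    model:
      modelFormat:
        name: triton                                         # Triton + FasterTransformer backend
      storageUri: s3://example-bucket/models/gpt-ft          # sharded FasterTransformer checkpoint
      resources:
        limits:
          nvidia.com/gpu: "2"                                # e.g. tensor parallelism of 2
          memory: 64Gi
```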
You can see some of the metrics we collected from the model: we use about 65% of the GPU memory, and the GPU utilization can sometimes hit 100% on the A100s. The latency is around a few seconds on average.

So yeah, that concludes our talk. We would really love to have new contributors join the project and pave the way to the 1.0 roadmap. Here are the links, and if you have any questions, feel free to reach out to the community, to Theo, or to me. I'll open it up for questions now. Thanks, everyone.

I have a question about ModelMesh. We're currently trying to use ModelMesh, and we've struggled with scaling the runtime, because as far as I know there is currently no autoscaling.

I believe the ModelMesh maintainers are currently working on horizontal scaling. You're talking about scaling out the same runtime, right? Yeah, they are currently working on that, so check it out; I believe there's a PR already up.

We looked at scaling it ourselves based on metrics; maybe it's a good way, maybe it's not, but that's what we came up with. Another question: we're using Triton, so we have the Triton adapter that downloads from S3, and currently we can't use IRSA with it. I know the storage initializer only recently started supporting IRSA for downloading from S3. When it comes to ModelMesh and the Triton adapter, is there a roadmap for that? It really makes me fight with my security guys, because I'm currently using an IAM user, which is bad for me.

I see. Okay, maybe we can catch up a little bit after.

Sorry, just to add: I think I saw a comment in the issues that currently there is no option and that it will come on the roadmap; maybe you know something that I don't.

Yeah, we can catch up later. Thank you. Is there any other question? Yes.

Thank you for this presentation. My question is: why don't you support a simple model format like CatBoost in the table you presented?

Yeah, another flavor of boosting model, right? It's pretty straightforward to implement that; there's probably an open issue for it already. Feel free to contribute it if you think it's widely used. You can always add an additional serving runtime on top of the pre-shipped serving runtimes; it's pretty flexible, you can bring your own runtimes. So if you think it's useful for the community, feel free to contribute it.

I have a second question: does multi-model serving support confidential containers? Can you do confidential containers with multi-model serving?

Sorry, you're asking about ModelMesh?

Does it support confidential containers? It's a type of container.

Sorry, I didn't get the question. Okay, maybe we can talk after.

By the way, MLServer from Seldon supports CatBoost, and you can load it in KServe; there's a PR there you can look at.

Right, if MLServer already supports it, then we should be able to run it too. Is there any other question? All right.
Thank you for coming here today. Thanks, Eva.