Hello, everyone. Thanks for joining my session. My name is Dan. I lead the Bloomberg data science serverless platform, and I'm also a working group lead for KServe. I have been working with the Knative community since 2019, and I worked with a lot of the co-developers here to build KServe, which is the other open source project we are talking about today. So today my topic is to discuss how we built an ML inference platform on top of Knative, and I want to discuss what works well in Knative and what features we are looking for to push Knative to the next level. Unfortunately, my co-speaker cannot make it today, and credit goes to my colleagues for all the slides and animations.

So first, what is an ML inference platform? An ML inference platform provides a standard, managed inference service that helps unify model deployment across multiple ML frameworks, and it simplifies model serving and monitoring at scale in a production cloud environment. An inference service is a service that generates predictions from a trained model in response to inference requests, which can be a single request or a batch of requests.

Besides the common infrastructure challenges of deploying cloud microservices, we also have unique challenges when deploying ML inference services. First of all, we want to autoscale inference workloads on both CPU and GPU. As you know, the default Kubernetes HPA does not support autoscaling based on GPU. To support that, you need to implement HPA with custom metrics over a set of GPU metrics like duty cycle, power consumption, and GPU memory, and it can be really hard to reason about how those metrics drive the autoscaling decisions. So we are really looking for a solution that autoscales the same way on both CPU and GPU devices.

Secondly, when we do a model rollout, we want to employ a safe rollout strategy with deeper validations in addition to readiness probes. This is crucial for enabling continuous deployment of model updates without a human in the loop. In addition to request/response style services, there are also use cases where we want to perform inference based on an event source such as S3 or Kafka, and we need to forward requests to a number of downstream analytics components that monitor the models to ensure they produce reliable predictions. And last but not least, we also have use cases where the inference platform needs to chain multiple inference services together to get back a single response, or to combine the outputs from multiple services for model ensemble use cases.

So why did we decide to build the inference platform on top of Knative? Knative gives us very nice serving abstractions for service networking and routing, and request-driven autoscaling works pretty nicely on both CPU and GPU devices. It supports both scaling down to zero and back up from zero. Knative also implements immutable revision tracking, which allows traffic to be split among multiple revisions for blue-green and canary rollouts. Other nice features it provides are out-of-the-box distributed tracing, metrics, and load balancing: it can make smart load-balancing decisions based on the concurrency of each pod.
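For reference, here is a rough sketch of the Knative Serving primitives just mentioned: a Service with concurrency-based autoscaling, scale to zero, and a traffic split across two revisions. The service name, image, and target values are made up for illustration and are not the exact configuration we run.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: sklearn-iris-predictor           # hypothetical service name
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/class: kpa.autoscaling.knative.dev
        autoscaling.knative.dev/target: "1"      # target concurrency per pod
        autoscaling.knative.dev/min-scale: "0"   # allow scale down to zero
    spec:
      containers:
        - image: example.com/models/sklearn-iris:v2   # hypothetical model server image
  traffic:
    - revisionName: sklearn-iris-predictor-00001      # stable revision keeps the live traffic
      percent: 100
    - revisionName: sklearn-iris-predictor-00002      # new revision only gets a tagged test URL
      percent: 0
      tag: latest
```

As described below, KServe generates a Knative Service along these lines under the hood, so users normally do not write this by hand.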
So in order to avoid reinventing the wheel for all these already-solved problems, we decided to build the ML inference platform on top of Knative, so we can focus on solving our unique inference challenges. Here is how KServe works. KServe is an open source project which was founded by companies like Google, IBM, and Bloomberg back in 2019 under the Kubeflow umbrella. It used to be a sub-project, grew tremendously, and is now an independent project under the governance of the LF AI & Data Foundation. In serverless mode it creates a Knative Service to enable serving functionality like autoscaling, canary rollouts, and eventing capabilities. In fact, Knative is installed by default in Kubeflow, and as a result KServe powers a large number of production model deployments today, given Kubeflow's huge user base.

InferenceService is a Kubernetes custom resource we created in KServe, which gives an ML-friendly user interface for describing ML deployments. A lot of the time people just need to specify the model format and the model storage URI, and then they can deploy the inference service with a simple YAML. Under the hood, the InferenceService gets translated into a Knative Service, which runs the out-of-the-box model server implemented in KServe. It downloads the model into the container, and once the model is downloaded it spins up the server to respond to real-time inference requests. You can also choose to use a buildpack or the KServe SDK to build your own custom model server, which works pretty much the same way.

The KServe control plane provides a few core inference components: predictor, transformer, and explainer. The predictor runs as a Knative Service; the main container runs the model server and sits alongside the queue-proxy, which exposes autoscaling metrics and controls concurrency. We also have a model agent, which handles inference-related features like logging the requests and performing batching, and then sends the request on to the model server. The transformer is a component which transforms the raw input request and converts it to the format the model server expects, according to the standardized inference protocol. The explainer sends requests to the predictor and tries to generate human-interpretable explanations for the predictions.

So let's first look at the most important feature Knative provides, which is request-driven autoscaling. The Knative autoscaler scales based on request demand, by collecting concurrency and request-rate metrics from the queue-proxy. It supports scaling both down to and up from zero, which can be really useful when you deploy an inference service on GPU devices, because it saves GPU resources while the service is idle. Cold start is still a problem for ML deployments in production, because the service usually needs to download the model, which can take a few minutes, while the pod itself starts in a few seconds. KServe actually sets the default minimum replicas to 1, and you can choose to set a bigger number in a production environment, so it can scale automatically to handle bursts at peak time.
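As a concrete illustration of the "simple YAML" mentioned above, here is roughly what an InferenceService spec can look like. The model name, storage URI, and replica/concurrency values are placeholders, and exact field names may vary between KServe versions.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris                    # hypothetical model name
spec:
  predictor:
    minReplicas: 1                      # keep one pod warm to avoid cold starts
    maxReplicas: 5
    scaleTarget: 1                      # target concurrency per pod
    model:
      modelFormat:
        name: sklearn                   # model format selects the out-of-the-box server
      storageUri: gs://example-bucket/models/iris   # placeholder model storage URI
```

KServe reconciles a spec like this into a Knative Service similar to the sketch shown earlier, so the autoscaling and revision behavior discussed next applies to it directly.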
So let's take a look at how scaling down to and up from zero works. While the service is idle, the Knative controller points the HTTP route at the Knative activator. Once the activator starts receiving request volume, the autoscaler makes decisions based on the request demand and automatically scales up to the desired number of pods based on the autoscaling metrics. The Knative activator buffers the requests until the pods are ready to serve the live traffic. On the other side, Knative also scales down to zero after a default 30 seconds of idleness. Sometimes, when you do benchmark testing, you may want to avoid the cost of spinning these pods up and down, so you can add additional annotations to keep the pods around a little longer and avoid the cold-start penalty.

The Knative autoscaler makes its autoscaling decisions based on your concurrency target and the observed concurrency metrics. Let's say your target concurrency is 1, which means each pod should only process one request at a time. If your pods are seeing an average of five concurrent requests, then the autoscaler will automatically scale up to five pods to handle the current traffic. Compared to the Kubernetes HPA, the Knative autoscaler supports concurrency and RPS metrics in addition to CPU and memory metrics, so it can autoscale based on your request load. It also supports scaling down to and up from zero, while the Kubernetes HPA can only scale down to one. In terms of metrics scraping, the activator and queue-proxy push metrics to the autoscaler via WebSocket, so it can react faster than the Kubernetes HPA, which has to scrape the metrics from Prometheus. Knative by default calculates the average concurrency over a 60-second stable window, and it also has a 6-second panic window, which lets it react faster when you receive a burst of traffic, while the HPA uses a stable five-minute window, so sometimes it may not be able to handle a large burst of requests.

The next important feature we implemented in KServe is the model rollout. Oftentimes, when you roll out a new model, you need to validate that the model actually performs with the expected accuracy before moving the traffic from the old model to the new model, and we found that a native Kubernetes Deployment is often limited by its inability to stage the traffic. So KServe implements an opinionated blue-green and canary rollout based on the Knative revision implementation. Every time you update the service, it generates a new revision, and KServe automatically tracks the last known good revision by tagging the revision that was rolled out with 100% of the traffic. As for the limitations of the default Kubernetes rolling upgrade: it gives very little control over the speed of the rollout, and you cannot control how traffic flows to the new revision. Readiness probes are often not suitable for validating models or doing deeper tests and stress tests, and you cannot check external metrics to verify the model update. A rolling upgrade can halt the rollout if something goes wrong, but it cannot roll back automatically.

So say I have rolled out a new model, and I want to keep the traffic staged on the stable version while the new version is running, so I can verify and validate the new model. In this case, I can set the canaryTrafficPercent field on the InferenceService YAML to 0. The new version will be spun up, but it will not receive the live traffic.
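To make that concrete, this is roughly what the canary field looks like on the spec. The canaryTrafficPercent field matches KServe's v1beta1 API, but the model name and storage URI here are placeholders.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    canaryTrafficPercent: 0             # new revision gets a tagged URL but no live traffic
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/iris-v2   # updated (canary) model, placeholder URI
```

Bumping canaryTrafficPercent to 100 promotes the canary, and setting it back to 0 after promotion rolls the traffic back to the previous revision, which is the workflow walked through next.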
Here is how it looks. Initially you have model 1, which creates Knative revision 001, and you get an endpoint which processes the live traffic. Now you roll out model 2, which becomes revision 002. Because we set canaryTrafficPercent to 0, it doesn't actually receive the live traffic, but it does get an endpoint tagged as "latest", so you can use the generated "latest" tag URL to test the model. Once you're happy with the new model, you bump canaryTrafficPercent to 100, and the traffic moves from the old model to the new model. After you've rolled out the model, something could still go wrong and you may want to roll back to the previous model. KServe automatically tracks the previously rolled-out revision in the InferenceService status, so it knows which revision it needs to roll back to. The user can simply set canaryTrafficPercent back to 0, and the traffic will automatically roll back to the previous version, which was tagged as "previous".

For Knative, these are the equivalent commands you would need to execute to follow the same process. KServe basically automates some of these steps by tagging the revisions and tracking the last known stable version for the user, but you can use these commands to run exactly the same rollout workflow.

Now I'm going to run a demo that automates all the rollout steps I just showed. I implemented it using an Argo workflow. The first step creates a new model with version 0.0.1. The second step, after the first model reaches the ready status, updates the model storage URI to roll out model version 2. Now you can see that the first model is ready, and the second step is running to update the model, but I want to keep the traffic staged on the old model, so the new model doesn't receive any live traffic. You can see that the new model gets 0% of the traffic and all the live traffic is still processed by the old model. The third step runs the model validation job. Right now it's a simple curl that just verifies the request gets the expected response, but you can also plug in your own jobs to do something more advanced, like running a batch of requests from a golden dataset and verifying the model produces accurate results, or running a stress test to make sure the latency meets your requirements. Once the test job succeeds, we automatically bump the traffic percentage and move the traffic from the old model to the new model. In this way we can implement a continuous deployment workflow without a human in the loop: every time the model is updated, the workflow runs the model validation job, and if it succeeds, it automatically rolls the new model out to production. Now you can see that 100% of the traffic has moved to the new model.

Another requirement for KServe is that, in addition to running inference, we also need to monitor the model to make sure it produces reliable predictions. This requires an event-driven architecture, where you capture the original inference request and forward it to a set of model monitoring components, such as outlier, concept drift, and adversarial detectors.
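As a sketch of how that request capture can be wired up on the InferenceService side, KServe exposes a logger section on the predictor that forwards inference payloads to an event sink; the broker address below is a placeholder.

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
spec:
  predictor:
    logger:
      mode: all                         # forward both requests and responses
      url: http://broker-ingress.knative-eventing.svc.cluster.local/default/default   # placeholder broker address
    model:
      modelFormat:
        name: sklearn
      storageUri: gs://example-bucket/models/iris
```

The monitoring components then subscribe to the broker with triggers that filter on the event attributes they care about, which is the flow described next.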
Knative Eventing provides composable primitives that enable late binding between event producers and consumers, and it uses CloudEvents to standardize how the event data is passed around. KServe has a model agent sidecar which intercepts the request and then forwards it to the Knative broker. The broker stores the events in a durable way, and you can have a set of consumers which subscribe to the broker with filters for the events they are interested in. Here we run a set of model monitoring components that analyze the inference requests, and you can generate alerts if anything is an outlier or the data has drifted.

Another requirement we just talked about is the inference graph, where we want to chain multiple inference services to get back a single response, or combine the outputs from multiple inference services for model ensemble use cases. Knative does provide Sequence and Parallel CRDs, but they are mainly designed for async eventing, and here we want a request/response style. So we decided to implement our own InferenceGraph CRD, which implements a graph orchestrator that chains requests and merges responses from multiple inference services in real time on the request path, to deliver back the final response. In the inference graph you can have different types of nodes: a single service, a switch between multiple services based on conditions, or a split based on weights. You can also run inference services in parallel and then merge the responses at the end. All these different nodes can be chained together, so it's very flexible: you can pretty much compose any arbitrary inference graph with this design. And I'm happy to discuss with the Knative community whether some of this could be made more generic and contributed to Knative upstream.

Yeah, that's all I have today. Both the KServe and Knative communities are great communities, and I think if we combine the two, it will be really powerful. We're looking forward to collaborating more with the Knative community to push Knative to the next level, and hopefully we can get a lot more exciting features there. Happy to take any questions.

Anyone have questions for Dan?

OK. So you talked about the autoscaler. I'm curious if you needed to customize it somehow, or if you are good with the defaults, and if you are by any chance hitting the panic window often.

So the question is whether there is any case where we need to tune the panic window, right? Whether we hit the panic window and whether we customize it.

I'm assuming you're using the concurrency settings.

Yeah, yeah.

But I don't know, because you showed the defaults, the 60 seconds and 6 seconds for the panic window. So I don't know if you are customizing that, if you are fine-tuning it somehow.

We are mostly using the defaults. It's more that, depending on the use case, we need to tune the target concurrency or the container concurrency fields. So yeah, I think the defaults work OK; we just need to tune those concurrency fields for different applications.

Cool. Any more questions? Anyone? OK. Awesome. Thanks, Dan.

Yeah, thanks. Thanks, everyone.