Hi, everyone. Good afternoon. Thank you very much for being here and being interested in KServe. I know many of my team members are watching this session virtually, and I want to say thanks to my team members and to the contributors to KServe. My name is Ujre. I am the team lead of the Data Science Runtime team at Bloomberg. Our team provides a data science platform with the functionality Bloomberg's internal users need across the machine learning model development life cycle: data exploration using Jupyter Notebooks, model training using popular frameworks such as TensorFlow, PyTorch, scikit-learn, and so on, experiment management, and online inference.

Today I'm going to talk about the open source project we initiated together with multiple collaborators, named KServe. We recently ran into a new problem: we want to deploy many models at scale, and I want to talk about how we addressed that scalability problem.

First, a little background on KServe. KServe was previously known as KFServing; if you are already familiar with KFServing, or already deploy KFServing on your platform, KServe is the same project. It recently moved to an independent organization and now enjoys more autonomy. KServe has passed several important milestones. In September 2019, we released the first version as a subproject of Kubeflow, under the name KFServing. A few months later, we introduced it at KubeCon North America, and we spent the next year and a half developing the project. We released its stable v1beta1 API version in 2021, and finally, last month, we renamed the project to KServe, a sign that it has reached the next level of maturity. Here are links to two important announcements and blog posts: you are very welcome to follow them to read the announcement we made together with the Kubeflow community, as well as the post on Tech at Bloomberg about our journey of building a production-grade machine learning model serving solution. We wouldn't have been able to achieve all this without all the awesome contributors here; I want to spend a few seconds acknowledging everyone who has contributed and collaborated with us.

For those of you who are less familiar with inference, this diagram shows a typical, simplified model development life cycle. Our users are normally data scientists and machine learning engineers. They first prepare data and use a machine learning framework to train a model. Once the model is trained, they want to deploy it into a production system where it can answer real-time questions, such as: given two sentences, can my model tell me how similar they are? Or, given a news article, can the model run inference and tell us which topics relate to it?

So how hard can it be? It turns out building a solution like this is very, very difficult. First, we have to think about the cost of deploying such a machine learning model: how much CPU, memory, and GPU resource is needed, and when no requests are coming into the service, is there a way to automatically scale it up or down? We also want to monitor our machine learning services: how do we do readiness checks and liveness checks? We also want to expose Prometheus metrics that we can use to build dashboards and set alerts on. We also care about safe rollouts: when we build a new version of a service and it breaks the production system, how can we automatically detect that and stop the rollout? Can we canary a new version of a model, compare the results, and swap traffic between the two versions? We also want to define an inference protocol: there are many different model servers in the open source world, so how can we give users one consistent protocol layer for sending requests over gRPC, HTTP, and Kafka? Not all users want to run inference with real-time requests; some only want to run end-of-day batch inference, so how can we support them? There are also many different machine learning training frameworks, so on the serving side we want to be able to serve models trained by all of them. And once a model is running inference, the next question is: how can I explain my predictions? So we also integrate an explainer component into KServe.

So here comes KServe. KServe is a highly scalable and standards-based model inference platform on Kubernetes for trusted AI. At the lowest level is our compute resource: in a Kubernetes cluster we normally have a collection of CPUs, GPUs, and memory, sometimes even TPUs. On top of the compute resources we run a Kubernetes layer, which orchestrates and manages them. On top of Kubernetes is our serverless layer, built with Knative and Istio. With this layer we can automatically scale up and down according to incoming traffic, and even scale the number of pods to zero when no requests are coming in, releasing compute resources when they're not being used. At the top is the machine learning integration layer: we integrate with multiple popular model servers in the industry, so KServe can serve models trained by various machine learning frameworks.

This diagram explains the major components of the KServe solution. If you're familiar with Kubernetes, you know that most resources are described by YAML files, and KServe is the same. We define a custom resource called InferenceService, in which a KServe user describes what kind of machine learning model they want to deploy. The request is handled by the Kubernetes API server, stored in etcd, and eventually reconciled by the InferenceService controller, which is the main controller of KServe. After the controller reconciles the incoming resource, it creates the underlying components. One of the most important components is the predictor, which is essentially the model server we run; with the predictor in place, the model server can handle incoming requests and return inference results (a minimal example follows below).
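To make this concrete, here is a minimal sketch of what creating an InferenceService can look like with the official Kubernetes Python client. The names, namespace, and storage URI are hypothetical, and the API group shown is the one used as of KServe 0.7; normally a user would just write the equivalent YAML and apply it with kubectl.

```python
# Minimal sketch: create an InferenceService via the Kubernetes Python client.
# The resource dict mirrors the YAML a user would normally write.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() inside a pod

inference_service = {
    "apiVersion": "serving.kserve.io/v1beta1",  # API group as of KServe 0.7
    "kind": "InferenceService",
    "metadata": {"name": "sklearn-iris", "namespace": "my-team"},  # hypothetical
    "spec": {
        "predictor": {
            "sklearn": {
                # Hypothetical model location; could be s3://, gs://, or HTTP.
                "storageUri": "gs://my-bucket/models/sklearn/iris",
            }
        }
    },
}

client.CustomObjectsApi().create_namespaced_custom_object(
    group="serving.kserve.io",
    version="v1beta1",
    namespace="my-team",
    plural="inferenceservices",
    body=inference_service,
)
```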
The second important component is the transformer. Sometimes users want to implement customized pre-processing and post-processing steps: translating a data point into a format the predictor can understand, and then post-processing the result back into a format the calling application can understand. That's how customized implementations are integrated into KServe; a minimal sketch follows below.
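As a rough illustration, here is a minimal transformer sketch using the KServe Python SDK. The class names are those of the KServe 0.7-era SDK, and the model name and feature-extraction logic are hypothetical placeholders, not a definitive implementation.

```python
# Minimal transformer sketch (hypothetical model name and featurization).
# The base Model forwards predict() calls to the predictor host, so we only
# override preprocess and postprocess here.
from kserve import Model, ModelServer


class MyTransformer(Model):
    def __init__(self, name: str, predictor_host: str):
        super().__init__(name)
        self.predictor_host = predictor_host  # predictor service to call
        self.ready = True

    def preprocess(self, inputs: dict) -> dict:
        # Translate raw application payloads (here: text) into the numeric
        # format the predictor understands. Stand-in featurization only.
        instances = [[float(len(w)) for w in text.split()]
                     for text in inputs["instances"]]
        return {"instances": instances}

    def postprocess(self, outputs: dict) -> dict:
        # Map raw predictions back to labels the application understands.
        labels = ["negative", "positive"]  # hypothetical label set
        return {"predictions": [labels[int(p)] for p in outputs["predictions"]]}


if __name__ == "__main__":
    model = MyTransformer("my-model", predictor_host="my-model-predictor")
    ModelServer().start([model])
```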
We also have an explainer component, which uses the Alibi explainer to explain why an inference result was produced.

A very critical part of KServe is that we define a standard inference protocol. For this standard, we worked very closely with multiple model server communities, including Triton, TorchServe, and MLServer, to make sure the KServe community could come up with a consistent set of inference protocols that provide a unified user experience. This is the set of HTTP endpoints we have defined. As you can see, there are standard endpoints for a model server's liveness check and readiness check, and for checking whether a model is ready to take incoming requests. We can also use the protocol to fetch a server's metadata and a model's metadata, and of course, most importantly, to run inference. Similarly, there is a set of gRPC methods for checking health state, server metadata, model metadata, and inference. With the standard protocol, we can easily integrate with multiple model servers, and clients can send requests consistently.
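To show what the HTTP side of this protocol looks like from a client, here is a small sketch using Python's requests library. The host and model name are hypothetical; the paths and request body shape follow the v2 protocol described above.

```python
# Exercising the v2 inference protocol endpoints (hypothetical host/model).
import requests

BASE = "http://my-model.my-team.example.com"  # hypothetical ingress host
MODEL = "my-model"

# Health: server liveness, server readiness, and per-model readiness.
requests.get(f"{BASE}/v2/health/live").raise_for_status()
requests.get(f"{BASE}/v2/health/ready").raise_for_status()
requests.get(f"{BASE}/v2/models/{MODEL}/ready").raise_for_status()

# Metadata: server-level and model-level.
print(requests.get(f"{BASE}/v2").json())
print(requests.get(f"{BASE}/v2/models/{MODEL}").json())

# Inference: inputs are named, shaped, typed tensors.
payload = {
    "inputs": [
        {
            "name": "input-0",
            "shape": [1, 4],
            "datatype": "FP32",
            "data": [[6.8, 2.8, 4.8, 1.4]],
        }
    ]
}
resp = requests.post(f"{BASE}/v2/models/{MODEL}/infer", json=payload)
print(resp.json()["outputs"])
```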
Now that we have run KServe in our production environment for a while, we have started to run into a new scalability problem: how do we deploy a large number of models in production? Let's take a look at how the current approach works, that is, how KServe deploys a machine learning model today. The gray box here represents the inference platform cluster. In this cluster, each of our users owns a namespace, represented by the light blue box, and in their own namespace they can run multiple inference services. An inference service fetches a model from external model storage, which can be S3, GCS, or even an HTTP service. Once the model is downloaded into the inference service, it opens up an HTTP endpoint where the user can send request data and get an inference result back.

So what kind of problems does this approach bring when we want to scale up the number of models? If we want to scale up the number of models, we potentially need to scale up the number of services running on our platform, and that doesn't scale very well. I'll talk about the scalability limitations we foresee if each team wants to run hundreds or even thousands of models on our inference platform. These are the limitations we are already aware of; of course, there may be others we haven't run into yet.

First, compute resource limitations. Each inference service comes with a certain amount of resource overhead: a sidecar runs alongside each model server to handle incoming requests and produce Prometheus metrics, and it can also do batching and logging. Those sidecars need a certain amount of CPU and memory to run alongside the model server. The resources the sidecar requires are configurable, but let's say, for example, each sidecar takes 0.5 CPU and 0.5 GB of memory of overhead. Based on this configuration, if we deploy 10 models with two replicas each, then each model's resource overhead is around 1 CPU and 1 GB of memory (two replicas × 0.5 CPU and 0.5 GB each). But if we can figure out a way to load all 10 models into one inference service, then on average each model's overhead is about 0.1 CPU and 0.1 GB. That's a lot of resource reduction.

The second limitation we foresee is the maximum pod limitation. Some of you may be familiar with the Kubernetes default setting: by default, each node can run 110 pods, and based on the official Kubernetes scalability best practices, we shouldn't really run more than 100 pods per node. Based on this limit, a 50-node cluster gives us at most 50 × 100 = 5,000 pods, so we can deploy roughly 1,000 to 4,000 models depending on the number of replicas we configure per model.

The third limitation is the maximum number of IP addresses. As many of you know, each pod has an independent IP address in a Kubernetes cluster. IP addresses are assigned to every model replica; if you run transformers and explainers in KServe, they need IP addresses too, and there are also the basic Kubernetes control plane pods running in the cluster. The number of IP addresses available in a Kubernetes cluster varies a lot depending on how the admin manages it, but in one of the testing clusters where we ran a test, we had several thousand IP addresses available, and based on that limit we could only run several thousand models.

So to solve this scalability problem, we worked very closely with our collaborators from IBM and came up with a solution called ModelMesh. This is the diagram of the ModelMesh solution; let me walk you through it from top to bottom. At the top is the machine learning application, which sends inference requests into ModelMesh. One ModelMesh deployment can contain multiple serving runtimes, where each serving runtime is essentially a different type of model server. In this diagram there are two serving runtimes available; different serving runtimes provide serving solutions for different types of machine learning models. One critical component in this diagram is the mesh and puller sidecars that run alongside each model server: the light green and pink boxes represent different model servers, and you can see the sidecars running alongside them. The sidecars decide when and where to load and unload models based on usage and current request volume. If a particular model is under heavy load, it will be scaled across more pods. The little circles inside the serving runtimes represent different models: if we look at model B, in the blue circles, it has been scaled to two copies, so compared to model A, in the green circle, it can handle more inference requests. The mesh sidecar also acts as a router: ModelMesh stores a model-to-pod routing table in an external etcd, so when an inference request comes in, the sidecar looks up the routing table, figures out which model is loaded into which pod, and routes the request to the correct pod (a toy sketch of this lookup follows below).

You may now be curious what kind of serving runtimes ModelMesh provides. Out of the box, we integrate with the Triton Inference Server, developed by NVIDIA, which can serve models trained with frameworks such as TensorFlow and PyTorch, as well as TensorRT and ONNX models. We also integrate by default with Seldon's MLServer, a Python-based server that can serve frameworks such as scikit-learn, XGBoost, and LightGBM.
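This is not ModelMesh's actual code, but a toy Python sketch of the routing idea described above: the mesh sidecar consults a model-to-pod table (kept in etcd in the real system) and either forwards to a pod that already has the model resident or triggers a load first. All names here are hypothetical.

```python
# Toy illustration of the mesh sidecar's routing decision (hypothetical).
import random

# model id -> pods where the model is currently loaded; in the real system
# this table lives in an external etcd and is maintained by ModelMesh.
routing_table = {
    "model-a": ["runtime-pod-0"],
    "model-b": ["runtime-pod-0", "runtime-pod-1"],  # hot model, two copies
}


def choose_pod_for_load(model_id: str) -> str:
    # Placeholder for the real placement logic (cache capacity, LRU, etc.).
    return "runtime-pod-0"


def route(model_id: str) -> str:
    pods = routing_table.get(model_id)
    if not pods:
        # Model not resident anywhere: place it on a pod, whose puller
        # sidecar would fetch it from model storage before serving.
        pod = choose_pod_for_load(model_id)
        routing_table[model_id] = [pod]
        return pod
    # Spread requests across the pods that already hold the model.
    return random.choice(pods)


print(route("model-b"))  # e.g. "runtime-pod-1"
```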
A lot of you may also be very curious about the performance ModelMesh can provide: if we collocate multiple models in the same pod, will that impact the latency or throughput of the solution? So we did a performance test.

This performance test was done on a single-node cluster with 8 CPUs and 64 GB of RAM, and we deployed a very, very simple model, only around 700 bytes in size. With the traditional one-model-per-container deployment approach, we are limited by CPU and can deploy approximately 40 models in this testing cluster. If we deploy onto larger nodes, we are eventually limited by IP addresses instead. But if we move to ModelMesh deployment, we are able to pack around 20,000 models into this testing cluster, at which point we essentially run into the memory limit.

In addition to the density test, we also did a latency test, running two Triton serving runtimes and gradually increasing the QPS from 25 per second all the way to 2,800 per second; each performance test ran for 60 seconds at each QPS level. We also gradually increased the model density from 1,000 all the way to 20,000. As we increase the density of the models, we notice a slight increase in latency, but for single-digit-millisecond-latency inference, one worker node can support about 20,000 models at up to 1,000 QPS. I'd also like to point out that this performance test was done on CPU nodes; when we run inference on GPUs, performance can increase dramatically.

ModelMesh has already been released as part of KServe 0.7, so you are all welcome to check it out and try it. There is still a lot of work we want to do on ModelMesh, and here's the roadmap we have in mind. In Q4 2021, we want better integration between the InferenceService and ServingRuntime resources. Currently, in order to use ModelMesh, each user namespace needs its own ModelMesh controller, so we'd like to enhance the controller to support multiple namespaces. ModelMesh currently only supports downloading models from S3 storage, so we want to expand the supported storage options, including GCS and HTTP services. In Q1 next year, we want to work on inference graphs, and we'll also start extending ModelMesh to support transformers, so users can plug in customized pre-processing and post-processing implementations. We also want ModelMesh to support canary rollouts, and eventually to consolidate the ModelMesh controller with the main KServe controller. You are all welcome to contact us by visiting our website, checking out our GitHub, Slacking us, or just talking to me after this talk. Now I'm open to questions.

I have two questions, actually. You mentioned several types of constraints, like CPU and memory constraints, for CPU-only inference. How does this picture change when you start using GPU-assisted inference? I would imagine there would be another set of constraints on top of the ones you mentioned already.

Can you repeat the second part?

How does the picture change when you start using GPU-assisted inference, in terms of the constraints in the system?

I'm sorry, could you come closer?

How does the picture change when you start using GPU-assisted inference? What kind of constraints, problems, or bottlenecks do you start seeing?

Thank you very much for the question. So the question is: I mentioned that there is a compute resource overhead that comes with each inference service, and the question is, if we start to use GPUs, what changes about that overhead? Nothing changes. For example, if we look at the diagram, model one will be deployed into a model server.
The model server is the container that requires CPU or GPU for inference; that's one independent container running inside the inference service. Alongside that model server there's another container, the sidecar, which requires extra CPU and memory. So changing the model server from CPU to GPU doesn't change the sidecar's compute resource requirements.

Yeah, but I mean, if you start running inference, then the GPU is going to be a bottleneck, right, depending on its load. So you're effectively introducing another level of possible bottlenecks in the system. I was curious whether you saw those bottlenecks in inference and how you address them.

Do you mean when we start to use GPUs?

Yeah, when you start using GPUs.

So do you mean using GPU resources as a way to address that overhead? That's a very good question, and it may lead to some discussion about slicing GPUs. Currently, if we want to request GPU resources in Kubernetes, there's no straightforward or easy way to request a slice of a GPU. In a managed Kubernetes context, it's easy to imagine: I have a GPU allocated to this inference service, can I request a slice of it and use just that for the sidecar? The answer is that there's no easy way to do that: when we request a GPU, we request a full GPU (see the sketch below). Actually, in real use cases, we often notice that even though a model server requests a full GPU, inference doesn't really use the full capacity of the GPU, and sometimes not even the full capacity of its memory. If we look at a typical GPU, most come with 32 GB of memory. Some models are really small, at the megabyte level; larger ones, like a BERT model, may be at the gigabyte level, but that still only consumes a small fraction of the full GPU memory. In terms of computing power, utilization is often even lower: the amount of computing power you consume depends heavily on the volume of requests you send to the model server. When you have a lot of concurrent requests going to the model server that has the GPU allocated, you'll see GPU consumption go up; but while only a small volume of traffic is coming into the GPU-powered model server, it's very obvious that the consumption of computing power drops a lot. It also really depends on which model server is requesting the GPU and how optimized its GPU usage is. That's why we work very closely with Triton: Triton is developed by NVIDIA, and they have a lot of GPU optimizations in the Triton model server.
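To make that concrete, here is a sketch of the resources stanza a predictor might carry, in the same Python-dict form as the earlier example; the model location is hypothetical. Kubernetes only accepts whole integers for the nvidia.com/gpu extended resource, which is exactly why there is no easy way to carve off a slice for the sidecar.

```python
# Predictor spec fragment (hypothetical storage URI) requesting one full GPU.
# Extended resources like nvidia.com/gpu must be whole integers, so a
# fractional request such as "0.5" would be rejected by the API server.
predictor_spec = {
    "triton": {
        "storageUri": "gs://my-bucket/models/bert",  # hypothetical
        "resources": {
            "requests": {"nvidia.com/gpu": "1"},
            "limits": {"nvidia.com/gpu": "1"},
        },
    }
}
```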
All right, thank you.

One more question, and then we'll have to take our break. As I walk over, I'll just remind everybody that after this we have a break until 2:55 Pacific Time, and then the sessions will start back up again.

In this diagram you're talking about single-model serving versus multi-model serving. In multi-model serving, are we saying that multiple models run in a single Kubernetes pod? Is that a correct understanding?

Yes, that's the correct understanding. What we're trying to do is collocate multiple models that can be served by the same model serving runtime.

But isn't that a little bit against the Kubernetes model? The reason I'm saying that is, for one thing, all of these models compete for resources, because you don't have resource isolation, so there is no quality of service: if one model takes more compute resources, that impacts the rest of the models. Second, if this pod or this node goes down, all the models go down together.

Yeah, that's a very good question. Each model can take up a different amount of computing resources. Actually, we had a long discussion about this internally in our working group. Let me go back to this diagram: one of the original ideas for the solution was that each model would have the same number of replicas, simply spread across all the replicas belonging to a model server. Then we started to realize: say we have models A, B, and C in the same pod, and model A takes most of the requests while B and C only take about one request per hour. Because model A needs to handle really high volume, it will drive the serving runtime up to many replicas, like 10 or 20, and models B and C will be forced to scale up along with it. We spent a lot of time discussing that idea, and that's why we moved to the ModelMesh design, where we can scale different models differently. Take the diagram again: model B has more copies, so collectively model B owns more computing resources in this serving runtime, while model A has only one copy and collectively owns less. That's why we designed the solution so that each model can have a different number of copies across all the pods belonging to one serving runtime.

Okay, thank you very much, and we will meet back here again at 2:55. Thank you.