Hello everyone, welcome to today's talk on accelerated and standardized deep learning inference with KFServing. The presenters today are Dan Sun from Bloomberg and David Goodwin from NVIDIA. We have two agenda items today: in the first part we will talk about accelerated deep learning inference with KFServing, and in the second part we will talk about the KFServing V2 inference protocol.

Deploying ML models at scale is one of the biggest challenges for companies looking to create value through AI. As models become more complex and deeper, this task becomes even harder. As a data scientist or ML engineer, I want to serve standard deep learning models like TensorFlow or PyTorch with minimal effort, at scale, and in a unified way. I may also want to bring in custom pre- and post-processing before and after the inference. Running inference on deep learning models can be slow on CPU, so I also want to accelerate inference by deploying models onto GPUs. GPUs are a powerful compute resource, but deploying a single model per GPU can underutilize it, so I want an easy way to serve multiple models behind a single unified endpoint that can scale to hundreds or thousands of models. To save compute resources, I also want to auto-scale based on the workload and scale down to zero when there is no traffic sent to the service. And to ensure safe production rollouts, I want to deploy models with zero downtime and be able to use multiple deployment strategies like shadow, canary, and blue-green rollouts.

To solve all of these problems, KFServing was founded as an open source project by Google, Seldon, Bloomberg, Microsoft, and IBM under Kubeflow. Kubeflow was the perfect meeting ground for these companies, and KFServing follows open source principles to build an open, cloud-native serving solution. KFServing provides a Kubernetes custom resource for serving ML models across deep learning frameworks with a simple, intuitive, and consistent experience. It encapsulates the complexity of auto-scaling, networking, health checking, and server configuration to bring cutting-edge serving features like GPU auto-scaling and canary rollouts to your ML deployments. It enables a simple, pluggable, and complete story for production ML serving, including preprocessing, prediction, and explanation, and it codifies the best practices of serving models on Kubernetes. With Knative, KFServing enables serverless inference and automatically scales up and down according to the inference workload. KFServing extracts common model serving features like request/response logging, model pulling, and batching into a sidecar agent, so that every integrated model server can benefit from them. KFServing deployments are immutable: every new deployment results in a new revision, and traffic is migrated over to the new revision only after the new pods are ready and passing the readiness checks, to ensure safe production rollouts. It also supports various other rollout strategies, such as canary and progressive rollouts.

The main KFServing components are the Istio ingress gateway, the Knative autoscaler, and the InferenceService. The Istio ingress gateway routes external requests to the InferenceService. Each InferenceService pod contains two or three containers, depending on the features enabled in the InferenceService YAML. The main containers in the InferenceService pod are the queue-proxy, the KFServing agent, and the model server.
After the ingress, the request first hits the queue-proxy, which enforces the concurrency limit and timeout. It then goes through an optional sidecar agent if request/response logging or batching is enabled. The request finally hits the model server for inference. The model server for each framework implements REST or gRPC handlers, and it can load multiple models. InferenceService pods can be auto-scaled based on CPU or on the inference workload with the Knative Pod Autoscaler (KPA).

The KFServing 0.5 release promotes the InferenceService API version from v1alpha2 to v1beta1. It supports standard inference for TensorFlow, PyTorch, scikit-learn, XGBoost, and other frameworks. It also provides an SDK for users to plug in custom components while still benefiting from all the common serving features of KFServing. The v1beta1 API further simplifies and enables a data-science-friendly interface while maintaining the flexibility the Kubernetes pod template spec provides. In just a few YAML lines, you can describe all the infrastructure you need to get your models up and running, and users can still specify advanced fields when needed. KFServing not only provides a unified interface for the control plane, it also tries to standardize the data plane protocol across ML frameworks, which we will cover later. Since 0.5, we have also added a multi-model serving capability to improve resource utilization.

For TensorFlow models, KFServing uses TF Serving as the underlying model server, which is a flexible, high-performance serving solution supporting the SavedModel format. TF Serving uses the TensorFlow REST and gRPC prediction protocol, which is similar to the KFServing V1 protocol. For PyTorch models, KFServing integrates TorchServe, which provides an easy way to serve both eager-mode and TorchScript models; TorchScript models can be served without a Python environment. KFServing is currently working with TorchServe to conform to the KFServing V2 prediction protocol.

In the v1beta1 API we mainly support three components: predictor, transformer, and explainer. The predictor is the required component, and the fields on the predictor map to Kubernetes pod template fields, such as replicas, service account, and node affinity. On the predictor, the user can specify the model framework, which naturally maps to container fields like command, arguments, and environment variables. The same applies to the transformer and explainer components.

NVIDIA Triton Inference Server provides a cloud inferencing solution optimized for NVIDIA GPUs. The server supports multiple deep learning frameworks, such as TensorFlow, PyTorch, and ONNX, with both REST and gRPC endpoints. It supports a model repository with versioning and allows multiple models to run simultaneously on the same GPU, with batching support. On the InferenceService spec, the storage URI points to the model repository, which can contain multiple models. As you can see from the spec, we also set the OMP_NUM_THREADS environment variable to improve inference performance and reduce resource contention: by default, PyTorch spawns as many OpenMP threads as there are cores available on the node, which could overload the CPU, since the default CPU limit for an InferenceService pod is one. You can also choose to schedule the Triton Inference Server pod on GPUs by specifying the GPU under the resources section, and you can add node affinity or tolerations to schedule onto a particular node, such as one with a T4 GPU.
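As a rough illustration of the kind of spec being described here, the following is a minimal sketch of what a v1beta1 InferenceService with a Triton predictor might look like when created with the Kubernetes Python client. The service name, namespace, storage URI, and resource values are hypothetical; only the general field layout follows the v1beta1 API discussed above.

    from kubernetes import client, config

    config.load_kube_config()

    # Hypothetical example: a Triton predictor on one GPU, with OMP_NUM_THREADS
    # capped so a PyTorch model does not spawn one OpenMP thread per node core.
    inference_service = {
        "apiVersion": "serving.kubeflow.org/v1beta1",
        "kind": "InferenceService",
        "metadata": {"name": "bert-triton", "namespace": "default"},
        "spec": {
            "predictor": {
                "triton": {
                    "storageUri": "gs://my-bucket/triton/model-repository",  # hypothetical path
                    "env": [{"name": "OMP_NUM_THREADS", "value": "1"}],
                    "resources": {"limits": {"nvidia.com/gpu": "1"}},
                }
            }
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kubeflow.org",
        version="v1beta1",
        namespace="default",
        plural="inferenceservices",
        body=inference_service,
    )

The same spec could of course be written as plain YAML and applied with kubectl; the point is that a handful of fields describe the model server, its environment, and its hardware placement.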
As you may know, customer service is an essential part of the Bloomberg Terminal. Our reps work very hard to provide the best customer service to our clients by answering questions with high quality and fast speed, and relevant content is pushed to the reps to help answer questions in the smart resource window. This use case is powered by our fine-tuned BERT model, deployed onto KFServing in production. For the fine-tuning stage, the data we use is categorized and annotated FAQs, which gives us half a million question pairs. The problem is formulated as question similarity, with two questions as input and a similarity score as output. The BERT model is saved using the SavedModel export API, so it contains the complete TensorFlow program, including the weights and the computation; it does not require the original model-building code to run, which makes it useful for sharing and deploying. As you know, the BERT model requires significant computation at inference time, and doing this on CPU can be slow, often taking seconds. This makes it challenging to meet the latency requirements of this real-time use case. Deploying the BERT model to meet all the production requirements is a challenging task: you need to take latency, throughput, scaling, health checks, safe rollout, and monitoring all into consideration. And how can you scale to serve hundreds of BERT models with your limited GPU resources?

First, to accelerate the BERT inference, we deploy the fine-tuned model on KFServing with the TF Serving predictor, which expects tensor inputs. Since the user input here is a question pair in text, we also deploy a KFServing transformer for sentence segmentation and tokenization. KFServing then automatically wires up the call to TF Serving. This way you can scale the transformer and the predictor independently, while still deploying them as a single unit. By deploying the model on GPUs, we achieved a 20x speedup. We also experimented with deploying the BERT model on NVIDIA Triton Inference Server, which achieved similar performance gains. We tested the performance on NVIDIA V100 GPUs with a 24-layer FP16 BERT SQuAD model. By deploying the BERT model on GPU, we can get latency down to around 15 milliseconds, whereas it usually takes a few seconds to run inference on CPU. On a single Triton Inference Server pod loaded with the BERT model, the latency increases linearly with the concurrency when the requests are not batched. We can scale up the Triton Inference Server instances automatically with KFServing. As you can see from the previous table, you get the best performance when container concurrency is low, so here we set the container concurrency to one and let the Knative autoscaler scale up the InferenceService pods automatically based on the in-flight concurrent requests within a time window. By setting the container concurrency to one, the InferenceService automatically scales up when a container reaches the max concurrency. In this experiment, the perf client generates inference requests to the model and measures the throughput and latency over a time window, repeating the measurement until it gets stable values. For example, when there are five in-flight requests, it scales up to five pods to handle the concurrency. From the results table, you can see that the average latency at higher concurrency can stay as low as it is when concurrency is low.
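To make the request path just described concrete, here is a rough sketch of what a client call to such a question-similarity service could look like, assuming the KFServing V1 prediction protocol. The host, service name, and payload fields are hypothetical placeholders, not the actual production interface.

    import requests

    # Hypothetical endpoint: the transformer tokenizes the raw text question
    # pair and forwards tensors to the TF Serving (or Triton) predictor.
    url = "http://bert-similarity.default.example.com/v1/models/bert-similarity:predict"
    payload = {
        "instances": [
            {"question1": "How do I reset my password?",
             "question2": "What is the procedure to change my password?"}
        ]
    }

    response = requests.post(url, json=payload, timeout=10)
    print(response.json())  # e.g. {"predictions": [<similarity score>]}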
There are some latency spikes, which are caused by the pod cold start time, which includes model downloading and pod startup time. When container concurrency is specified, the Knative activator is also injected into the request path to do smart load balancing and make sure a container is not overloaded with more requests than the concurrency limit. As you can see, the throughput does not scale perfectly linearly because of the InferenceService cold start time. One way to alleviate the issue is to cache the model on a PVC after downloading it from remote storage, so that the model can be mounted directly onto all the pods. GPU autoscaling based on GPU metrics can be hard; Knative implements a request-volume-based autoscaler which allows you to scale to and from zero on both CPU and GPU. The number of pods needed for the load is calculated from the in-flight requests and the concurrency target: if a container can handle five requests concurrently and there are 50 in-flight requests, then Knative automatically scales to 10 pods.

Batch inference is the process of aggregating inference requests and sending the aggregated request through the ML framework for inference all at once. KFServing was designed to natively support batching of incoming inference requests. This functionality lets you use your compute resources optimally, because most ML frameworks are optimized for batched requests. You can configure the max batch size and max latency in the InferenceService batcher section. The KFServing sidecar agent then knows the maximum batch size the model can handle and the maximum time the agent should wait to fulfill each batch request. On the InferenceService spec, the user can also enable scale-to-zero to save compute resources after the batch job is done.

Often you need both pre- and post-processing before and after the inference. KFServing provides an SDK for users to easily implement a transformer which communicates with the model server using the standardized data plane protocol, expecting tensors in and tensors out (a minimal sketch of such a transformer is shown below). The transformer and predictor can be deployed and rolled out as a single unit. In the future, KFServing also plans to ship a few out-of-the-box transformers, like text tokenization and image transformation. KFServing allows chaining a transformer with any inference server, so you can easily swap in a different model server as long as it speaks the same data plane protocol. The preprocess handler transforms the raw data into tensors according to the V2 data plane protocol, and the postprocess handler transforms the raw prediction into a response for downstream consumption. The transformer does not enforce a particular data schema; the input could be a list of text or images.

Deploying a single model on a GPU can underutilize the precious GPU resources. KFServing, TorchServe, and Triton all allow co-locating multiple models onto the same GPU in the same container. When two requests arrive simultaneously, one for each model, they can be scheduled onto the same GPU and both inference computations run in parallel. KFServing adds a TrainedModel custom resource to allow scheduling models onto an InferenceService at scale, which abstracts the model placement logic away from the user. The models placed on the same InferenceService custom resource can be accessed from the same service URL. With the KFServing multi-model serving design, we started to decouple the InferenceService from the model artifacts.
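As referenced above, here is a minimal sketch of what a transformer built with the KFServing Python SDK could look like. The class and method names follow the pattern of the KFServing SDK transformer examples, but the model name, the placeholder tokenization logic, and the payload layout are hypothetical and would need to match your actual predictor.

    import argparse
    import kfserving

    class QuestionPairTransformer(kfserving.KFModel):
        """Hypothetical transformer: turn a raw text question pair into the
        tensors the predictor expects, then shape the raw prediction for the
        downstream caller."""

        def __init__(self, name: str, predictor_host: str):
            super().__init__(name)
            self.predictor_host = predictor_host

        def preprocess(self, request: dict) -> dict:
            # Placeholder tokenization: a real transformer would run sentence
            # segmentation and WordPiece tokenization here.
            instances = [
                {"tokens_a": pair["question1"].lower().split(),
                 "tokens_b": pair["question2"].lower().split()}
                for pair in request["instances"]
            ]
            return {"instances": instances}

        def postprocess(self, response: dict) -> dict:
            # Map raw model outputs to a response for downstream consumption.
            return {"predictions": response["predictions"]}

    if __name__ == "__main__":
        parser = argparse.ArgumentParser()
        parser.add_argument("--model_name", default="bert-similarity")
        parser.add_argument("--predictor_host", required=True)
        args = parser.parse_args()

        transformer = QuestionPairTransformer(args.model_name, args.predictor_host)
        kfserving.KFServer().start(models=[transformer])

Only preprocess and postprocess are overridden; the SDK's base model forwards the preprocessed payload to the predictor host, which is how the transformer and predictor stay swappable as long as they speak the same data plane protocol.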
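And for the multi-model serving path just described, a TrainedModel custom resource could be created roughly as follows. This is only a sketch based on the alpha TrainedModel API: the group/version, model URI, framework, and memory values shown are assumptions for illustration, not an authoritative spec.

    from kubernetes import client, config

    config.load_kube_config()

    # Hypothetical TrainedModel: place "model-a" onto an existing multi-model
    # InferenceService named "triton-mms"; the memory hint guides the scheduler.
    trained_model = {
        "apiVersion": "serving.kubeflow.org/v1alpha1",
        "kind": "TrainedModel",
        "metadata": {"name": "model-a", "namespace": "default"},
        "spec": {
            "inferenceService": "triton-mms",
            "model": {
                "storageUri": "gs://my-bucket/models/model-a",  # hypothetical path
                "framework": "tensorflow",
                "memory": "1Gi",
            },
        },
    }

    client.CustomObjectsApi().create_namespaced_custom_object(
        group="serving.kubeflow.org",
        version="v1alpha1",
        namespace="default",
        plural="trainedmodels",
        body=trained_model,
    )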
TrainedModels can be placed onto an InferenceService until it reaches the configured memory limit. The KFServing TrainedModel scheduler can spawn new shards of the InferenceService to host models at scale. It also sends health check probes to each model endpoint to reflect the model deployment status on the TrainedModel custom resource. On the TrainedModel custom resource, the user assigns the model to a given InferenceService and specifies the model storage URI and the model framework. The memory resource estimation is used for scheduling models onto the InferenceService based on the capacity left on a given InferenceService shard, and the scheduler automatically spawns new shards to host new models if all current shards are at capacity. The TrainedModel scheduler decides which shard to place the model on and writes the model configuration into the shard's config map. The KFServing sidecar agent then reconciles the config map to pull new models, and to remove deleted models from the model repository that is mounted onto the pod. The agent then sends a signal to the model server to load or unload the model. After the model is successfully loaded on the model server, it can be accessed from a unified service endpoint, with the model name specified in the URL path.

While we were developing multi-model serving, we hit a few scalability limitations as well. For a single Istio ingress gateway, we hit limits at around 2,000 services, but what if we want to scale to 100K models? Can we support 100K TrainedModel custom resources? We are actually bound by the custom resource limits on etcd. And even if we can deploy 100K models across 2,000 services, we may still hit the limit on the number of virtual services we can create on the gateway for routing to the models. So phase two of multi-model serving is to address these limits.

We are excited that KFServing 0.5 has been released with the v1beta1 API and Triton with the V2 protocol, and you can check out the v1beta1 InferenceService docs. Multi-model serving is in alpha, and the next step is to make it production ready. KFServing is an open community, so we would love your contributions, and you are welcome to join our bi-weekly working group meeting. Here are the GitHub and example links. Thank you. I will hand over to David Goodwin for the second part.

Hi, my name is David Goodwin. I'm going to be talking about the KFServing inference protocol version two. You might have also heard this called the V2 data plane protocol. Before we talk about the specifics of the protocol, let's look at the domain where this protocol is important and is meant to be used. The diagram here shows a representative way of deploying a containerized inference server. In the gray box on the right, there is an inference server that is capable of performing deep learning or machine learning inferences on behalf of whatever clients there are. It sits inside some kind of deployment environment, it has access to the model repository, which holds all the different models it might need to serve, and there may also be some load balancers and auto scalers, et cetera. Those aren't required; this is just a representative example. And then on the left, you see there are some clients. These are clients that want to directly or indirectly take advantage of some deep learning or machine learning models.
And so they can communicate with, for example, an ASR, an automatic speech recognition service, that they want to use, which will in turn need to do some inferencing. There could be multiple of these workloads, and again there can be load balancers in the deployment. The protocol is concerned with how the clients communicate with the workloads, and also with how the workloads then communicate with the inference servers themselves. And the protocol is meant to be a single protocol that can be used in any of these areas.

So given that background, for the KFServing inference protocol, why do we need a standard? Well, like most reasons why you need a standard, you want the inference clients, which in the previous diagram were the things on the left that needed to use some inferencing, to be able to talk to multiple different servers or services, to increase their portability and make them more interchangeable. And on the server side, you want to allow as many different types of clients as possible, maybe ones that weren't originally written to use your server, to be able to use your server or service, which increases the utility of your service. And of course, you want them to operate seamlessly on platforms like KFServing, which has standardized around this protocol.

Some of the high-level requirements: support both ease of use and high performance, which we'll talk about some, and be extensible. The KFServing protocol has a core protocol, which is the required part, plus an extension mechanism that allows either for standard extensions to be added in the future, as optional parts of the core, or for server-specific extensions, where individual inference servers can decide to implement their own. And there needs to be both a GRPC implementation and an HTTP REST, JSON-based implementation, which we'll look at in a minute.

So I mentioned there's the core protocol and the extensions. The core protocol, which we'll go through in some detail, is required for all conforming servers. If you want to write an inference server and say that it supports the KFServing inference protocol, you need to implement all of these. They include endpoints to understand whether the server is live and ready, and to get metadata about the server. And then, for all the deep learning and machine learning models available on the server, whether the model is ready, and metadata about the model. And of course, the primary reason for doing all this: there needs to be an inference endpoint to allow you to actually perform inferencing. Extensions are optional, and currently the standard doesn't have any standard extensions, so there are no optional parts of the core. But we'll see that some extensions have been implemented by specific inference servers; in particular, the Triton Inference Server has implemented some extensions, and we're going to talk about those briefly.

Okay, so the liveness and the readiness endpoints. Together you could think of these as the health endpoints. On the server side, there are two different endpoints where you can see whether the server is live and/or ready to receive requests.
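As a quick illustration, checking these health endpoints from Python could look like the following. The server address and model name are hypothetical; the paths follow the V2 protocol's health endpoints.

    import requests

    base = "http://localhost:8000"  # hypothetical server address

    # Server-level health endpoints defined by the V2 protocol.
    live = requests.get(f"{base}/v2/health/live")
    ready = requests.get(f"{base}/v2/health/ready")

    # Model-level readiness for a hypothetical model named "resnet50".
    model_ready = requests.get(f"{base}/v2/models/resnet50/ready")

    # The HTTP status code carries the answer: 200 means live/ready.
    print(live.status_code, ready.status_code, model_ready.status_code)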
And so, for instance, in something like Kubernetes you can directly use these for the liveness probe and readiness probe. On the model side, there's just a ready endpoint, and that can be used to indicate whether the model is ready to receive requests. Here's an example: again, there's an HTTP REST-style version of this protocol, and for these endpoints it uses the HTTP status code to indicate whether something is live or ready. Here we're asking if the server is live and we get an OK back, so that means the server is live. The GRPC side is very similar; of course it's using the RPC style, but there's a ServerLive call that returns a boolean true/false response indicating whether the server is live.

All right, let's move on to metadata, another core part of the protocol. The server metadata is things like the name, version, and what extensions are supported. The model metadata is more interesting: for a given model on the server, you can find out what versions of the model are available — the protocol allows the server to have multiple versions of the same model — and then, maybe most importantly, what input tensors you are required to send into the model and what outputs you get back. For this example and going forward, I'm just going to show the JSON HTTP examples; the protobuf-based GRPC examples are very similar, just a different syntax, so I'm not going to bother showing both. Here you can see from the URL we're using, /v2/models/resnet50 gives us the metadata for the ResNet-50 model. We see that there's just one version, version one, available. It's a TensorFlow GraphDef model — the platform is the type of machine learning or deep learning framework or representation for this model, which is informative. One of the advantages of the protocol, of course, is that it's the same protocol: you use input and output tensors and communicate with the servers the same way no matter what the actual underlying machine learning or deep learning model is. But for informational purposes, the platform is there. For the inputs — in this case there's only one input and one output on this model — you get the name of the input, the shape, and the data type, which is very important, because as a client this lets you discover what shape and data type of data is expected. Similarly, the outputs tell you what output tensors are going to be returned to you, along with their shape and data type.

And now for the primary endpoint in the protocol, which of course is for inferencing. Again, this is just showing the HTTP REST style. Here's the endpoint: we saw /v2/models/resnet50 before, and for the ResNet-50 model we want to do an inference, so there's the /infer at the end. In this case, we're going to send in a picture I picked. ResNet-50, if you're not familiar, is a very famous image classification deep learning model. It takes in an image, and in this case you can tell just by the shape that this is 224 by 224
with three channels, so quite a small image. ResNet-50 will do a classification and tell you what that image is. In this case, it has been trained with the ImageNet dataset, which has about a thousand different objects it recognizes, and one of them you can see from here. On the left you see the POST: we send in the input and tell it the shape we're sending. I didn't list all the data here, because there's a lot of it; it would just be a JSON array, and you can tell from the shape that there will be 224 times 224 times 3 elements in this data array, which is a lot. And then the output coming back is the format that the protocol requires the server to respond with: repeating back the model name and version and then giving you the outputs. The output in this case, you can see its name and shape there — it's just a single element, and that single element tells you the classification of this item: it's a coffee mug. So it actually did quite well on that classification. By looking at this example, we can see that the inference part of the protocol, like the other parts I talked about earlier, is very simple. It's just the necessary basic information that needs to be sent to the inference server and back, all based around input tensors and output tensors with their data types and shapes.

So that's the core protocol we just talked about: the required parts of the protocol that every conforming server must implement. In implementing that, you'll have a server that can talk to any client that also implements the protocol, for any type of deep learning or machine learning model the server supports, sending in the tensors and getting back the results in a standard, seamless way, which satisfies the primary requirements of the protocol, or of any standard protocol for that matter.
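To make the ResNet-50 example above concrete, an inference request over the HTTP/REST version of the protocol could be issued from Python roughly as follows. The server address and tensor name are hypothetical, but the request and response structure follows the V2 inference protocol just described.

    import requests

    base = "http://localhost:8000"  # hypothetical server address

    # A 224 x 224 x 3 image flattened into a single list of FP32 values.
    # It is all zeros here only to keep the sketch self-contained.
    image = [0.0] * (224 * 224 * 3)

    infer_request = {
        "inputs": [{
            "name": "input",            # hypothetical tensor name from the model metadata
            "shape": [224, 224, 3],
            "datatype": "FP32",
            "data": image,
        }]
    }

    response = requests.post(f"{base}/v2/models/resnet50/infer", json=infer_request).json()
    # The response repeats the model name/version and returns the output tensors.
    print(response["model_name"], response["outputs"][0]["data"])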
So, the Triton Inference Server. Triton Inference Server is an open source inference server; I have a reference link later on if you're interested in taking a look at it. It implements both the HTTP REST and the GRPC KFServing inference protocols. It's multi-framework and multi-model, on both GPUs and CPUs. By multi-framework I mean that if you have a TensorFlow or a PyTorch or an ONNX model, all those types of frameworks are supported inside the inference server. It implements the core protocol, so it's a conforming server, and it also has some extensions, which I'll talk about briefly. The core protocol we just covered is completely sufficient for building an inference server; however, especially in the area of performance, it can be lacking, particularly for the HTTP protocol. So here are the extensions that Triton implements. There are per-model statistics: you'll notice that in the core protocol there was no way to find out from the server how many inferences had been performed for a particular model, or how long on average they were taking — statistics like that. Triton adds an extension that provides that information. There is also an extension for model repository management, so not only can you query which models are available on the server, you can also load and unload those models, or load new models and unload existing ones. Also as an extension in Triton, there's support for stateful inferencing: there are a lot of language models and other models implemented with, say, an RNN-type implementation where there is state in the model, so the order in which the inferences are received is very important, and there's support for that.

Now, the last two items are primarily about performance. For tensors, as we saw in the example above, you actually send the tensor data as part of the JSON message, and that can be very expensive to send over the network. So if you're communicating with the Triton Inference Server from another process on the same system, you can instead communicate via system shared memory or GPU shared memory. Then you don't have to send the raw data over the network, even the local network, and you don't have to encode it into JSON or into a GRPC protobuf; you skip that encoding and decoding and just access the data directly in memory. Similarly, the last one, which is an extension that only applies to the HTTP REST part of the protocol, is a way to communicate tensors using binary data, still going over the network rather than using shared memory, but in a much more efficient way. We'll talk about that in the next slide.

And I'll call this out: all the extensions I just showed are extensions, so they are optional and servers don't have to implement them. In fact, they're not part of the standard anyway; they're Triton-specific. Other servers could implement them — Triton is open source and the specs for all of these are published, and again I have a link later — but they are optional parts. However, I particularly want to call out the binary data extension, which makes the HTTP REST, JSON implementation high performance, because it resolves a very important problem. If you look at the output on the left, this is the same thing we did before, where we sent in a single small image, 224 by 224 with three channels, which is quite small, yet there are still about 150,000 floating-point numbers in this data array, which of course I didn't list in full. So that's a lot, and that's just for a single small image. The problem is that while this is very readable and very usable — it's easy to generate this JSON in a lot of different clients, and a lot of different languages and toolkits can already do it — to do so, you take this 224 by 224 by 3 representation of the image as floating-point numbers and first have to encode them into strings, basically print them, which is time consuming. Then you send them over the wire to the server on the other end, and the server has to read this JSON, which is basically just a string, and parse those strings back into floating-point numbers — 150,000 of them — just to do your inferencing. And so that becomes a huge bottleneck.
Basically, unless your model has very small input and output tensors, or you just don't really care about performance — it's a research model or something on the side — this is not really acceptable; it's not really usable for a production kind of deployment. But still, going back to ease of use, this JSON is very easy to generate, and there are a lot of advantages to doing it this way. So let's combine those, and the binary data extension does just that. On the right, you can see that with the binary data extension, the header part, the JSON part, is more or less the same, except you have this extra parameter in there which basically says: I'm not giving you the data here in JSON, I'm just telling you how big it's going to be — which is a little bit redundant, but that's done intentionally. And then, after the JSON message, comes the raw binary data. There is an extra header required in your POST so the server can figure out where the JSON metadata ends, separating it from the actual raw data. In doing this, you're still sending about 602,000 bytes of data — that's unavoidable, that's actually your image, you have to send it over — but there's no encoding or decoding overhead. The data can just be pulled right off the wire and sent right into the model. So, for instance, this is just one data point, running on a local network, which essentially removes the network part of the equation: if we send 128 of these requests, just to even things out and make sure there are no blips, the binary data transfer gives roughly a 17x speedup, which is a huge amount. So you remove that bottleneck from the problem. Again, that's an important example of how the protocol — through its extensions in this case, but in general — allows for ease of use, while the flexibility of the protocol lets you have extensions like this that, along with say the shared memory extension, give you great performance as well.

As I mentioned, there are a couple of references here in closing. One is to the protocol itself: you can find it in the KFServing GitHub, and comments are welcome, so take a look at it. And then for Triton, from this GitHub link you can find the full Triton repo, but the link takes you directly to the extensions and the inference protocols that Triton implements. Thank you.