My name is Jose Navarro, and this is Briana Galiz, and we are both on the machine learning platform team at Cookpad. Today we are going to talk to you about how NVIDIA Triton Inference Server can help you optimize three areas of your inference stack: performance, user experience for your machine learning teams, and cost. But first, let me introduce Cookpad a little for context.

Cookpad is the largest online community for home cooking lovers. We are making everyday cooking fun. But why? Well, the act of eating has a major impact on everyone's physical and mental health, and the choices we make when we cook also have a big impact on our planet. With those two in mind, we believe there is a big difference between creators and consumers. When you are creating, when you are cooking, your awareness starts to grow. You start to care about where your ingredients come from, or how the taste changes if those ingredients are in season or produced in a more controlled environment. And when people start caring, they tend to make better decisions, decisions that impact not only their health but also our environment.

We are a global online community, available in more than 70 countries and supporting more than 30 languages, which is important for understanding some of the challenges we face as an ML platform team. We have more than 100 million monthly users, and you can find more than 6 million recipes shared globally. In the app, you can browse recipes built around ingredients that are in season to get inspired, or you can follow authors like Craig, one of my favorites, who uploads amazing recipes. You can also search by ingredient, dish, cooking process, and so on.

Machine learning is more accessible than ever. In the past few years, with the adoption of ML practices and tools, we have removed most of the pain points of deploying ML in production. That means more ML teams have moved from working in isolation on proofs of concept towards being embedded in product teams, delivery teams, or feature teams, whatever name your company gives these types of teams. As a result, ML has moved from running heavy batch jobs in the background towards running more and more online inference. And with more complex ML models available, the infrastructure requirements to run online inference on GPUs come along with them, because we need that inference to run smoothly and quickly for a good user experience. Also, as a machine learning platform team, you probably don't want to lock your ML teams into a specific ML framework just to simplify your inference server. So multi-framework support is a requirement, while keeping the experience easy enough for ML engineers to deploy new models to production.

In the next few minutes, I'm going to tell you how Triton Inference Server can help you improve performance while keeping the user experience simple for your ML engineers, and also reduce cost. And I know what you're probably thinking: he's going to start talking about cost when you care about performance. But I promise you that cost is also a key element of performance. In my personal experience, and also from talking to other ML practitioners, running ML on GPUs is not particularly difficult. You add the right nodes to your autoscaling groups, you add the right tolerations to your deployments, you request a GPU, and you've got it: your application is running on a GPU.
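As a rough illustration of that point, here is a minimal sketch of a Kubernetes Deployment that lands on a GPU node; the taint key, labels, and image are hypothetical placeholders, not our actual manifests:

```yaml
# Minimal sketch: schedule a pod on a tainted GPU node and claim a whole GPU.
# The taint key, labels, and image below are hypothetical placeholders.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: embedding-model
spec:
  replicas: 1
  selector:
    matchLabels: { app: embedding-model }
  template:
    metadata:
      labels: { app: embedding-model }
    spec:
      tolerations:
        - key: nvidia.com/gpu      # lets the pod schedule onto the GPU node pool
          operator: Exists
          effect: NoSchedule
      containers:
        - name: server
          image: registry.example.com/embedding-model:latest
          resources:
            limits:
              nvidia.com/gpu: 1    # GPUs cannot be requested fractionally
```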
The challenge is that you have to balance the user value you add by deploying a new feature against the business value you get in return and the cost, which can be quite high. And why is that? In this simplified example, let's say I want to deploy a model application onto a CPU node. I deploy my application and request an amount of resources, and if I have done my homework correctly, my application will utilize a good portion of them, leading to a well-utilized and healthy cluster. And what happens if I want to deploy a new model? Well, since CPU and memory are resources you can request a fraction of, your new model will share the same node as your previous one. As a result, you have added new value for your users by deploying a new feature, you have probably improved some business metrics, and the infrastructure cost has stayed stable. Happy days.

But what happens if your model requires a GPU for inference? As we have heard earlier today, you cannot request a fractional amount of GPU resources. If your model requires a GPU, you have to request the full GPU for it, and that means low utilization. But OK, let's say the feature is worth it. You pay the price and deploy your model. What happens the next time you want to deploy a new model? Exactly: you have to request another GPU. And remember when I said that we are a global community in more than 70 countries, supporting more than 30 languages? No matter how simple the feature we want to deploy is, there is very little chance that we can train one model that will perform well in all of those regions and languages. So for every feature we want to deploy, we end up with three, four, five models to cover the most popular regions or languages. A decent GPU card at our cloud provider is around $3 an hour, which takes you to over $2,000 a month. If you have to deploy four or five of them every time you ship a new feature, we are talking about considerable money. And if you are thinking that maybe we could amortize the cost with multi-GPU instances: the cost of a node with four GPUs is exactly, or very nearly, four times that of a single GPU, so you cannot reduce the cost that way either.

However, if you deploy Triton Inference Server on a GPU, you can concurrently host multiple models on the same GPU. And if you deploy it in an environment with multiple GPUs, Triton will replicate those models on each GPU so that it can balance the inference compute across them to maximize utilization. You're probably wondering what this means, whether it's a GPU on a walk in the countryside or a GPU deploying Windows XP. What I'm trying to say is that Triton gives you a happy path to deploying models on GPU in a cost-effective way. And that is the first performance gain you can get with Triton: every model you deployed on CPU because GPU was too expensive, you can now migrate to GPU.

But Triton also comes with a few other options for optimization. If your model allows batching, you can easily configure dynamic batching for your model, so that Triton accumulates a number of individual requests and builds a larger batch that computes more efficiently than running the requests individually. Dynamic batching is configurable: you can select preferred batch sizes, and you can set a maximum delay so that Triton runs inference on whatever batch it has accumulated once that delay is reached. What's more, you can preserve ordering, so that Triton responds in the same order the requests arrived.
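For reference, here is a minimal sketch of what those dynamic batching settings look like in a model's config.pbtxt; the batch sizes and delay are illustrative values, not our production settings:

```
# Sketch of a dynamic batching block in config.pbtxt (values illustrative).
max_batch_size: 32
dynamic_batching {
  preferred_batch_size: [ 8, 16 ]      # batch sizes Triton will try to build
  max_queue_delay_microseconds: 100    # run whatever has accumulated after this delay
  preserve_ordering: true              # respond in the order the requests arrived
}
```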
Moving on from dynamic batching, another interesting optimization is model instances. You can easily configure Triton to replicate each model a number of times on a given GPU. That allows Triton to overlap the transfer of data to and from the GPU with the inference compute. Finally, you can also combine both for your model, configuring dynamic batching and model instances together.

Earlier I also said that Triton would help you improve the user experience. In the worst-case scenario, very simplified, if our ML engineers want to deploy a new model, they have to create several Kubernetes resources for it: a Deployment, a Service, a ConfigMap, a ServiceAccount to grant some permissions, an HPA to make sure we scale dynamically under certain conditions, and a PodDisruptionBudget to make sure we always run a minimum number of replicas. And allow me to repeat myself about the challenge of the number of languages and regions we support: for every feature, we end up replicating those resources. Well, it's not that bad, really, because using an open-source tool called Kustomize, we are able to template the base of the application, so deploying a new model comes down to just patching a ConfigMap and so on. But that doesn't resolve all the issues, because if you also need to modify the resource allocation depending on the region, you end up patching more and more files, and it gets complex.

Now, with Triton, our machine learning engineers don't have to worry about which ML framework they used to train their model, because Triton supports all the backends we need, like PyTorch, TensorFlow, TensorRT, and ONNX. They only need to package the model following a layout so that Triton understands which backend to use, and that process is well documented for Triton. Then, once the model is available in the bucket, they only need to create a PR to modify the right config file, and our cluster automation will do the rest.

So, to summarize: a big thumbs-up to NVIDIA for creating Triton Inference Server, which has enabled us to improve the performance of our models, sharing GPU resources and improving our GPU utilization, while keeping the user experience easy for our machine learning engineers. And with that, I'll leave you with Pry, who is going to demo some of these features. Thank you very much.

This will not be much of a demo, more a walkthrough of how we deploy our models on GPU and how that performs versus our previous deployment on CPU. We will be using Apache Bench for some simple benchmarking in this section, and we'll first show you how easy it is to deploy a model on Triton. But even before that, a little context for our session today. We are going to measure a PyTorch-based image transformation model: it accepts image data and returns that image's representation in embedding space. So the input is a 224 by 224 by 3 multidimensional array, and the output is 300 floating-point values. For those of you who've deployed a machine learning model before: in some cases, a GPU is not even necessary for real-time inference. But the one we are going to show you is the kind of model that gets the most benefit when deployed on a GPU.
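To make that workload concrete, here is a minimal sketch of this kind of embedding inference in PyTorch; the ResNet backbone and the 300-dimensional head are illustrative stand-ins, not the actual Cookpad model:

```python
# Sketch of the measured workload: one image in, a 300-dim embedding out.
# The backbone and projection head are illustrative stand-ins.
import torch
import torchvision.models as models

model = models.resnet50(weights=None)                  # torchvision >= 0.13 API
model.fc = torch.nn.Linear(model.fc.in_features, 300)  # 300-dim embedding head
model.eval()

# One 224x224 RGB image; PyTorch expects channels-first (N, C, H, W)
image = torch.rand(1, 3, 224, 224)

with torch.no_grad():
    embedding = model(image)

print(embedding.shape)  # torch.Size([1, 300])
```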
And this kind of model is actually one that we actively use at Cookpad. So, the setup: the CPU-based deployment is a simple Python service that loads the model using PyTorch. We deploy it on a compute-optimized EC2 instance, so we get a beefier CPU, because inference on CPU is mostly a CPU-bound task. For the GPU-based one, we drop the model into Triton, but as you can see, we add a front-end Python service in front of it, just to make it easier for Apache Bench to hammer it. This does add network overhead between Apache Bench and Triton, but you'll see next that it's not significant, and in real-world deployments you usually need a place to pre- and post-process the input and output of your model anyway. In fact, this is how we end up deploying all our services at Cookpad.

You're probably familiar with how to load a model and serve it using Python, so let's move on to how to deploy it on Triton. A few slides ago, Jose showed you this: at Cookpad, our teams have to provide their own manifests when they want to deploy a service. There are some tools available to abstract that away, but we're not currently using them; the reason for that is a whole other story. For Triton, because we as a platform team provide and manage the deployment, we've taken those manifests out of the equation, so what's left for any team that wants to deploy a model is just to package the model, put it somewhere accessible, in our case a private S3 bucket, and put the S3 URI of that package in our shared config.

One thing to note is that we need to package each model with a structure that Triton can work with, and it looks like this. It's very simple; those who've worked with TensorFlow Serving are probably familiar with this structure, since it's what comes out when you save a model with TensorFlow. At the root is the name you want to expose the model as. Next is the version number of the model, and then the model file itself. You can put a model trained with any of several kinds of platform here, and this is the minimum you need, excluding the config.pbtxt, which is optional: Triton will try to make sense of your model and generate that file for you. But it's there if you want an explicit configuration of your model, for example like this. As mentioned before, you can deploy a model from multiple platforms, multiple frameworks; Triton supports TensorFlow, PyTorch, ONNX models, and more. One of the models we use is a TensorRT model, which is an optimized, compiled version of a model, so you need to declare the platform at the model level and Triton will load it. After you specify the platform, you specify the input and output signatures, and this is also where you put your model's inference tuning config, which we'll see later in the presentation.
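As a sketch of the layout and explicit configuration just described; the model name, tensor names, and dimensions are placeholders (for the PyTorch backend, the model file would be a TorchScript export):

```
image-embedder/           # root: the name you expose the model as
├── config.pbtxt          # optional explicit configuration
└── 1/                    # version number
    └── model.pt          # the model file itself
```

And a corresponding explicit config.pbtxt might look roughly like this:

```
name: "image-embedder"
platform: "pytorch_libtorch"   # tells Triton which backend to use
max_batch_size: 8
input [
  {
    name: "INPUT__0"           # the PyTorch backend's positional naming convention
    data_type: TYPE_FP32
    dims: [ 3, 224, 224 ]      # per-request shape, batch dimension excluded
  }
]
output [
  {
    name: "OUTPUT__0"
    data_type: TYPE_FP32
    dims: [ 300 ]
  }
]
```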
OK, so compress the whole directory and upload it to S3. What the team needs to do next is just add one line specifying where the model is, and push the changes. Then let your CI/CD process that file, which in our case lives in a repository full of Kubernetes manifests; we use Flux to synchronize those, and we wait for it to roll out. This is usually when I go procrastinate and browse stuff on the internet, but you don't want to see me doing that today.

Now we're going to see how they perform. A simple test: at the top is the CPU deployment, and at the bottom is the GPU deployment. It takes quite a while, so let's skip ahead. These are the baseline numbers. We start with no concurrent requests at all, single-threaded; these numbers are as fast as we can get out of these services. It was already fun looking at this when we first deployed Triton. And remember, this includes the network overhead between our front-end service and Triton. Now let's try two concurrent requests. It takes a little longer than the previous run, so we'll skip again. And this is the result: not much changes for Triton, which is good, but as you can see, the one running on CPU has already doubled its latency.

Now let's look at the resource utilization of these two services. We'll leave the single-threaded benchmark running a bit longer in the background, and this is what we get. Let's break it down. At the top, you see CPU and memory usage for both deployments. The CPU deployment uses one, which is 100% of the single CPU available to it. We don't limit the CPU resource for this one, so that's the most it can use; inference on CPU here only uses one CPU at a time. On the GPU side, we are only using 50% of a single GPU, so there is a little room left there. On memory, we see the reverse: the CPU service only uses around 200 megabytes, while Triton uses almost twice that. But that's GPU memory, not RAM on the server itself, so it's not an apples-to-apples comparison; still, it gives you a sense of how to plan your capacity. And the fun part: at the bottom, we get almost 10 times the throughput and 10 times lower latency out of the GPU. So this is the fun bit: double the memory, half the processing capacity used, but 10 times the throughput.

Of course, it's not all magic; at some point you exhaust the resources. As you can see here, starting at two concurrent requests, the GPU is at 100% and the request rate is already maxed out at around 100 requests per second. Each time you double the concurrency, which you can see at the two arrows there, latency only increases, because requests are waiting in the queue. To be honest, at this point this is already good enough for us; we could usually just wrap it up here and move on to the next model. But I do understand the need to scale, and from this point you have multiple options. The easiest one is horizontal scaling, which is self-explanatory. Another is vertical scaling onto bigger nodes: here we are using a single GPU on the node, but if you use multiple GPUs, Triton will spread the models across whatever GPUs are available, so you get parallel processing. The other two are the ones Jose mentioned previously. First, you can specify an instance group, so that you deploy multiple instances of your model on one GPU, if you have room in the GPU capacity; Triton will provision multiple instances of the model, and you get parallel processing. Second, if your model allows dynamic batching, that's the next thing you can use to squeeze more throughput out of it: you specify the config on the right to tell Triton to pool the incoming requests and execute them in one pass, which is more efficient in memory and data transfer between the server and the GPU. You can also combine these techniques to get more out of your deployment.
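As a sketch of those last two options combined in a config.pbtxt; the instance count and batch settings are illustrative:

```
# Sketch: multiple model instances per GPU plus dynamic batching (values illustrative).
instance_group [
  {
    count: 2        # two copies of the model on each available GPU
    kind: KIND_GPU
  }
]
dynamic_batching {
  preferred_batch_size: [ 8 ]
  max_queue_delay_microseconds: 100
}
```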
And that's it from us. Thank you for your time. It's time for Q&A, and we are hiring. And you wanted to mention something? Yes: we will be having some drinks later next to the City of Arts and Sciences, so if anyone wants to join us to talk about ML, cooking, or anything else, come and see us and we'll send you an invite. We've got some budget from the company to pay for the first round. OK, we're ready to take questions.

Hi. Which backends are currently supported in Triton?

You've got PyTorch, TensorFlow, TensorRT, ONNX, and you can use scikit-learn models as well.

But in order to run PyTorch on Triton, you need to convert the model to TorchScript or something like that, right? Did you do something like that? Did you need to convert your model?

For PyTorch, for this demo, we tried two things, actually: TorchScript itself and then TensorRT. And yes, you're correct: for TensorRT, you need a PyTorch model that supports TorchScript. So if everything works with TorchScript, basically you can use that.

Does it work straightforwardly? Did you have any problems with that?

Identifying which parts of the graph are supported and which are not, that's the hard part. But once you isolate them, you can take the unsupported parts out into pre-processing and move the purely supported operations into the model.

Thank you very much. Any other questions?

Triton Inference Server is actually one of the few serving frameworks that support mixed precision on TensorFlow. Have you done any benchmarking of mixed precision versus TensorRT in floating point 16?

Let me repeat that. So, benchmarks?

The difference between the floating point 16 operation on TensorRT and mixed precision on TensorFlow, because they are kind of comparable, but not entirely. Have you done any benchmarks between those models?

Yeah. So, for different floating-point precisions, 16 versus 32 for example, what we do is check the difference between the outputs of the two models and compare that difference against a value we accept. So there are differences. I believe for this particular model it was on the order of 10 to the power of minus 9, and that's without reducing the floating-point precision: just going from floating point 32 to floating point 32, from PyTorch to TensorRT, you get that much difference.

And what about the accuracy difference between floating point 32 and floating point 16? What was the scale there?

There's not much difference; it's not very significant, but I don't remember the exact number.
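For context, here is a minimal sketch of that kind of output check, assuming both model variants are loaded in PyTorch; the tolerance is illustrative, not a universal threshold:

```python
# Sketch: compare a full-precision and a reduced-precision model on the same
# input and check the worst-case output difference against an accepted value.
import torch

def max_output_difference(model_a, model_b, example_input):
    with torch.no_grad():
        out_a = model_a(example_input).float()
        out_b = model_b(example_input).float()
    return torch.max(torch.abs(out_a - out_b)).item()

# e.g. diff = max_output_difference(fp32_model, fp16_model, torch.rand(1, 3, 224, 224))
# accept the converted model only if diff stays below the value you decide on
```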
OK, thank you. Any other questions?

Hi, thank you for the talk. I'm wondering: there are more inference serving frameworks out there, so what made you pick Triton Inference Server?

Yes. We found Triton Inference Server's ability to host multiple models on a single GPU to be the killer feature for us, because it allows us to share the cost between several different applications. The decision between deploying a feature or not then comes back to just weighing the value you add for the users against the value you get back, because the cost element of the decision has been reduced: you can now share the cost across multiple applications, as we do with CPU workloads. The other reason we chose Triton is that it supports multiple frameworks: you cannot force your machine learning teams, who have their own preferences, to use just one framework. In our case, we initially had PyTorch and TensorFlow deployed on our cluster, and combining those into one stack, instead of having to manage two deployment stacks, is beneficial to us.

Thank you. All right, thank you very much.