We're speaking today about deploying machine learning at scale with things like TensorFlow Serving and Kubernetes, and how you can pair the two to deploy machine learning models at scale. About me: hello, I'm a high school student and I've been working particularly with machine learning, especially computer vision. I've contributed to multiple projects, including but not limited to TensorFlow, Kubernetes and more, and you'll often find me working on or contributing to open source projects. So that's about me; let's get on with the topic.

Today we'll be talking about making machine learning deployments, because the truth is most models don't get deployed. This can be due to multiple reasons, and I will not be covering all of them today, but a lot of models don't get deployed. If you put numbers to this, as it turns out, around 90% of models never make it to deployment. Some of that is simply because machine learning is an iterative field, so you usually need to try multiple times before a model is worth shipping. But a lot of that number also comes from there not being consistent ways to deploy, and from the multiple hurdles you face while deploying. A prime example to prove that point is this image. It shows that apart from the modeling code, that is, all the code you write to create and train a model, there is a lot more that goes into properly deploying a machine learning model, and sometimes all of that is harder or more time consuming than writing the training code or training the model itself. This diagram is an apt representation: there are multiple pieces you need in order to deploy a model, and the modeling code is just a tiny part of it, the small highlighted square in the middle. A really big part is the model serving, or serving infrastructure, piece, which is exactly what we will be looking at today: how you can easily stand up serving infrastructure, scale it as required, and what it even means to have a proper serving infrastructure and what all you need in it.

So the first question I should answer is why you should care about machine learning deployments. There are a lot of tasks at hand which simply are not there when you are just training your model, and we'll take a look at some of those tasks right now: what you need to actually make a deployment, why this process is so different from, let's say, the other parts like collecting the data or training the model, and why you can't just reuse your experimentation setup. You might do your experimentation for a machine learning model in a Jupyter notebook, so why can't you have a Jupyter notebook out there as your deployment? We'll talk about all of this right now. With that, let's first talk about everything you need to take care of in order to make a successful deployment. The first thing you need is to be able to package the model. What this means is that you ideally don't want a Jupyter notebook sitting out there serving predictions; instead you want to export the model in a form that is optimized, that you can make predictions from, and that users can actually use. So the first thing is, of course, that we need to package our model.
Packaged, that is, in a form users can actually consume, not just some directories of experiment artifacts. Next up, we need to host the model on a server. A user, say an Android user, might be calling our model from an Android application, or from a website, or from wherever they want, so we should expose an API and host the model on a server. Once you host the model on a server, then comes the part of maintaining that server, and among the many things that need to be done to properly maintain it, a few are: auto scaling, so if you suddenly end up with a ton of users, you need to be able to scale your compute and your model needs to be able to serve all of those users; and global availability, meaning you should be able to serve requests made from any part of the world, with as little latency as possible. These are just some of the problems you face while maintaining the server, but they give you a picture of what it takes to make a successful deployment and why it is so different from just training the model.

Now that you have hosted your model on a server, you also need a consistent API to get predictions from it. Here I should clarify what I mean: when I said deploying our models with TFX and Kubernetes, I meant server-side deployment. The goal is to have an API that can be used to get predictions from a model, with the model running on a server, not running the model on device, on the client side. Those are their own kind of deployments, and yesterday at Open Source Summit Latin America I actually gave a talk about making edge deployments, but this talk is not about running the model on device. That is a nice use case, but a lot of use cases are about running the model on a server and just using an API to get predictions from it, and that is what we'll be talking about today. So we want a consistent API to get predictions from the model and to be able to use the model virtually anywhere.

You also need model versioning, because, as we said, machine learning is quite an iterative process. Let's say tomorrow you get feedback from the devices from which you call your machine learning model; you may need to retrain the model a bit, or change the distribution of the data the model is trained on, or otherwise change the model. So having consistent model versioning is also of prime importance. And finally, among many other things, we also want to be able to do batch predictions. This is something quite specific to machine learning applications: batch prediction essentially means executing multiple requests together. We'll take a better look at batch predictions and how you can do them later, but they are especially useful if you want to conserve your hardware, that is, use your hardware resources in the best way possible. So we also want the API to support batch predictions. As you can see, there are actually a lot of things you need to take care of while deploying a model.
And these are just some of the main things you need to take care of when deploying a model; there may well be a ton more, which you should also take a look at sometime. But these are the main ones, and it already seems like quite a lot of work. It is quite a lot of work. We'll see a demo later of making a deployment with all of this in place, so you can support all of this in a machine learning deployment. But before I talk about how you can do that, I want to go over simple deployments and talk a bit about why they become inefficient, say for larger models, or when you get a lot of users, or a lot of model versions.

First of all, what do I mean by simple deployments? It could be something as simple as using Flask to deploy a model. Here what I have is a Flask application; a lot of the functions are not written out here, this is just a highlight of the classify function which does the actual prediction, and it assumes you have already written a preprocessing function and so on, so it's template code, not complete code, of course. But this is a very simple Flask application that has a route called /classify; it takes a POST request, gets the data, runs the model on it, and whatever the model sends as an output, it simply returns as JSON (a rough sketch of what such an app might look like follows below). And while this works, there are multiple caveats with it. First, of course, you have no consistent API. If tomorrow you wanted to create an API route in a completely different and inconsistent manner, there is nothing stopping you from doing so; you could have a route which is completely different from the rest of your routes. Second, this does not support model versioning. You would have to manage model versions manually, which is a ton of pain, or pull in some other tools. There is no support for things like letting some users keep using a previous version when a new version of the model is out, or efficient versioning, or automatically pulling the model from, say, a GCP bucket or an Amazon S3 bucket. Third, we do not have mini-batching support. In a sense you could manually write mini-batching code, but with this kind of deployment that would be inefficient and a ton of work as is; I have honestly never tried mini-batching in a Flask application. And needless to say, this is quite inefficient if you have a large model. So this was an example of a simple deployment and why such deployments are often inefficient for larger models or when you have a lot of requests coming in. What I want to talk about next is how you can get past this: how you can deploy a model without having to worry about building all of the things we mentioned earlier yourself.
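To make the caveats concrete, here is a minimal sketch of the kind of Flask app described above. It is not the exact code from the slide: the preprocess helper, the model path, and the response shape are all illustrative assumptions.

```python
# A minimal sketch of the kind of Flask deployment described above.
# Assumptions: a saved Keras model at "model/" and a preprocess() helper that
# turns the request payload into a batch the model accepts; both are
# illustrative, not the exact code shown on the slide.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

app = Flask(__name__)
model = tf.keras.models.load_model("model/")  # loaded once at startup


def preprocess(payload):
    # Turn the JSON payload into a numpy batch; placeholder logic.
    return np.asarray(payload["instances"], dtype=np.float32)


@app.route("/classify", methods=["POST"])
def classify():
    data = request.get_json()
    batch = preprocess(data)
    predictions = model.predict(batch)
    # Whatever the model sends as output is simply returned as JSON.
    return jsonify({"predictions": predictions.tolist()})


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```

Everything the talk lists as missing (consistent routes, versioning, mini-batching, pulling models from a bucket) would have to be bolted onto this by hand.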
With that, we come to TensorFlow Serving, which I'll introduce in this talk, and we'll also see a demo of it later. TensorFlow Serving is part of the TensorFlow Extended (TFX) ecosystem. The TFX ecosystem lets you deploy models and handle all of the deployment-related processes very easily, and a core part of it is TensorFlow Serving, which takes care of the serving side, managing everything that goes on with the APIs, and makes model serving itself very easy. So TensorFlow Serving can really help us here. The TFX ecosystem is also used internally at Google for their own products. And as I was saying earlier, TensorFlow Serving plus Kubernetes makes deployment a lot easier, especially when the two are paired: you can easily approach many of the common problems deployments face, such as auto scaling, latency, creating and managing services, and managing load balancers, and handle them with Kubernetes. I assume a lot of you might have already tried Kubernetes and found out how easy it makes these things. So TensorFlow Serving paired with Kubernetes makes machine learning deployments particularly easy, while also giving you the benefits of TensorFlow Serving itself: consistent APIs, in-built support for mini-batching, and so on. It is essentially built for machine learning deployments, so you can make the best use of it.

Before we go ahead, we'll now start talking about how you can use TensorFlow Serving and how you can put it in a Kubernetes cluster, and later in the demos we'll also see how to put it in a Kubernetes cluster, create load balancers, and actually make predictions against TensorFlow Serving. If you remember what we are trying to do, the first part is to export, or package, the model. I'll take TensorFlow models as the example today, but you can most certainly use other frameworks as well; you can serve models in the ONNX format with TensorFlow Serving too, for example. The process is pretty similar, and you can carry over most of the learnings from this talk to deployments with other frameworks, but for this talk I'll just use a TensorFlow model, which can be anything. The first thing we want to do is export our model, because we don't just want experiments, we want a model we can run inference from. TensorFlow has the SavedModel format, which makes it easy to export a model: it essentially takes the computation graph, a directed acyclic graph, and writes the graph definitions out as protocol buffers. The SavedModel directory consists of an assets directory, which contains any auxiliary files, for example vocabularies; in some cases there can also be an extra assets folder, which is not loaded by the graph and can hold any higher-level files. We also have the variables directory, which, as the name suggests, contains all the weights, the variables, of the model. And finally we have the graph definition, a saved_model.pb file in the protocol buffer format (a small sketch of exporting a model this way follows below).
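As a concrete illustration, here is a minimal sketch of exporting a model in the SavedModel format. The tiny Keras model and the export path models/test/1 (TensorFlow Serving expects a numeric version directory under the model name) are illustrative assumptions, not the model from the slides.

```python
# A minimal sketch of exporting a model in the SavedModel format.
# Assumptions: a small illustrative Keras model; the export path
# "models/test/1" (model name plus a numeric version directory, which
# TensorFlow Serving expects) is just an example.
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(4,)),
    tf.keras.layers.Dense(8, activation="relu"),
    tf.keras.layers.Dense(3, activation="softmax"),
])

# Writes saved_model.pb (the graph definition), a variables/ directory with
# the weights, and an assets/ directory for any auxiliary files.
tf.saved_model.save(model, "models/test/1")
```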
So now that we have packaged the model, the next part is to see how you can use TensorFlow Serving to deploy it. This is a very simple example of running TensorFlow Serving locally; of course, you'd want to put it in a Kubernetes cluster to also get the benefits of Kubernetes, but let's first see how to run it locally. We use the tensorflow_model_server command; there is also a container image for this, which we'll actually use later, and it works in exactly the same way. So I call the tensorflow_model_server command and pass in the REST API port. Another interesting part is that this supports REST and gRPC APIs side by side very easily, so you could support a REST API and gRPC in a single deployment, and you can add gRPC support here in the same way. You give the model a name, and this will also show up in your API path. We talked earlier about having consistent API paths: your model name and the model version you are using all appear in the API path, and the way you call the API is always consistent. You also pass the model base path; this could be a local directory, or you could very well have it pull the model from a Google Cloud Storage bucket, or from S3, or essentially any storage. It also supports polling that location, and we'll see how it handles the model versioning part a bit later.

So this is the template for a very, very simple deployment; right now it only serves the REST API, and you would just add one more flag to also serve gRPC. And look at what this very simple deployment has already done for us: it gives us consistent APIs to call any model, it has model versioning in place, and it lets us support gRPC and REST simultaneously. With this deployment, if you wanted to run inference over the REST API, you could very easily just make a POST call. You can ask for a particular version of the model, or get the latest version, or pin it to a specific version you want to run. And as you can see, the API URLs are pretty consistent as well: we have the model name, which, if you recall from a couple of slides ago, was simply the plain old "test"; we have the port, 8501, which is the REST API port we defined earlier; and we can also put in the model version if we want output from a particular version (a small sketch of such a REST call follows below).
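Here is a minimal sketch of what such a REST call can look like from Python. It assumes the server above is running locally on port 8501 with the model name test, and the input values are just sample data matching the toy model from the earlier sketch.

```python
# A minimal sketch of calling the TensorFlow Serving REST API with the
# requests library. Assumptions: the server is running locally on port 8501
# with --model_name=test, and the payload values are just sample data.
import json

import requests

# Latest version of the model:
url = "http://localhost:8501/v1/models/test:predict"
# Or pin a particular version instead:
# url = "http://localhost:8501/v1/models/test/versions/1:predict"

payload = {"instances": [[1.0, 2.0, 3.0, 4.0]]}
response = requests.post(url, data=json.dumps(payload))
print(response.json())  # {"predictions": [[...]]}
```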
In my example here, it's not even well-formed data, just some sample data, and I simply make a POST request to that URL. In this case I'm getting the inference from the latest model version rather than pinning a particular version, but as I showed on the earlier slide, you can do that too. So that's how simple inference with REST is, and that is already a proper deployment.

I also talked about being able to run inference over gRPC, which is likewise supported right out of the box with TensorFlow Serving, and we'll see an example of this in the demos later. gRPC essentially gives you better connections, and it's also faster, since your data is converted to protocol buffers and every request and response has a designated type; you also use gRPC stubs. Under the hood, your data is converted into tf.Example records and sent to the server, and those already have specific types designated. Another interesting thing you could pair this with, although it's a bit out of scope for this talk, is TensorFlow Transform, which lets you run the pre-processing steps on the server; that is also part of the TFX ecosystem, and we won't cover it here, but it's pretty interesting. So right out of the box you get multiple benefits with the gRPC API, and you can use it alongside REST (a rough sketch of a gRPC client follows below).

Another thing we get, and this is again one of the machine learning specific parts of the deployment, is an API to get the model's meta information. A model's meta information contains the signatures of the model, which tell you how you can provide input to the model and get outputs from it, what types of inputs the model expects, and so on. This can be particularly useful for model tracking and telemetry systems, or you can use it just to understand what kind of inputs and outputs the server expects. So you also have a model metadata API, and you can simply send a request to the model's /metadata endpoint to get all of this meta information.
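Here is a rough sketch of a gRPC client, assuming the tensorflow-serving-api package, the gRPC port 8500, the model name test, and an input tensor called inputs; the actual tensor name depends on the signature of your exported model. A comment at the end also shows the REST metadata endpoint just mentioned.

```python
# A rough sketch of running inference over gRPC. Assumptions: the
# tensorflow-serving-api package is installed, the server's gRPC port is 8500,
# the model is named "test", and its serving signature takes an input tensor
# called "inputs" (the tensor name depends on your exported model).
import grpc
import numpy as np
import tensorflow as tf
from tensorflow_serving.apis import predict_pb2, prediction_service_pb2_grpc

channel = grpc.insecure_channel("localhost:8500")
stub = prediction_service_pb2_grpc.PredictionServiceStub(channel)

request = predict_pb2.PredictRequest()
request.model_spec.name = "test"
request.model_spec.signature_name = "serving_default"
request.inputs["inputs"].CopyFrom(
    tf.make_tensor_proto(np.array([[1.0, 2.0, 3.0, 4.0]], dtype=np.float32))
)

# 10 second timeout; the response tensors are keyed by the signature's outputs.
response = stub.Predict(request, 10.0)
print(response.outputs)

# The model's meta information (signatures, expected inputs and outputs) is
# also available over REST:
#   GET http://localhost:8501/v1/models/test/metadata
```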
Next up, let's talk about batch inference and how TensorFlow Serving supports it. The idea behind batch inference, as I mentioned earlier, is to simply take multiple requests and process them together. This lets you use your hardware efficiently, especially if you have hardware accelerators, and it also saves you cost and compute resources; it's really useful if you are serving larger models. Batch inference is also highly customizable with TensorFlow Serving: there are multiple parameters you can configure, including what the maximum batch size should be, the number of batch threads to use, and so on. You simply put these in a batching configuration and give it to the TensorFlow model server, which loads the configuration file on startup, and of course you can change any parameter according to your own use case. So batch inference is supported right out of the box, and on the right of the slide is a very simple example of enabling it with the parameters we just went through (a rough sketch of such a configuration also follows below). And it's not just batching you can configure this way: you can also set things up so the server pulls models from wherever they live, say a GCS bucket or an S3 bucket. You could, for example, have it check your bucket for a new model version every 10 seconds and deploy the new version very easily, and since we already saw how well model versioning is supported and how consistent it is, this is also particularly easy; you customize it with parameters in a similar way.
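Here is a rough sketch of what such a batching configuration can look like; the parameter values are purely illustrative and should be tuned for your own model and hardware, and the file and model paths are assumptions.

```python
# A rough sketch (not the exact slide) of a batching configuration for
# TensorFlow Serving. The parameter values here are purely illustrative;
# tune them for your own model and hardware.
batching_parameters = """
max_batch_size { value: 32 }
batch_timeout_micros { value: 5000 }
num_batch_threads { value: 8 }
max_enqueued_batches { value: 100 }
"""

with open("batching_parameters.txt", "w") as f:
    f.write(batching_parameters)

# The server is then started with batching enabled, along the lines of:
#   tensorflow_model_server --rest_api_port=8501 --model_name=test \
#     --model_base_path=/models/test \
#     --enable_batching=true \
#     --batching_parameters_file=batching_parameters.txt
```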
So next, onto the demos. We'll now see a quick demo of running TensorFlow Serving on Kubernetes, and we'll also see how you can do inference, along with a simple example of auto scaling and getting predictions. Essentially, you can pair TensorFlow Serving up with Kubernetes, get a lot of the benefits of both, and deploy the machine learning model we talked about earlier very easily. Right now, I have already created a Kubernetes cluster. In this case it's on Azure Kubernetes Service, but it could be literally anywhere; you could also try this demo on your local system using something like minikube, or even kind, to make a local Kubernetes cluster. So yes, you can essentially try this anywhere. I created the cluster before recording this demo, since I wanted the demo to be brief and not spend time waiting for the cluster and the deployment to become available. I just have eight nodes in this cluster, and with Kubernetes you already get the benefit of being able to scale very easily, so I'd like to show how you can pair TensorFlow Serving up with Kubernetes and get a lot of those benefits right out of the box. My cluster is already up, so let's come back over here. What I've actually done is create one deployment beforehand, and I'll show you the resources I've deployed.

First up, I have a Kubernetes ConfigMap for our image classifier deployment; it holds the model path, which in this case points to a Google Cloud Storage bucket, though it could be S3 or whatever you want, as we talked about earlier. Next, we have the main Kubernetes Deployment, and the part I particularly want you to see is the TensorFlow Serving part of the spec. You can see we pass the model name, which here is just image-classifier, and the model base path, which is taken from the ConfigMap we just saw, that is, the Google Cloud Storage bucket we defined earlier. It also exposes a couple of ports: the HTTP REST API, which will be running on port 8501, and the gRPC API on port 8500. As we said, with TensorFlow Serving it is particularly easy to expose REST and gRPC endpoints side by side. The container also explicitly requests one CPU and two gigabytes of memory; of course you can change this according to the model size or other factors, but for now I've just hard coded it. We'll also need a load balancer, so there is a very simple LoadBalancer Service which exposes port 8500, our gRPC port, and port 8501, our REST API port. I deployed all three of these into the same Kubernetes cluster, and it doesn't take a lot of time to get them up and ready. Now that they are deployed, let's do a kubectl get deployments: I have the image classifier deployment up and running, and I also have my load balancer service up and running. This is the external IP for the load balancer, which is what we'll be using when we call the REST API or the gRPC API, so I'll just grab it.

We already have the deployment up, so let's actually try running an example and making a prediction. But before we go any further, I also want to talk a bit about the model we currently have. Let's go back up one directory and extract the ResNet archive with tar xvzf. I probably shouldn't have extracted it into the main directory with all the other folders here, which makes it a bit confusing, but let's work with it. We have the variables folder and the saved_model.pb file; there it is. Now let's cd into the resnet directory and do an ls. What we see is the ResNet-101 model; the 101, in simple words, refers to the depth of the model, so it's one way to gauge its size. This ResNet model is trained on the ImageNet dataset.
That's a dataset with an array of labels, and it contains labels for multiple objects; the ImageNet labels here cover 1000 common object classes. That is what this model is trained on. And if you recall, we were talking about the packaged model structure, which is what we will actually be deploying, and this is what sits in the Google Cloud Storage bucket as well: we have a saved_model.pb file, which is our graph definition saved as a protocol buffer, and we have the variables directory, which holds our weights. If I do an ls in there, I have just one weights file and a .index file. This model is only around 110 MB; if it were a bigger model, the weights would be sharded across multiple files. Right now the file name shows shard 0 of 1, but with a bigger model you would see the weights split into several shards. In this simple model we don't have the assets or extra assets directories we talked about; it's just these files.

For the example, we'll be using this image, which is a pretty famous one, even in the machine learning world: the Grace Hopper image. We'll try to run inference on it. I've already prepared the request body, so if I just open it up: this is the API content, and what is all of this? It's just the base64 encoding of the image. These are the instances I want to send, where instances means how many inputs you want to put in; in our case it's just a single image, in base64 format, and that is what makes up the request. Let's actually try running a prediction on this image (a rough sketch of what this request looks like in code follows below). I already have the command up here, so I'll just use it and make a prediction. Okay, that was pretty quick, and it tells me this is a military uniform, which is the model's first guess, and it is about 94% sure. If you remember that image, it was indeed someone in a military uniform, so the model was pretty accurate in this case.
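Here is a rough sketch of the prediction request made in the demo. The load balancer IP placeholder, the local image file name, the model name image-classifier, and the assumption that the serving signature accepts a base64-encoded JPEG through the b64 key are all illustrative; adapt them to your own deployment.

```python
# A rough sketch of the prediction request from the demo. Assumptions:
# EXTERNAL_IP stands in for the load balancer's external IP, the model is
# served under the name "image-classifier", the image is saved locally as
# grace_hopper.jpg, and the serving signature accepts a base64-encoded JPEG
# via the "b64" key (as the ResNet serving example does).
import base64
import json

import requests

EXTERNAL_IP = "<load-balancer-external-ip>"  # from the LoadBalancer service
url = f"http://{EXTERNAL_IP}:8501/v1/models/image-classifier:predict"

with open("grace_hopper.jpg", "rb") as f:
    image_bytes = f.read()

payload = {"instances": [{"b64": base64.b64encode(image_bytes).decode("utf-8")}]}
response = requests.post(url, data=json.dumps(payload))
print(response.json())  # top prediction in the demo: "military uniform"
```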
As another simple example, we can harness the power of Kubernetes further: you can do horizontal pod autoscaling and a lot more customization besides, but I'll just show a simple example, a horizontal pod autoscaler (HPA) for this deployment. All I'm saying in this HPA is that, for the image classifier deployment, if CPU utilization hits the target, it should scale the deployment up, from a minimum of one replica up to a few replicas. That is all this HPA does, and it's just a simple example; there is of course a lot more you can do now that you have TensorFlow Serving deployed on a Kubernetes cluster. It says it already exists; well, yes, I had already deployed it, and if I list the autoscalers I can see the horizontal pod autoscaler is there, so it is actually in play while we are making our requests. And that is just one simple example of the many things you can do now that everything is managed in a Kubernetes cluster.

That was it for the demo. We saw how we can deploy our model to a Kubernetes cluster, which was pretty simple, using TensorFlow Serving from the TensorFlow Extended ecosystem, while also getting the benefits of Kubernetes. So with that, I come to the end of my talk at Open Source Summit Latin America. Thank you so much for listening. Feel free to reach out to me if you have any more questions; you can find me on Twitter at Reshith. That was it for the talk. See you later, and I hope to see you in person at the next Open Source Summit Latin America.