So our next speaker is Shreya Khurana. We actually met at GeoPython 2019 last year, and now we're at a different conference. She will talk about "Train, Serve, Deploy: Story of an NLP Model" with PyTorch, Docker, uWSGI and Nginx. Please start screen sharing. All right. Thanks, Martin. Let me start presenting. Hello, everyone, and welcome to this talk. Today we're going to be talking about how we can get a machine learning model into production, and we're going to be talking about certain technologies like Docker, uWSGI and Nginx. First, a little bit about myself. I am a data scientist at GoDaddy. I've been working with unstructured language data and I've been building models based on deep learning. I am into dancing, hiking, and I've recently gotten into giving talks. As Martin said, I actually participated in last year's GeoPython, and I also gave a talk at this year's PyCon. This is an interest that I've recently developed, and it's why I'm here today. The other interesting thing you need to know about me is that I am a huge meme collector and I really love The Office, and some of that you will be able to see in the coming slides. So when we talk about machine learning, we usually talk about two things: training and testing. Coming from an academic background, we usually see the whole process like this: we get the data, we pre-process and clean it, and we get it to a stage where we can actually start training. We have all of these state-of-the-art models; we try different hyperparameters, train, evaluate on a hold-out set, and then go back and repeat the process. In research and academic settings, that works really well. But about a year ago, when I joined GoDaddy, I realized something: people actually use these machine learning models for something. It's not that you are just going to be using them for your research.
Other people are going to be calling this machine learning model, which means you have to get it to a stage in which you can present it to them. It has to be really secure, it has to be really stable, and it has to be able to handle all of those requests. This is where machine learning production comes into the picture. So about a year back, when I started getting into this, I had all of these terms being thrown at me: Docker; Flask and Django, which are web frameworks; Kubernetes for cluster management; cloud platforms like AWS and GCP; and uWSGI and Nginx for request management. I did not know any of this, which is why this talk is happening right now: I thought it would be a good idea to introduce people who have trained a model before, but who don't know much about how we can actually get a machine learning model into production. For the purpose of this case study, we'll assume that you're familiar with training a model and the whole process. The only thing that we'll be covering is all of these new technologies, like Docker, uWSGI, and Flask. And for this case study, we're going to assume that we'll be training a sequence-to-sequence model. If you're not familiar with it, it's basically just a machine translation system: we give it a sequence of tokens, which is just like a sentence. In this talk, we'll be assuming that we have, or will be training, a model that takes an input sentence in German and translates it to English. All right. So this is the data that we'll be working with: a set of TED talks that have been transcribed in both German and English, covering a variety of talks. It's a relatively small dataset, so it doesn't really need too much training.
It can get to a very decent accuracy level in a few epochs, which is why this was really good for prototyping. The framework with which we've trained this model is fairseq. Again, this is only for the purpose of very quick prototyping. fairseq is a toolkit for building sequence-to-sequence models by Facebook AI Research, and it's built on top of PyTorch. Basically, it contains a set of Python scripts that you can easily run to pre-process and train. Now, there's a lot of documentation available on fairseq, but just to introduce you to the few steps that we need to do before we actually get to production level: in the pre-processing step, you learn a set of merge operations. We're going to be using BPE, which is byte pair encoding. If you've ever worked with NLP or any set of language models, you know what BPE is. For people who are not familiar with it, it's a set of operations that you learn so that you know that certain sequences are much more likely to occur than others. So it's a way for us to figure out the patterns, or the subword sequences in the vocabulary, that have a high likelihood of occurrence. Once we have our training set, which is both our German and English data, we learn which operations are most common in that data through a script called learn_bpe.py. Then we have training, validation, and test sets, so we just apply all of these learned operations to them and store the results in files. So that's where we are with pre-processing. Now, fairseq also gives us a really good one-line command through which you can actually just start training. The machine learning model is usually a sequence-to-sequence model, or it could be another model built on a different framework.
Because we're working with sequence-to-sequence here, you can give it the architecture, the dictionary files that you're loading, the optimizer, the learning rate, whether you want dropout or not, and the batch size. The batch size can be given in terms of number of sentences or number of tokens, which is the number of subwords. And how many epochs you want to train it for. Now, there are many more hyperparameters that you could experiment with, but this is just a subset of them, because we want to train a model really quickly. Okay, so now that we have our model, let us see if it can actually predict to a decent level. For this we have a command called fairseq-interactive. Again, we give it a path to load from, so this is the checkpoint or the model you've trained, what pre-processing technique you're using, and what the beam size is. The beam size controls how you're predicting at each step. Let's say you were predicting a word after the previous one: how many candidates should you consider when searching for the best one? So it is sort of defining how big your search space is. You can give it all of these parameters, and if it works correctly, it will print all of them out in a namespace. It will tell you how many words there are in each of your dictionaries, in German and in English. Then this is a utility in which you can actually type your sentence and get the translations. So remember, this is a machine learning model that's taking an input sentence in German and translating it into English. Until now we have a CLI command that can do that, and a model that can do that. But as I mentioned before, this is okay if you're working on it alone, if you're the only one, or if you're working with a team that has access to the CLI. In practice, this is not the case.
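Put together, the workflow described so far can be sketched as a handful of commands. This is only an illustration: file names, paths, and hyperparameter values are assumptions, not the speaker's exact setup (learn_bpe.py and apply_bpe.py come from the subword-nmt package that fairseq's translation examples use).

```shell
# 1. Learn BPE merge operations on the combined German/English training data.
python learn_bpe.py -s 10000 < train.de-en > bpecodes

# 2. Apply the learned operations to the train/valid/test splits.
for split in train valid test; do
    python apply_bpe.py -c bpecodes < $split.de > $split.bpe.de
    python apply_bpe.py -c bpecodes < $split.en > $split.bpe.en
done

# 3. Binarize the data and build the source/target dictionaries.
fairseq-preprocess --source-lang de --target-lang en \
    --trainpref train.bpe --validpref valid.bpe --testpref test.bpe \
    --destdir data-bin/

# 4. Train a small transformer for a few epochs.
fairseq-train data-bin/ --arch transformer \
    --optimizer adam --lr 5e-4 --dropout 0.3 \
    --max-tokens 4000 --max-epoch 10 \
    --save-dir checkpoints/

# 5. Try it out interactively: type a German sentence, get English back.
fairseq-interactive data-bin/ \
    --path checkpoints/checkpoint_best.pt \
    --bpe subword_nmt --bpe-codes bpecodes --beam 5
```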
In practice, we have a lot of requests coming in from other people; if you're working in industry, you have all of your customers hitting this API, which is calling the model. An API is an application programming interface, and we'll be using Flask for this. Flask is a web application framework written in Python, and it actually started as a very humble code base: it started as an April Fool's prank and then blew up to be one of the most widely used Python-based web application frameworks. The reason it has gotten to this point is that it's fairly easy to use and it ships with a development server. Almost everyone who starts making apps out of their machine learning models, or of anything they've built in Python, starts with Flask, and that is what we're going to be doing as well. With this framework, we already have a model; we're just going to be loading it and then serving predictions through certain API endpoints. To do that, we need to do a few things first. We need to load the Flask modules that we think will be important: these are all functions that will help us create the response of this app in a way that can be understood over HTTP. We assume for our purposes that HTTP is the protocol we'll be using, so basically we need to make sure that we can create a response that is in JSON. There is a fairseq class called TransformerModel which will help us load the pre-trained model. What we do is load the checkpoint checkpoint_best.pt into memory, and give it the dictionary path, what pre-processing technique to use, German as the source language, English as the target language, what the beam size is, and whether you want to use the CPU or not. So now we have a machine learning model loaded into memory. Next, we have to be very specific about the way we're going to be calling this model.
Remember, earlier we were working with the CLI, but to work in a much more efficient way we will define an endpoint to which the model will respond. To do that, we simply define the endpoint: say we want it to respond on translate, so whenever someone hits the endpoint and it has the keyword translate at the end, we want the model to predict. Once we define the endpoint, we define the function that will actually do that. Here we just start a timer, so if you want to see how much time it's taking for each request, you can do that. Then we get the query: request is a module that helps us get the parameters from the HTTP request. The HTTP request comes from some other person who wants to get translations from your model, and that person will probably append something like q= to the URL, which means they are trying to get a translation of that particular sentence. If there's no query, you raise a bad request, which means it's empty. Then there's a very simple function called translate: you give it the query parameter, and you have your translation. The rest that is left is to parse our result into JSON, make it into something the protocol can identify, and return it. It's a very simple function, and the way to run this app is very simple as well: you just do app.run, you give it the host IP that you're trying to host it on, so if you're running on localhost you can do that, and a specific port on which this app is going to run. Okay, so now what do we have? We have this really good Flask server that can load the model into memory and make predictions, and a model that is able to do that. The only thing is that our Flask server is a development server. It's not a production-grade server: it's not as stable, not as efficient, and not as secure, and these are all the things we want in an HTTP server.
So what do we do? We use this new thing called uWSGI. uWSGI will help us make this Flask app much more secure and much more stable. The way we do that is to wrap our app in a wsgi.py file. So now we have a Python file in which you just import the app, give it the name application, and then run it. But remember, this is a production-grade server, which means you can do all of the scalability things you might not have been able to do with Flask. To actually do that, we have a configuration file called uwsgi.ini, and we'll be giving it certain parameters. All of these config files are used to pass certain arguments to the uWSGI server, telling it how to run. First we load the module: in the wsgi.py file you wanted to run this application, so you give it that name. How many requests do you want to listen to, that is, how many HTTP requests can sit in the queue at one point in time? Whether you want to disable logging: disabling it is not suggested the first time you do this, because you want to log everything and see what happens in case some error arises. The file you want to log to is just a file path. And lazy-apps. The way uWSGI works is a master-and-worker framework. What I mean by that is, if your app is being loaded by uWSGI, it can either be loaded in the master first, and the master can then order it to be loaded in the workers, or, with lazy-apps, it can be loaded in each worker itself without the master needing to initialize it first. That obviously depends on how much memory you have. If you have a lot of memory and a lot of computational power, so that you can load the app in each of the workers, you enable the master process and set the number of workers. You can also make it a multi-threaded application.
This again depends on how well your code is written; obviously you'll have to account for deadlocks in case one of the threads gets delayed. Then the buffer size: right now we're working with HTTP requests, and all of these requests have headers and a certain size, so what is the maximum size that a request can take? Then a socket. A socket is useful because we are running multiple programs on one machine, and the way for two programs to interact is through a socket. This socket is going to be used by uWSGI and Nginx, which we'll be covering in the next slide. For now, just think of it as a temporary file for uWSGI and Nginx to interact with each other. What permissions do we want to give the socket? Whether we want to enable threads? And whether we want logging to happen in a separate thread or not. Now, there are a lot more configuration options available, but these are just a subset of them, because right now it's a very simple application. All right. So now we have this secure, stable server with us that is serving the model, and we know the model can make predictions. The only thing that is left: let's say you are based out of California, but you have a thousand, or a million, other requests coming in from all over the place. Those requests are not going to come one at a time; sometimes they're going to come at the same time, and they'll carry a specific load with them. What I mean by that is that in production we often talk about QPS, queries per second. You have to make sure that your HTTP requests are being routed properly, in a very efficient manner, to your server. And that is what Nginx does. Nginx will route all of these requests to your server, and we can tell Nginx how to do that, again by way of a config file.
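As a concrete sketch: the wsgi.py wrapper can be as small as `from app import app as application`, and a uwsgi.ini along the lines described above might look like the following. All values here are illustrative assumptions, not the speaker's exact configuration.

```ini
[uwsgi]
; load the application object from wsgi.py
module = wsgi:application

; how many pending HTTP requests may sit in the queue
listen = 100

; keep logging on while debugging; write it to a file
disable-logging = false
logto = /var/log/uwsgi/app.log

; load the app in each worker instead of in the master first
lazy-apps = true

master = true
processes = 3
threads = 1

; maximum size of a request (header) in bytes
buffer-size = 32768

; Unix socket shared with Nginx, plus its permissions
socket = /tmp/uwsgi.sock
chmod-socket = 664

enable-threads = true
threaded-logger = true
```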
Nginx actually comes with its own config file, and we just modify it to suit our case. This is nginx.conf. You give it a file path on line one where you want to log the errors to. Because we're working with HTTP requests, we tell it to use the http block. Then, if requests are coming in, how and where do you want to log them? And because right now we assume that Nginx is going to reside on localhost, the server name is localhost, and you're listening on a specific port. This port is different from the one that we used in Flask, and this port, 8002, is actually the one we're going to be listening on when we create our Docker container out of this. Okay, so now Nginx knows that it has to listen to certain requests coming in on this server and this port, but which endpoints does it actually listen on? That you can define through these specific location blocks. For all of the paths arising from the home path, it knows that it has to use the uwsgi parameters. And again, remember how we talked about the socket? Nginx and uWSGI will interact through this Unix socket, uwsgi.sock, which is a temporary file. The reason we use a Unix socket is that uWSGI and Nginx are on the same machine, so they are going to interact with each other very fast, and a Unix socket allows us to do that. So that's it for Nginx. What we have till now is Nginx routing all of the HTTP requests, and a very stable and secure server that is serving our model. The only thing that we haven't done is look at the big picture. So let's assume that all of them are running on the same computer.
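An nginx.conf along these lines might look like the following sketch. The port, log paths, and socket location are assumptions chosen to match the uwsgi socket described earlier, not the speaker's exact file.

```nginx
error_log /var/log/nginx/error.log;

events {
    worker_connections 1024;
}

http {
    access_log /var/log/nginx/access.log;

    server {
        server_name localhost;
        listen 8002;

        # forward everything under / to the uWSGI server
        # over the shared Unix socket
        location / {
            include uwsgi_params;
            uwsgi_pass unix:/tmp/uwsgi.sock;
        }
    }
}
```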
We still haven't looked at the big picture: let's say one of your processes gets killed, or the machine runs out of memory. What happens in that case? How does uWSGI interact with Nginx, which goes first, which gets killed first, which gets started first, and all of these things that come up when you're working with a system that is in production? The way we're going to handle that is through Supervisor. Again, with Supervisor we have a config file in which we can define certain programs. At the top of the file, you write supervisord and nodaemon, which means it's not going to run in the background but in the foreground. Which file do we want to write the Supervisor logs to? Then there's a program called uwsgi: what command should we run it with, what is the stop signal, and how long should it wait before actually stopping that program? Then priority. Supervisor is a process management system, and it's managing all of these programs under it, so what is the priority of each program in that list? The lower the number, like the three over here, the higher the priority: it gets started first, it gets shut down last, that kind of thing. Then, what is the log file for anything you're printing on stdout, and what is its max size? Here I've set it to zero, which means just take the maximum that Supervisor allows. Similarly we start nginx: there is a program command that we're going to start it with, and the same settings that we used with uwsgi. All right. So Supervisor is going to manage uWSGI and Nginx on its own. And with all of the things that we've done before, you will be able to run it on your machine.
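A supervisord.conf matching that description might look like this sketch; commands, paths, signals, and priorities are illustrative assumptions.

```ini
[supervisord]
nodaemon = true                 ; run in the foreground, not as a daemon
logfile = /var/log/supervisord.log

[program:uwsgi]
command = uwsgi --ini /app/uwsgi.ini
stopsignal = QUIT
stopwaitsecs = 10               ; grace period before a hard stop
priority = 3                    ; lower number: started first, stopped last
stdout_logfile = /var/log/uwsgi.stdout.log
stdout_logfile_maxbytes = 0     ; 0 disables the rotation size limit

[program:nginx]
command = nginx -g "daemon off;"
priority = 4
stdout_logfile = /var/log/nginx.stdout.log
stdout_logfile_maxbytes = 0
```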
The only thing is, when we actually run this somewhere, we don't run it on our local machine; we either go to the cloud, like GCP or AWS, and run it on a VM there, or we use some on-prem machines. Whatever the case is, these are very dependency-heavy systems, right? Remember, we installed all of these programs, all of the Python libraries, uWSGI and Nginx. So what we want to do now is follow all of this, but in a very isolated way. We want to create a snapshot of whatever we've done, and then just load that snapshot onto some machine where we don't have to do anything and it just starts running. That is what Docker helps us do. It's virtualization software, and it will help us create these containers on our own. With Docker, we just write a very simple Dockerfile. Here we have the OS that we're loading: we're using Ubuntu, the base image from which we create this snapshot. Just like with any of your own machines, any time you're starting a new project you install certain programs; here we install supervisor, nginx, vim, git, g++, curl, and zip. These are basically all the utilities you would set up for a system from scratch. Then we define certain environment variables. Then, just like for any Python project, we have dependencies and we install them, copy everything to the working directory, and make the parent directories. Remember the Unix socket we talked about? We create that file and give it certain permissions. Then we set the working directory, so that whenever the container starts, you know which directory it will be in. And then we have an entrypoint, which contains the commands that we want to run.
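A Dockerfile along the lines described above might look like this. The base image version, directory layout, and file names are illustrative assumptions.

```dockerfile
# Illustrative sketch of the Dockerfile described in the talk.
FROM ubuntu:18.04

# utilities you would set up for a system from scratch
RUN apt-get update && apt-get install -y \
    supervisor nginx vim git g++ curl zip \
    python3 python3-pip

ENV LANG=C.UTF-8

# install Python dependencies, then copy the project in
WORKDIR /app
COPY requirements.txt .
RUN pip3 install -r requirements.txt
COPY . .

# the Unix socket shared by uWSGI and Nginx, with its permissions
RUN touch /tmp/uwsgi.sock && chmod 664 /tmp/uwsgi.sock

COPY nginx.conf /etc/nginx/nginx.conf

# the port Nginx listens on inside the container
EXPOSE 8002

# Supervisor starts uWSGI and Nginx on its own
ENTRYPOINT ["supervisord", "-c", "/app/supervisord.conf"]
```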
For our case, the entrypoint is just going to be Supervisor, because it will run uWSGI and Nginx on its own. And remember, this port is the one that Nginx was listening on, so we expose that port here. Now we've created a Dockerfile; building the image is fairly simple: you just do a docker build, you give it a name, and it copies everything from the working directory. docker run then helps us run this image: you give it a port mapping and a name. Once you see all of these logs — Supervisor is running, fairseq is loading the model, uWSGI and Nginx are working — it means that your model is ready, or at least the programs that are required for it are ready. The way to actually check it is through a very simple curl command: remember we defined this translate endpoint, so you just give it a particular query and you get the result. Now, how do we check if something's not working correctly? You just do docker logs, and that will give you the logs from when the Docker container was starting. And if you sometimes want to enter the container and see what's happening inside, to check the log files, you can do that. Certain other good practices are unit tests — we haven't written any for this, because this was a very simple application, but generally for each piece of code we write unit tests just so we know where we're going wrong — and caching: the queries hitting your model are not always going to be unique, which means you can save some computation power and time by storing those earlier results. So that is something to definitely look out for. All right. And with that, I'm done with my talk. This is the Discord channel, and I'll also post a few other links that might be useful in that channel. So I'm open for questions. Thank you very much. That was a nice talk.
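The build, run, and smoke-test steps just described can be sketched as follows; the image name, ports, and query are assumptions.

```shell
# build the image from the Dockerfile in the current directory
docker build -t nmt-app .

# run it, mapping host port 8002 to the container's Nginx port
docker run -d -p 8002:8002 --name nmt-app nmt-app

# smoke-test the translate endpoint with a German query
curl "http://localhost:8002/translate?q=hallo%20welt"

# check the container's startup logs if something looks wrong
docker logs nmt-app

# drop into the container to inspect the log files directly
docker exec -it nmt-app /bin/bash
```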
Are there any questions? Please use the Q&A button. Actually, you can already write questions during the talk because we are online; don't forget about that. So you have to type there. Okay. So if there are no questions yet, I do have a question, unfortunately a little bit of a technical one. I didn't quite understand: in uWSGI, you use the configuration with the three processes and one thread. Does that mean you have your model loaded three times in total? And then in Nginx, you have these worker connections, over 8,000. So if 8,000 people connect, is this distributed to the three processes? I really don't quite get it. Oh, okay. So the number of threads is the number of threads per process. Essentially there are going to be three workers, and the three workers will have the same app loaded. That actually helps us because if someone calls the model at three QPS — three queries per second — they can all be translated at once. That is the point of workers. The number of threads is different: it's the number of threads in each process, so in each copy of your app that is loaded and translating, how many different threads exist. But Nginx is another thing: that 8002 is the port, not the number of connections; it's the port it will be listening on for all the HTTP requests. Okay, but does this still mean that in the worst case your model has to be reloaded every time there is a request, or did I get something wrong? No, no. Your model will not be reloaded every time. The model is loaded just once, and then you can serve it multiple times. That is the whole advantage of this system. Okay, great. So that's the most important part. I haven't yet used uWSGI, but I will certainly do that to solve exactly that problem. Nice. So are there any questions? I see in the Discord chat there are also no questions yet. You can go to the Discord; you'll see the channel online-talk-nlp-model.
You can use Ctrl or Cmd+K and enter NLP and you will find the channel. We are unfortunately out of time. So thank you very much again, Shreya, for your wonderful talk.