Hello, people. My name is Sahil Dua, and today I'm going to talk about a unique combination of two technical areas. One is deep learning, which is a special branch of machine learning. And then I'm going to talk about containers, something that almost all the cool kids these days have been talking about, including me. So I'm going to talk about how we productionize our deep learning models to serve real-time predictions at a large scale using containers and Kubernetes. I hope I covered all the buzzwords.

Before I begin, I would like to get an idea about the audience. Can I see the hands of people who have used deep learning? OK. And those who know what deep learning is? OK, quite a lot, that's nice. And on the other side, those who have built apps that run in containers? OK, nice.

So here is what this talk is going to be about. I'm going to begin with some of the applications of deep learning that we saw at Booking.com. Then I'm going to talk about the life cycle of a model from a data scientist's point of view: what the different phases of a model are. And then I'm going to talk about the deep learning production pipeline that we built using containers. So get excited, let's start.

First of all, who am I? I'm a backend developer working at Booking.com in the deep learning infrastructure team, which is why I'm talking about this topic here. I'm also a machine learning enthusiast, which means I spend almost all of my free geeky time learning about machine learning and the new techniques coming up. And I'm a big open source fan. I have contributed to a bunch of projects, like Git, which most of you probably use in your day-to-day life; I have submitted a bunch of patches there. I'm also a contributor to the pandas library, which is a Python library for data analysis; the Kinto project, which is a storage service by Mozilla; go-github, which is a Go client library for the GitHub API maintained by Google; and a bunch of other projects. And I'm a tech speaker, which is why I'm here on this stage today in front of all of you. Generally I talk about topics ranging from data analysis and deep learning to A/B testing and containers. So, yes, that's all about me.

Let's begin with the applications of deep learning at Booking.com. But before that, I would like to talk about the scale, because I said "at large scale" and I should explain what that scale means. We have over 1.2 million room nights booked every 24 hours, across more than 1.3 million properties in more than 220 countries. Let me make one thing clear: I'm not here to show off the numbers. The point I'm making is that at this scale we have access to a huge amount of data that we can utilize to improve the customer experience. So let's see how we do that.

The first application of deep learning that we saw was image tagging. When we look at an image, for example this one, the question that arises is: what do we see in this image? Now, this question is both really easy and really difficult. If you ask it to a person, to someone from this audience, it's easy to say what is in the image, what objects are there. But it's not an easy problem when you ask a machine to tell you what is in the image. And what makes this problem even harder is that context matters: what you see in this image may not be what I want the machine to detect.
For example, if we pass this image through some publicly available pre-trained neural networks (for instance, networks trained on ImageNet, such as DenseNet), these are the results we get. We get information like: there is an ocean view, there is nature, there is a building. OK, that's good to know, there's a building. But what do we do with it? These are not the things we care about at Booking. However, these are the things we care about: is there a sea view from this particular room? Is there a balcony? Is there a bed? Is there a chair? Things like that.

Now, before you start thinking, "OK, it's an easy problem, you just need to detect a bunch of objects," no. The first challenge is that it's not an image classification problem, it's an image tagging problem. That means every image may have multiple tags. It's not just that we classify an image as "sea view" or "balcony"; every image can carry multiple tags at once. And on top of that, there is a hierarchy of tags. For example, if we see that an image has a bed in it, we are almost sure that the image is an inside view of a room, unless it's some unusual place where the bed isn't in a room at all. So once we know what is in an image, we can use this information to help the customer decide what they want. We can help them decide what kind of property, what kind of hotel or apartment they want to book, if we have this information about all the images of a particular hotel.

Let's talk about another problem. This is a classic recommendation problem: we have a user X who booked a hotel Y. Now we have a new user Z (there's a debate about whether it's pronounced "zee" or "zed," but I'm going to stay with "zee"). We want to predict what hotel they are going to like; we basically want to recommend a hotel that they are going to like. So the objective is to find the probability of a particular user Z booking a particular hotel. We have certain user features, like the country or language of the user, where they are coming from. We have contextual features, like when they are looking: what's the day, what's the season? And then we have some information about the hotel, the item features: what's the price, what's the location, is it near the beach, does it have a swimming pool, things like that. I'm not going to go into the details of this problem, because I'm here to talk about the infrastructure part. Research has shown that we can get better results for such a recommendation problem if we use deep neural networks instead of linear collaborative filtering.

So once we figured out that there were going to be certain applications of deep learning that could be really interesting for us as a business, we started exploring this field. Thanks to my colleagues Stas and Imra, who are probably enjoying the rainy season back in Amsterdam, the exploration of deep learning at Booking got started, and these days we have some very cool deep learning models in production.
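To make that recommendation example a bit more concrete, here is a hedged sketch of what such a deep model could look like. The feature names, vocabulary sizes, and layer sizes are invented for illustration and are not Booking.com's actual model.

```python
# Illustrative only: a small deep recommendation model that combines user,
# context, and item features to predict a booking probability.
import tensorflow as tf
from tensorflow.keras import layers

# Categorical ids plus one numeric feature (all names and sizes are made up).
user_country = layers.Input(shape=(1,), dtype="int32", name="user_country")
hotel_id = layers.Input(shape=(1,), dtype="int32", name="hotel_id")
price = layers.Input(shape=(1,), dtype="float32", name="price")

# Learn embeddings for the categorical features, then combine everything.
country_emb = layers.Flatten()(layers.Embedding(250, 8)(user_country))
hotel_emb = layers.Flatten()(layers.Embedding(2_000_000, 32)(hotel_id))
x = layers.Concatenate()([country_emb, hotel_emb, price])

# A small stack of dense layers instead of a purely linear model.
x = layers.Dense(128, activation="relu")(x)
x = layers.Dense(64, activation="relu")(x)
booking_prob = layers.Dense(1, activation="sigmoid", name="p_booking")(x)

model = tf.keras.Model([user_country, hotel_id, price], booking_prob)
model.compile(optimizer="adam", loss="binary_crossentropy")
```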
So let's start talking about the life cycle of a model. What are the different phases? There are three phases: code, train, and deploy.

In the first phase, data scientists write the code for their model. They try out different features, different kinds of embeddings, different interactions between features, and different architectures. Once they are happy with the performance of the model, they want to train it on the production data. And once they are done with the training on production data, they want to deploy it so that it can be used by multiple clients. We use TensorFlow, which is a nice machine learning framework by Google. We use its high-level Python API to write our models, because it standardizes the process of writing a model and helps us get to prototypes quickly.

Now, the last two parts, training on production data and deploying, are the two parts that constitute our deep learning production pipeline, the one I've been talking about since the beginning of this talk. You may argue: why is training a model part of the production pipeline? That's a very valid question. You can also train your models on a laptop, right? But if you try to train your models on your laptop, you may end up looking like this, not really happy. There are a couple of reasons. One is that sometimes your training data is so large that you can't hold all of it in memory on your laptop. The other, more general reason is that your laptop has limited resources, like CPU cores and GPUs. In most cases it's not going to have powerful GPUs to speed up the training. So if you want to speed up the training, you should not train on your laptop. We'll see how we do that in production.

We have big servers with a lot of CPU cores and powerful GPUs. So what we do is take our training scripts and run them on those big servers. Problem solved, right? Not quite. There is a limitation with that: there are going to be multiple data scientists running their training at the same time, possibly on the same servers, so we might not be able to provide the independent environment each of them would like to have. Plus, if someone wants to use a different version of TensorFlow, for example, we need some sort of isolated environment for that particular training run, so that it defines what it wants to use and we don't need global, machine-level dependencies and package versions.

So this is what we do: we wrap our training script in a container. This yellow thing, let's consider that a container; that's the best I could draw, I'm not a designer. We wrap our training script in a container, and then we run that container on our servers. Now, what is a container? A container is a lightweight package of software that contains all the dependencies it needs to run. That enables us to package up all the particular versions we want for our training and ship it as a container, which solves the problem of having multiple different versions of multiple dependencies. These containers can also utilize whatever GPU support we have on our servers.

So, in a nutshell, this is how it looks. We have our production data in Hadoop storage that we want to use to train our model, and we spin up a new container every time we want to run a training.
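As a rough sketch of the kind of job that might run inside such a training container: the HDFS paths, the toy one-feature model, and the data schema below are placeholders I made up, not the actual Booking.com pipeline, and reading hdfs:// paths assumes a TensorFlow build with HDFS support.

```python
# Hedged sketch of a containerized training job: read training data from
# Hadoop storage, train a model, and write the checkpoints back.
import tensorflow as tf

DATA_GLOB = "hdfs://namenode/path/to/training/data/*.tfrecord"      # placeholder path
CHECKPOINT_PATH = "hdfs://namenode/path/to/checkpoints/model.ckpt"  # placeholder path

def parse_example(raw):
    # Placeholder schema: one numeric feature and a binary "booked" label.
    parsed = tf.io.parse_single_example(
        raw,
        {"price": tf.io.FixedLenFeature([], tf.float32),
         "booked": tf.io.FixedLenFeature([], tf.int64)})
    return tf.reshape(parsed["price"], [1]), parsed["booked"]

dataset = (tf.data.TFRecordDataset(tf.io.gfile.glob(DATA_GLOB))
           .map(parse_example)
           .shuffle(10_000)
           .batch(256))

# Toy model definition; in practice this would be the real model code.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.compile(optimizer="adam", loss="binary_crossentropy")
model.fit(dataset, epochs=5)

# Persist the trained weights back to Hadoop storage so a serving container
# can load them later; after this, the training container simply exits.
model.save_weights(CHECKPOINT_PATH)
```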
So that container has all the dependencies it needs and the training script, and it takes the data from Hadoop storage. Once it has the data, it runs the training. Now, while the model is being trained, we want to make sure we can save the trained model so that we can use it later and deploy it somewhere. So we save the model checkpoints back to Hadoop storage. Model checkpoints are essentially the model weights that are required to load the model again at the deployment stage. Once we save the model checkpoints, the container dies. Yes, who can be more selfless than a container? It is born to do the work you assigned, and then it dies. That's the entire life of a container.

So once we have trained our model on production data, we have already covered one step of going into production. The next step is to take this trained model and put it in production so that different clients can use it to get predictions back. This is what we do: we have a simple Python app, which is a WSGI HTTP server. It takes the model checkpoints that we just stored in the previous step from Hadoop storage, loads the model into memory, and is ready to serve predictions.

Let me cover that again. To be able to serve predictions from a model, we need two things. We need the model definition, which defines the features, the interactions, and all that. And we need the model weights that we got from the training. Once we combine both of these, we load the model in memory as a TensorFlow model, and then we are able to serve requests. On top of this, we have a nice way to abstract all of it and provide a URL where a client can send an HTTP GET request and get the predictions back. In the end, it's as simple as sending a GET request and getting the predictions back.

So this is what it looks like. We take the app I just mentioned and wrap it in a container once again, because containers solve the classic "it runs on my machine but doesn't run on yours" problem: the container holds all the dependencies, and if it runs on one machine, it's going to run on every machine that supports containers. Once we wrap the app in a container, any client can send an HTTP request with all the input features required for a prediction and get the prediction back in the response.

But again, as I mentioned, things are not quite so simple at a large scale. What we need is multiple servers. Easy solution: we had one app and we were not able to serve all the requests properly, so we run multiple copies. We replicate the same container, put the replicas behind a load balancer, and the client doesn't know how many apps are actually serving behind it. It just knows the IP of the load balancer, and the load balancer is responsible for routing all the requests. Quite simple. But things are generally not simple at large scale, so instead of the six servers in this picture, we need quite a lot of them. And when I say servers, I really mean these containers.
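Each of those containers runs the same prediction app; to make it concrete, here is a minimal sketch of what such an app could look like. Flask is just one possible choice of WSGI framework, the checkpoint path is a placeholder, and the toy one-feature model matches the training sketch above rather than any real Booking.com model.

```python
# Hedged sketch of a prediction service: load the trained weights once at
# startup, keep the model in memory, and answer HTTP GET requests.
import numpy as np
import tensorflow as tf
from flask import Flask, jsonify, request

CHECKPOINT_PATH = "hdfs://namenode/path/to/checkpoints/model.ckpt"  # placeholder path

# Model definition + trained weights = a servable in-memory model.
model = tf.keras.Sequential([
    tf.keras.layers.Dense(32, activation="relu", input_shape=(1,)),
    tf.keras.layers.Dense(1, activation="sigmoid"),
])
model.load_weights(CHECKPOINT_PATH)

app = Flask(__name__)

@app.route("/predict", methods=["GET"])
def predict():
    # Input features arrive as query parameters, e.g. /predict?price=120.0
    price = float(request.args["price"])
    score = float(model.predict(np.array([[price]]), verbose=0)[0][0])
    return jsonify({"booking_probability": score})

if __name__ == "__main__":
    # Inside the container this would typically sit behind a proper WSGI server.
    app.run(host="0.0.0.0", port=8080)
```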
So as we keep increasing the number of containers we have for a particular application, we need to find a way to manage them properly, because once you have hundreds of containers running a particular app, a particular model, things are going to get messy. So we use Kubernetes to orchestrate all of this infrastructure. Kubernetes is a container orchestration platform that helps in scheduling, managing, and scaling containers easily. It provides a really easy way to scale the number of containers up or down based on whatever criteria we specify. It also provides an easy way to make sure that at any point we have a specific number of containers running: if something goes wrong and some of the containers die, Kubernetes is going to spin up new containers to maintain the number you specified.

So that was about putting things in production. But once you put things in production, a lot of responsibility comes with it. You want to be able to measure the performance of these models on your production servers. You want to be able to answer questions like: how is this model behaving? What is the latency versus the throughput of the model? Let's say your model takes some computation time to give the predictions back. That is not the time your client is going to see; there is going to be some request overhead on top. So in general, the total prediction time is the sum of the request overhead and the computation time. And if you send more than one instance to predict on in a single request, say n instances, it's going to be roughly n times the computation time your model needs for one prediction, plus one request overhead. We can see from this that for some of the simpler models, like linear regression, logistic regression, or simple classifiers with fewer features, the request overhead is going to be the bottleneck, because the computation time is tiny compared to the request overhead. So you need to be conscious of what kind of model you have and what the bottleneck is going to be in your particular case.

Once you have this information about your model, you want to optimize it, and you can optimize for one of two things: latency or throughput. Let's talk about them one by one. First, optimizing your serving for latency. What is latency? Latency is the amount of time taken by your server to serve one request, from the client's point of view. When do you want to optimize for it? When you have an application where you want to show some information to the user in real time, you definitely want to serve the response as soon as possible. The first optimization you can do is: do not predict in real time if you can pre-compute. That may sound silly, but in a lot of cases you already know the entire feature space you have, and instead of predicting in real time, you can predict for all those cases in advance, save the results in a lookup table, and just use that lookup table when an actual prediction request comes in.
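A minimal sketch of that pre-compute idea, assuming a feature space small enough to enumerate; the feature values and the stand-in predict function below are made up:

```python
# Offline: run the model once for every feature combination and store the results.
# Online: a "prediction" is then just a dictionary lookup, with no model call at all.
import itertools

countries = ["nl", "de", "fr"]                      # illustrative feature values
seasons = ["winter", "spring", "summer", "autumn"]  # illustrative feature values

def predict_fn(country, season):
    # Stand-in for a real model call; offline this would hit the trained model.
    return 0.1 if season == "winter" else 0.3

lookup_table = {
    (country, season): predict_fn(country, season)
    for country, season in itertools.product(countries, seasons)
}

def serve(country, season):
    # Serving path: a constant-time lookup instead of a real-time prediction.
    return lookup_table[(country, season)]

print(serve("nl", "summer"))  # 0.3
```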
The next way is to reduce the request overhead. For simpler models, when you know that the request overhead is the main bottleneck for serving predictions, you can optimize to have less of it. One thing you can do is embed your model right into the application that is serving, and that's what we do: we load the model right into the memory of the app, so there is no extra hop from the HTTP request to the model. The model is right there in memory; once the request comes in, it's served right there.

The next thing is to predict for one instance at a time. Let's say on a web page you want predictions for three different instances. If you want to show the information as soon as possible, instead of batching them, you should send three requests with one prediction each.

Then there is a specific technique for speeding up prediction: quantization. That means you change your 32-bit float types to 8-bit fixed-point types. How is that going to help? Your CPU can now hold four times more data in its registers and process it faster, which means you can compute the predictions faster, up to four times faster.

And then there are some TensorFlow-specific things, which really depend on which library you use. If you use TensorFlow, there are techniques like freezing the network, which means taking the TensorFlow variables in the computation graph and converting them to constants, so there is less overhead at inference time. And the next thing is optimizing for inference, which means removing all the unwanted nodes from the computation graph so that we don't compute unnecessary things we don't need in order to predict.

Now the second case is optimizing for throughput. What is throughput? Throughput is the amount of work done per unit of time. When do you want to optimize for throughput? Let's say you have some cron job or Oozie workflow, or some offline work that needs to be done every night, every hour, or every ten minutes. There you don't really care about when a particular request arrives or when it gets its response. What you care about is how much time it takes to do the whole batch of work. That's when you want to optimize for throughput: you just care about how much work is done in a given time, not how long any individual prediction takes.

Again, the first thing you can do is: don't predict in real time. I'm not trying to enforce this, but it's something you should always consider. Is it possible to pre-compute and serve requests from a lookup table? If so, always do that; it's going to be much faster.

Next, batch your requests. Once you know you have millions of rows of work to do, it's better to batch your input instead of sending one request per prediction: put thousands of instances in one request, send the request once, and get the response back. By doing this you reduce the request overhead, because if you send, for example, 10,000 predictions in one request, you get rid of all the request overhead that would otherwise have been there.
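A hedged sketch of that batching idea from the client side; the URL and payload format are invented for the example, since the talk does not show the actual service API:

```python
# One request per instance pays the HTTP overhead thousands of times; a single
# batched request pays it once, and only the computation scales with the batch.
import requests

instances = [{"price": float(p)} for p in range(10_000)]

# Unbatched (slow for offline work): one request per instance.
# for instance in instances:
#     requests.get("http://prediction-service/predict", params=instance)

# Batched: all 10,000 instances in a single request.
response = requests.post("http://prediction-service/predict_batch",
                         json={"instances": instances})
predictions = response.json()["predictions"]
print(len(predictions))
```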
This one is really interesting: parallelize your requests. When you want a lot of work done, instead of waiting for each request to get its response back, do the work asynchronously. You send the requests in parallel using multiple workers and use some sort of callback mechanism to make sure you collect the results, so you get them back faster.

So let me try to summarize what I talked about. First, the training of models in containers: we spin up a container every time we want to train, get the data from Hadoop storage, and once the training is done, store the model weights back into Hadoop storage. The next step is to serve those model weights in production, in containers: we spin up containers running the Python app, which loads the model definition as well as the model weights, and once we have that model in memory, it's ready to serve any HTTP GET request. And finally, we optimize our serving for the particular application: it may be a real-time application where we want to optimize for latency, or an offline application where we want to optimize for throughput.

That's all I have. If you want to get in touch with me, you can find me on all of these social media networks; I go by the username sahildua2305. And yeah, thank you.

Thank you, that was really interesting. There are loads of questions. Lost my volume, though. This is an area that I'm actively working on as well, so I feel like I could just ask questions all day, but I'll try to keep to the questions that everyone else has asked. The question was around how these containers can make use of the GPUs natively. I know TensorFlow can let you do that sort of thing through Docker. Where is the tech with that at the moment? Can all these libraries talk directly to the GPUs, or do you start to get limited when you're using containers?

Yeah, so as I mentioned, we use Kubernetes to manage our containers. When we define the configuration for a particular container in Kubernetes, we are able to define which resources it is going to use from the host machine, because every container in the end is going to land on some host machine. We have the flexibility to specify which resources we want to use, and we can specify that we want a particular GPU; Kubernetes will then make that GPU available to the container.

Nice, and is that just through TensorFlow? Do other libraries support that as well?

Sorry?

Is that just TensorFlow for the GPU connection, or do things like Keras support it as well?

No, it has nothing to do with TensorFlow as such. Once you have the GPU available in a container and you run TensorFlow in that container, it's going to utilize it. It's just a feature of making the resources of a host machine available to the container.

OK. There's a question: are you running this on your own hardware, or are you using cloud providers to run the servers that are running the containers?

It's tricky. Right now we have most of this on our own hardware, but we are slowly moving towards using the cloud.

Have you found the cloud has caught up, or are you still not able to do that yet because you've got such specialist hardware?

I would say we are still not there because of all the legacy stuff that we have, but it's going to happen in the near term, yeah.

A question here: have you had issues with the size of the trained networks pushing the memory requirements on your app servers too far?
Have you played with distributed models and other things to try to reduce that?

Yeah, this is something I was working on just last month. We have tried out distributed training; TensorFlow comes with a really nice distributed training feature, and we definitely tried it. We are trying to find a way to put it in production in terms of making it automated, so that we don't have to manually spawn all the workers, parameter servers, and master nodes every time. So yeah, we are definitely working on that.

I know that Google do this heavily, distributing nets across multiple areas. I know one of their active areas of research has been creating nets that estimate what each part of the net will do and update it, so you've actually got nets living within nets doing estimations. It gets quite crazy. So one question I had was: once you've trained your model and you've got a new set of weights, you obviously need to load those back in. Do you just spawn a whole new version of the model with those weights in? Do you update existing models that are running with new weights? What's the sort of latency over that, and how often are you updating the weights and retraining?

That's actually a very nice question. There's a feature in Kubernetes about how you do your deployment, and this is exactly about how you update your model, right? So there are different options. Let's say you have 50 containers running the previous version, and now you have a new version. There are a couple of things you can do. One is you spin up 50 new containers and then switch over. The other is you start creating new containers and killing old ones as the new ones become available. There are trade-offs: in the second approach, there are going to be two versions running at the same time. So it depends: are you OK with having two versions serving at the same time? If so, go for the rolling approach. If you want only one version serving at any point, you wait until all 50 new containers are ready and then switch over.

Have you experimented with migrating the actual weights on each of the existing nodes to the new weights, so you just update in place rather than spawning whole new containers every time you train?

That's actually a good point. We haven't tried this, and I think it's not going to be as efficient, because once we have the TensorFlow model in memory, it's going to take a lot of time to go in and replace all of those weights. I think it's still better to create and load a new model instead of going into all the nodes in the graph and updating their weights. But yeah, it's definitely worth trying.

And how regularly do you train new weights?

Sorry?

How often are you training new weights?

It depends on the application. There are some models that need to be trained on new data every day, and there are some models that just need to be trained once and that's it. So it really depends on what kind of model and what kind of application we have.

Do you see yourself moving towards real-time updating of training any time soon, or are we not there yet?

What do you mean by that?

As in, it constantly updates the model weights based on new data rather than doing a daily train; it trains every ten seconds or five seconds.
No, we haven't really found an application for that yet. But once we find something, we're definitely going to try it out.

Cool. I always find it amazing: if I'm searching on Google and playing around in a different programming language to the one I'm used to, when I start typing 30 seconds later it starts giving me results back for that other language, even if I don't mention it. It has clearly switched what it's doing to pick up the context of where I am at the moment, which I guess involves some sort of training along those lines. It's fascinating. One, two more questions. One question was: why not Docker Swarm for orchestration?

Why not Docker Swarm? So previously we were using Marathon to manage our containers, but then we moved to Kubernetes for a couple of reasons. It has awesome community support behind it, and it's more reliable in the sense that it evolves more often. I'm not really sure about the comparison with Docker Swarm, but there are definitely certain reasons why we moved from Marathon to Kubernetes, and that's one of them.

Last question for now. Are you planning to open source any of your trained models? Maybe the image tagging model, for example. Obviously, training can be expensive.

I can't speak for the company, so I'm not sure about that.

Do you actively do research as a company? Do you try ImageNet competitions and things like that?

Not actively, but we have tried it a couple of times before. And regarding image tagging, we didn't open source the model, but we did open source the entire technique of what we did; we actually wrote a blog post about it. So it's all there, what we did and how we did it, just that the end result isn't.

Cool, we'll tweet that out for you. Excellent. Thank you very much, Sahil.

Thank you.