Hi, is this good? Yeah. So I will be talking about how we serve deep learning model predictions at Booking.com.

Before I start, I would like to give a brief introduction about myself: what I am and what I'm not, so that we understand each other and can set the right expectations. I'm a backend developer working on the infrastructure for deploying deep learning models at Booking.com, and I'm also a machine learning enthusiast, so these two things match well for me. I'm also a big open source fan. I'm a contributor to a couple of projects, like Git, the tool that probably most of you have used already, the pandas library, Kinto by Mozilla, a GitHub project by Google, and a bunch of other projects. And I'm also a tech speaker.

Now let me tell you what I'm not, so that we have the expectations at the same level. I'm not a data scientist, and I'm not a machine learning expert. So if you have specific questions about how things work from a data scientist's point of view, or about something deep inside deep learning or machine learning, I might not have the best answer right now, but I will be able to point you to where you can find it, or we can talk about it after my talk.

Let me start with the agenda. I'm going to begin with a couple of applications of deep learning that we saw at Booking.com. Then I will talk about the lifecycle of a deep learning model from a data scientist's point of view: what the model looks like and what the different stages of a deep learning model are. Next, I will talk about the deep learning production pipeline that we have built on top of containers and Kubernetes. So let's begin.

Starting with the applications of deep learning at Booking.com. Before I talk about the applications, I would like to talk about the scale, because we work at a large scale. We have over 1.2 million room nights reserved every 24 hours, and these reservations come from more than 1.3 million properties across 220 countries. This scale gives us access to a huge amount of data that we can use to improve the experience of our customers.

The first application that we saw at Booking.com was image tagging. The question here is: what do we see in a particular image? For example, if you look at this image, what do we see in it? This is a really easy question as well as a difficult one. If you ask a person, a human, it's easy, because when we look at the image we can identify the objects in it. But when we ask artificial intelligence, machine learning or deep learning to answer it, it's not so easy. For example, if we pass this image to some publicly available model trained on ImageNet or something similar, this is what we get: classes like oceanfront, nature, building, penthouse, apartment, and so on. Below is a quick sketch of what querying such a publicly available model looks like.
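To make that concrete, here is a minimal sketch, in Python with Keras, of tagging an image with a publicly available ImageNet-pretrained model. ResNet50 and the file name are purely illustrative choices, not the model or pipeline used at Booking.com.

```python
# Illustrative only: querying a publicly available ImageNet-pretrained model.
import numpy as np
from tensorflow.keras.applications.resnet50 import (
    ResNet50, preprocess_input, decode_predictions,
)
from tensorflow.keras.preprocessing import image

model = ResNet50(weights="imagenet")  # generic model pretrained on ImageNet classes

# Hypothetical input photo of a hotel room.
img = image.load_img("room_photo.jpg", target_size=(224, 224))
x = preprocess_input(np.expand_dims(image.img_to_array(img), axis=0))

preds = model.predict(x)
# Top-5 generic ImageNet labels -- useful, but not the booking-specific tags
# (sea view, bed, balcony, ...) that we actually care about.
for _, label, score in decode_predictions(preds, top=5)[0]:
    print(f"{label}: {score:.3f}")
```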
But when we ask what is actually in an image, it really depends on the context we are interested in. From Booking's point of view, we are concerned about things like: is there a sea view from the room, is this a photo of a bed, is it taken inside a room, is there a balcony or a terrace?

There are a couple of challenges associated with this kind of problem. First of all, it is not just image classification, it is image tagging: there can be multiple labels, multiple classes, for a single image. Also, since our context is different from what publicly available models provide, we need to come up with our own manual labels for tagging these images. The next challenge is that there is a hierarchy among the labels. For example, if there is a bed in the photo, the photo is almost certainly of the inside of a room, unless you happen to be in the kind of room that is nothing but a bed.

Once we know what is in an image, we can use this information to improve the experience of our users. For example, if we know that a user is looking for a swimming pool in the property they are going to book, we can recommend or show them hotels that have a photo tagged with swimming pool. Similarly, if we know from their previous history that a customer is looking for a breakfast buffet, we can show them properties that have photos tagged with breakfast buffet. This way we help customers find the hotels or properties they want easily and quickly.

Another application is a recommendation system. This is a classic recommendation problem: user X booked hotel Y; now we have a new user, user Z, and we want to recommend hotels that user Z is most likely to book. The problem statement is to estimate the probability of a given user booking a particular hotel. What features do we have? We have user features, such as the country and language of the user. We have contextual features, such as the day of the week or the season in which they are searching: winter, spring, and so on. And we have item features, the features of the property we are looking at: its price, its location, and other information about that particular property.

Once we realized that there is a set of applications where we could achieve better results using deep learning, we started exploring this field. Credit goes to my colleagues, including one of our data scientists, who started with the exploration of deep learning on different applications; we are now using it in production successfully. A rough sketch of what a model over these feature groups could look like is shown below.
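Here is a minimal sketch, assuming a Keras-style model, of how user, contextual and item features could be combined to predict a booking probability. The vocabulary sizes, embedding dimensions and layer widths are invented for illustration; this is not Booking.com's actual architecture.

```python
# Illustrative booking-probability model combining user, context and item features.
from tensorflow.keras import layers, Model

# User features: country and language as learned embeddings.
country = layers.Input(shape=(1,), name="user_country", dtype="int32")
language = layers.Input(shape=(1,), name="user_language", dtype="int32")
# Contextual features: day of the week and season.
day_of_week = layers.Input(shape=(1,), name="day_of_week", dtype="int32")
season = layers.Input(shape=(1,), name="season", dtype="int32")
# Item (property) features: e.g. normalized price, latitude, longitude.
item_numeric = layers.Input(shape=(3,), name="item_features")

def embed(inp, vocab_size, dim):
    # Embed a categorical feature and flatten it to a plain vector.
    return layers.Flatten()(layers.Embedding(vocab_size, dim)(inp))

x = layers.concatenate([
    embed(country, 250, 8),
    embed(language, 100, 8),
    embed(day_of_week, 7, 2),
    embed(season, 4, 2),
    item_numeric,
])
x = layers.Dense(64, activation="relu")(x)
x = layers.Dense(32, activation="relu")(x)
booking_probability = layers.Dense(1, activation="sigmoid")(x)

model = Model(
    inputs=[country, language, day_of_week, season, item_numeric],
    outputs=booking_probability,
)
model.compile(optimizer="adam", loss="binary_crossentropy")
```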
Next, let's talk about the lifecycle of a model: what it looks like from the start of the idea to the point where it is actually used in your applications, whatever they may be. There are three steps: code, train, and deploy. The first step is where a data scientist writes a model and experiments with different kinds of embeddings, different features, different numbers of hidden layers; in short, with different model architectures. Once they are happy with it and see good results, they move on to training on production data, and then they deploy. At Booking we use the TensorFlow Python API, a high-level API that provides easy-to-use functions for writing a model architecture in Python.

When we talk about the production pipeline, these are the two steps it consists of: training a model on production data, and deploying it in containers from which it can be served to any application.

You may wonder why training a model is part of our production pipeline at all; you could also use your laptop to train your models, right? Here is why that's not a good idea. One reason is that your data may be too large to handle efficiently on a laptop. Another is that a laptop, in most cases, has limited resources: a limited number of cores and usually no powerful GPU. So it makes sense to do the testing and experimenting with the model on your laptop, but once you are sure this is the model you want to go ahead with, it's a good idea to use heavy, specialized servers with GPUs or a high number of cores, so that you speed up training and, with it, deployment once the model is ready.

This is how training a model looks for us. We have big servers with a lot of cores, and sometimes GPUs as well, and we run the training script for a particular model on those production servers. But there are multiple data scientists training their models, and sometimes multiple models are being trained on the same server at the same time, and we would not be able to provide independent environments if we did this directly on a single server. So what we do is wrap the training inside a container.

What is a container? A container is a lightweight package of software that you can run on a host machine, and it includes all the dependencies your application needs. We wrap the training script inside a container and spawn a new container every time we want to train a model. This also gives us easy versioning of TensorFlow: if an existing model is written against, say, TensorFlow 1.1 and a data scientist wants a newer version for a new model, the new container can simply carry that new version. On the same machine we can have different versions of the dependencies, which is exactly why we use containers: to guarantee independent environments for all of these training runs. It also helps with GPU support, since these containers can use the GPUs on our big servers.

So this is how it works. We have Hadoop storage holding all the production data we want to use for training our models. Whenever we want to train, we spawn a new container; it contains the training script, it fetches the data from the Hadoop storage, and it runs the training. A rough sketch of such a training entry point follows.
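This is a minimal sketch, under assumptions of my own, of a training entry point run inside such a container. The fetch_training_data and build_model helpers, the module names and the HDFS paths are hypothetical placeholders, not Booking.com's actual code.

```python
# train.py -- rough sketch of the training entry point inside a training container.
import argparse

from my_project.data import fetch_training_data   # hypothetical helper
from my_project.model import build_model           # hypothetical model definition


def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("--train-data", default="hdfs:///data/bookings/train")
    parser.add_argument("--checkpoint-dir", default="hdfs:///models/demo/checkpoints")
    parser.add_argument("--epochs", type=int, default=10)
    args = parser.parse_args()

    # 1. Fetch the production training data from the Hadoop storage.
    features, labels = fetch_training_data(args.train_data)

    # 2. Build and train the model, using whatever TensorFlow version this
    #    container image carries (independent of other containers on the host).
    model = build_model()
    model.fit(features, labels, epochs=args.epochs)

    # 3. Save the model checkpoints back to the Hadoop storage so the serving
    #    side can load them later; after this the container simply exits.
    model.save_weights(args.checkpoint_dir + "/final.weights")


if __name__ == "__main__":
    main()
```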
Once the training is done, we want to make sure that the model checkpoints, the model weights, are stored somewhere so that we can use them later when we deploy to production. So we save the model checkpoints back to the Hadoop storage, and then the container is gone. What could be more selfless than a container? It is born to do what you want it to do, and then it dies. That's the entire life of a container.

At this point we have trained our model on production data and stored the model checkpoints on Hadoop storage, where we can now use them. Deployment means putting that model in production, on servers where your applications, whatever they are, a web application, an Android or iOS app, can use it to get predictions.

What we built is a Python app, a basic WSGI HTTP server, which takes the model weights from the Hadoop storage and loads the model into memory. Loading a model needs two things: the model definition and the model weights. We already have the model definition in the Python app, and we get the model weights from the Hadoop storage; we combine the two and load the model in memory, so it is ready to serve predictions. On top of this, the app provides a nice, easy-to-remember URL for getting predictions. It basically boils down to sending a GET request with all your feature parameters and getting the prediction back.

Again, this app runs in a containerized environment, so it is independent and carries all its dependencies with it. There are no problems of the "it runs on my machine" or "it runs on this OS version but not on that one" kind; it can run on any server where you can run Docker containers, and Docker is what we use to run these containers. A minimal sketch of such a serving app is shown below.
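Here is a minimal sketch of a prediction service along these lines. Flask is used purely as an example of a WSGI framework, and build_model, the weights path, the route and the feature names are hypothetical; this is not Booking.com's actual service.

```python
# serve.py -- minimal sketch of a WSGI prediction app that keeps the model in memory.
import numpy as np
from flask import Flask, jsonify, request

from my_project.model import build_model   # hypothetical model definition

app = Flask(__name__)

# Combine the model definition with the trained weights (fetched from Hadoop
# storage) once at start-up, so every request is served from the in-memory model.
model = build_model()
model.load_weights("/srv/model/final.weights")


@app.route("/predict")
def predict():
    # All input features arrive as GET parameters, e.g.
    #   /predict?country=25&day_of_week=3&price=120
    features = np.array(
        [[float(request.args[name]) for name in ("country", "day_of_week", "price")]],
        dtype="float32",
    )
    prediction = model.predict(features)
    return jsonify({"prediction": float(prediction[0][0])})


if __name__ == "__main__":
    # For local testing only; in production this runs under a proper WSGI
    # server inside a Docker container behind the load balancer.
    app.run(host="0.0.0.0", port=8080)
```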
So this is how it looks: we have containerized serving of our model, and any kind of client can simply send us the input features and get predictions back. But as I mentioned earlier, we operate at a huge scale, and when you have thousands or millions of requests per second, you can't have just one server. So we spawn a lot of containers and put them behind a load balancer. The client doesn't know how many servers are actually serving; it just sends a request to the load balancer's address, and the load balancer takes care of scheduling and routing the requests. Given our scale, that means a lot of containers. And as the number of containers for an application keeps growing, we need a way to manage them: sometimes we want to increase the number of containers, sometimes we want to decrease it when there is less traffic, sometimes we want to diagnose containers when something goes wrong, or kill some containers and spawn them again after an error. For this, we use Kubernetes.

Kubernetes is a container orchestration platform that helps us with scheduling, maintaining and scaling applications that run in containers. It's a really nice tool by Google that gives us a flexible way to scale any application up or down at any time. We can create new containers, put them behind the same load balancer, and they will start serving requests from the clients, or we can scale down with a single command. Kubernetes also makes sure that if we say we want, for example, 50 containers for an application, then even if one, two or ten of those containers die because of some error, it keeps retrying and creating new ones, so we don't have to worry unless something is seriously wrong and it cannot create new containers at all. Basically, it maintains the number of containers at the level we have set.

Once we know how we deploy the models, we also need to be able to measure the performance of these models in production, with requests coming in at a rate of many thousands per second. It looks like this: your model takes some computation time to produce a prediction for a set of input features, but that is not the time your client will see. The client also pays some request overhead due to network latency, depending on where your app is hosted and where the client is coming from. So the total prediction time, from the client's point of view, is roughly the request overhead plus the computation time; and if you predict for n instances in one request, it is roughly the request overhead plus n times the computation time. For simple models such as logistic regression or linear regression, with a small set of features, the request overhead is the bottleneck and the computation time is almost negligible in comparison.

Once we know what kind of performance to expect, there are two things we may want to optimize for: latency or throughput. Let's talk about latency first. Latency is the amount of time it takes to serve one request. You may have applications, say a web application, that need to respond as soon as possible, so you optimize for latency there. These are some of the ways to do that. The first one: don't predict in real time if you can pre-compute. If you can pre-compute all the results you know you are going to be asked for, you can save them in a lookup table and serve from that lookup table; you will be really fast, with no computation time at request time. A small sketch of this idea follows. Of course, we understand that this is not always possible; in most applications we do need to predict in real time.
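As a toy illustration of the pre-computation idea (not necessarily something done at Booking.com; build_model, the weights path and the tiny feature space are made up), serving then becomes a plain dictionary lookup:

```python
# Toy sketch of "pre-compute and serve from a lookup table".
import numpy as np

from my_project.model import build_model   # hypothetical model definition

model = build_model()
model.load_weights("/srv/model/final.weights")

# Offline step: run the model once over every input combination we expect to see.
lookup = {}
for country in range(250):        # hypothetical categorical feature space
    for season in range(4):
        features = np.array([[country, season]], dtype="float32")
        lookup[(country, season)] = float(model.predict(features)[0][0])

# Online step: no model call at request time, just a dictionary access.
def predict_cached(country, season):
    return lookup[(country, season)]
```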
Where we do need to predict in real time, we can reduce the request overhead. One way to do that is to embed the model in the application, so that there is no extra latency in reaching the model and getting the predictions back. That is what we do: we keep the model in memory in the container that serves the app, so it can predict and return the response quickly.

The next technique is to predict for one instance per request. This is useful when the computation time is large compared to the request overhead. When you know the computation time is the bottleneck, you should send as many requests as you have instances: if you want predictions for 10 instances, send 10 requests, because the request overhead is not the bottleneck here and there is little to gain from reducing it; you just want to fire the requests as soon as possible and get the results back.

You can also use techniques like quantization, which means converting your float32 values to a fixed-point 8-bit type. This helps because the CPU can now hold four times more data in the same amount of space, and processing that data becomes faster than computing with the full float values. Then there are TensorFlow-specific techniques, like freezing the network: in your computation graph you take the TensorFlow variables and convert them into constants, which gives a boost in the speed of computing predictions. Another one is optimizing for inference, which means removing all the unused nodes from the graph; that speeds up the computation further.

The other thing we may want to optimize for is throughput. Throughput is the amount of work done in one unit of time, maybe one second or one minute, depending on your use case. If you want to get a lot of work done per unit time, the first advice is, again, don't compute in real time if you can pre-compute: keep a lookup table with all the results and use it when the requests come in. Another thing you can do is batch the requests. When you want the maximum amount of work done per unit time, you want to reduce the request overhead as much as possible, so if you send, say, a thousand instances together in one request, you save roughly a thousand times the request overhead compared to sending those requests one by one. And you can also parallelize the requests: instead of waiting for one response to come back before sending the next request, use asynchronous requests, send them all in parallel, let the server do its work, and collect the responses asynchronously, so that you get the maximum work done per unit time. A small sketch of sending prediction requests in parallel follows.
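Here is a small sketch of firing prediction requests in parallel using a thread pool; the endpoint URL, the feature names and the pool size are hypothetical, just for illustration.

```python
# Send many prediction requests concurrently instead of one by one.
from concurrent.futures import ThreadPoolExecutor

import requests

PREDICT_URL = "http://prediction-service.example.com/predict"  # load balancer address


def predict_one(features):
    # One GET request per instance, carrying the input features as parameters.
    return requests.get(PREDICT_URL, params=features, timeout=5).json()


instances = [
    {"country": 25, "day_of_week": 3, "price": 120.0},
    {"country": 81, "day_of_week": 6, "price": 75.0},
]

# Fire all requests concurrently and collect the responses as they complete,
# rather than paying the request overhead serially for every instance.
with ThreadPoolExecutor(max_workers=32) as pool:
    predictions = list(pool.map(predict_one, instances))
print(predictions)
```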
Let's try to summarize what we talked about. First, training models in containers: we spawn a new container, it fetches the data from our Hadoop storage (it can be MySQL as well, it really depends on the application), and it runs the training script in an independent environment inside the container. Once the training is complete, it stores the model checkpoints back in the Hadoop storage, and it dies. That's the entire process of training a model in containers.

Next is serving these models from containers using Kubernetes. We spawn as many containers as we need, depending on how many requests we have for that particular application, and we let Kubernetes do its job: load balancing, maintaining and managing the containers, and providing an easy interface for diagnosing any problems we may have. And finally, we optimize the serving either for latency or for throughput, depending on the application. If you have something like a cron job with a lot of work to do in one burst, you can use the throughput-optimization techniques; if you have a real-time application where you need to show the result to the user right away, you optimize the serving for latency. We have both options available in our pipeline.

To work on all these cool things, and a lot of others like MapReduce, Spark and recommender systems, we are hiring, especially for software developer roles as well as data scientist roles. If you're interested in working on these things, you can check out this link or get in touch with me on LinkedIn, Twitter or GitHub; I go by the handle sahildua2305 on most social media websites. That's it. Thank you.

Thank you, Sahil. Please raise your hand if you have a question.

So, you use Kubernetes and you can scale the number of replicas up and down, as you mentioned. What do you use to decide whether you should scale up or scale down? What is the logic behind the load balancer, and do you do it manually or automatically?

If I understand your question: what do we use as a metric to decide whether we want to scale?

Yes, exactly. Say you have five replicas, now you get more load and you need to decide whether there should be ten replicas, or whether to scale down.

Right. Kubernetes, out of the box, supports a few metrics such as CPU usage, disk, memory, as well as the incoming traffic, the number of requests. Which one we use really depends on the application. In some cases we use CPU usage, which tells us how busy the CPUs of a particular container are. We may also use the WSGI queue size, because when a lot of requests are coming into the containers, we want to make sure those queues are not filling up; once they start getting full, we spawn more containers so that the traffic gets distributed and the queues don't drop requests. So it depends; the WSGI queue size is one of the metrics we look at.

Okay, thanks. And my second question is: how do you annotate your data for model training? Do you have a team of annotators, or how do you do it?

You're asking how we come up with that data, how we annotate the images?

Yes, whether you have a team of annotators who label: this is a bed, this is a chair, this is a window.

Oh, yes. When we started building this model for image tagging, we outsourced the tagging: a huge number of images were tagged manually, by humans, and we used that data to train our model.

And is it some company, or how did you hire them?

Sorry?

Is it an external company that has these annotators, or how did you arrange it?

Your voice is not very clear, sorry; I'll come to you after. Yeah, thanks.

Okay, so thank you very much for your talk.
I think this is one of the main problems with machine learning in Python: deploying it. I would be interested in the following: if you want to use machine learning, you usually have to do some feature engineering; you get some input data and then you have to crunch some numbers. Where do you actually do that? Do you do it in the app, and tell the app it has to provide this data? Do you do it in the container? Or do you do it on Hadoop when the data comes in, and just send a pointer to the Hadoop data?

Yes, that's something I didn't cover. We have events data being logged for all the activity on our website, and we have Oozie workflows or cron jobs that process that data and prepare it in the form we want to use in our models. So there is a separate workflow that takes care of merging and preparing the data for these models.

I was wondering if you could talk a little bit about how you iterate on your models. I'm not sure whether this is the case for you, but say you have some new training data you want to take into account: you want to retrain your models and then check whether they are still performing well or not. How do you deal with that kind of thing?

Are you asking about how we deploy new models, or about the performance testing of new models?

Both.

So, when a data scientist wants to update a model, we use OpenShift on top of Kubernetes, which gives us a graphical interface for managing the whole setup. Once there is a new model, we update our deployment with that new model, and we can use A/B testing to see what kind of results we get. We also have proper monitoring that shows us the distribution of our feature sets and the distribution of the outputs of a particular model, and we use that information to decide whether the model is good or not, whether we want to keep it or move back to the previous version.

In your talk, one of the ways to improve throughput and latency was to cache, to have a lookup table of previous predictions. How do you implement that, per container or centralized, and what technology do you use for it?

The caching of predictions in lookup tables that I mentioned is something that really depends on your use case. Honestly, we haven't found a use case where we already know in advance which predictions we are going to be asked for. So we don't use lookup tables; we predict in real time, and we use the other techniques I mentioned to optimize for latency as well as throughput. We simply don't have an application yet where we could employ lookup tables.

Okay, so that's it. Thank you, Sahil. Please give a warm round of applause to Sahil.