I work in the data science team at Gojek, Singapore. I'm wondering, how many of you have heard of a company called Gojek? Oh, that's great, thanks. So Gojek is a ride-hailing company in Indonesia with a wide range of services, like GoMassage, GoPay, GoFood, GoRide, GoCar, and 20-plus other services. Technical-wise, we're a company built on top of Google Cloud Platform, and I think that's the reason Google would be very happy for me to present, because we're running almost 100% of our services on GCP. So today I'm going to talk about some of the deep learning projects we're doing at Gojek. I think they're fun, and I hope you enjoy it. Due to time constraints, I won't be able to go into details for each of them, so if you're interested, you can talk with me a bit more after the talk.

Okay, real-time demand forecast. What you're seeing here is a map of Jakarta. What we're trying to build is a real-time demand forecast engine that gives the predicted demand for the next 30 minutes, refreshed every minute, so that we can visualize the results on a heat map to guide drivers: you should go there, because there's going to be a surge in demand within the next 30 minutes. The way we're doing this is as follows. We have hundreds of time-series signals, such as supply and demand for different locations and different service types, plus weather signals, app usage, and so on. We feed all those time-series signals into a preprocessing module, where we do log transforms, standard scaling, detrending, and so on. Then we feed the processed signals into a neural network model, and the model gives a real-time forecast of the estimated demand in different zones, as you saw in the previous graph. As for the model we're using here, we actually tried many models, like vanilla LSTM models. The model we're using in the final version, which works really well on this dataset, is a sequence-to-sequence model. Basically, it has two LSTM networks, an encoder and a decoder. The encoder encodes the input time signals into one fixed-length hidden vector, and the decoder decodes from that vector each of the next 30 one-minute timestamps; at each timestamp, it outputs our prediction for each of the zones. Looking at the test data for one day, the green line is the predicted value and the orange line is the actual value; they're quite close on the test set.
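To make the encoder-decoder idea concrete, here is a minimal sketch in Keras. The lookback window, signal count, zone count, and hidden size are my own illustrative assumptions, not the production configuration.

```python
# A minimal sketch of the encoder-decoder forecaster described above.
# Shapes (lookback window, signal count, zone count) are illustrative
# assumptions, not Gojek's actual configuration.
from tensorflow import keras
from tensorflow.keras import layers

LOOKBACK = 120    # minutes of input history (assumed)
HORIZON = 30      # predict the next 30 minutes
N_SIGNALS = 200   # hundreds of preprocessed time-series signals (assumed)
N_ZONES = 50      # zones to forecast demand for (assumed)
HIDDEN = 128

# Encoder: compress the input window into one fixed-length state vector.
enc_in = keras.Input(shape=(LOOKBACK, N_SIGNALS))
_, state_h, state_c = layers.LSTM(HIDDEN, return_state=True)(enc_in)

# Decoder: unroll HORIZON steps from the encoded state, one per future minute.
dec_in = layers.RepeatVector(HORIZON)(state_h)
dec_out = layers.LSTM(HIDDEN, return_sequences=True)(
    dec_in, initial_state=[state_h, state_c])

# Per-timestamp demand prediction for every zone.
preds = layers.TimeDistributed(layers.Dense(N_ZONES))(dec_out)

model = keras.Model(enc_in, preds)
model.compile(optimizer="adam", loss="mse")
model.summary()
```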
So the next one is fraud detection. We actually deal with a lot of fraud cases in our business: fake accounts, subsidy abuse, account takeovers, and so on. One of the most interesting and tricky ones is fake GPS. Why do we have fake GPS? Imagine you're a driver, and you fake your location to some shopping mall, for example, where a lot of people will be gathering, and then you go somewhere else to drink a coffee, smoke, and talk to your friends. Whenever you get a job, you quickly come back to your faked location and pick up the passenger. But there's a big problem with this: let's say you're a customer, you want to book a ride, and you see that your driver is one minute away from you, but then it turns out the driver takes 10 minutes to pick you up. That results in a bad customer experience, as well as unfair competition for those drivers who are honest and couldn't get the job.

In order to combat this, let's look at two examples of the GPS data. Each green dot is one ping from our system; we ping the driver's device every 10 seconds. On the left side, you're seeing the normal path of a driver using real GPS. It looks very real to me: the driver goes down the road, and if we calculate the speed at every ping, it stays within a reasonable range. On the right side, you're looking at a fake one. This guy has been stuck at one position for 10 minutes, never moving at all, and then all of a sudden he teleports to another place 300 meters away within a few seconds. Unless you're driving a Ferrari, you cannot get that kind of acceleration, certainly not on a motorbike. So this guy is apparently using fake GPS.

So what does the data tell us? If you look at these paths again, you'll notice that the statistics of the speed, the direction changes, the accuracy of the GPS readings, the altitude, and so on are really important for deciding whether this guy is using fake GPS or not. But the challenge is that there are many, many fake GPS apps you can download, or paid services you can use, and each of them has different patterns. Some of them are smart: they add noise and variance to the data to make it look real. So it's really hard to get labeled data for every fake GPS app and treat this as a classification problem. Instead, we asked the question: why don't we train a model on real GPS data only, and treat it as an anomaly detection problem? What we're doing is deriving features for each journey, say 10 minutes long: statistics of the speed, altitude, GPS accuracy, direction changes, and so on. Then we use the features derived from all the real GPS paths to train an anomaly detection model, for example an autoencoder. After training, the model will have learned the real GPS patterns, and when we feed it a new path, it will tell us whether that path is very different, or abnormal, compared with the real paths in the training data.
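Here is a minimal sketch of that autoencoder approach in Keras: train it to reconstruct per-journey features from real GPS only, then score new journeys by reconstruction error. The feature count, layer sizes, and thresholding are illustrative assumptions.

```python
# A minimal sketch of autoencoder-based anomaly detection on journey features.
# Feature count and layer sizes are illustrative assumptions.
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers

N_FEATURES = 24  # per-journey features: speed stats, altitude, accuracy, turns...

# Train the autoencoder to reconstruct journeys with *real* GPS only.
inp = keras.Input(shape=(N_FEATURES,))
h = layers.Dense(16, activation="relu")(inp)
code = layers.Dense(4, activation="relu")(h)      # bottleneck
h = layers.Dense(16, activation="relu")(code)
out = layers.Dense(N_FEATURES)(h)

autoencoder = keras.Model(inp, out)
autoencoder.compile(optimizer="adam", loss="mse")
# real_journeys: (n_samples, N_FEATURES) feature matrix from genuine paths
# autoencoder.fit(real_journeys, real_journeys, epochs=50, batch_size=256)

def anomaly_score(journeys):
    """Reconstruction error; fake-GPS journeys should reconstruct poorly."""
    recon = autoencoder.predict(journeys)
    return np.mean((journeys - recon) ** 2, axis=1)

# Flag a journey as suspicious when its score exceeds a threshold picked
# from the error distribution on held-out real journeys (assumption).
```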
Okay, so the next topic is more exciting for me, because if you're bored with GPS data and time-series data, we'll see a lot of pictures from now on, and also text data. GoFood is actually the fastest growing business in Gojek, and we now have eight million dishes in our database, so we're trying to build a very advanced recommendation system to increase our CTR and purchase rate. When we started on the recommendation, we realized that the picture is very important data. This is especially true for foreigners like me: when I'm booking an order on GoFood, I only look at the pictures, because I don't understand the text at all; I don't understand Bahasa. A picture also gives you quicker and easier decision-making than reading text to order food, and a more attractive picture leads to a higher chance of being ordered.

The way we're doing this is very straightforward. People have talked about transfer learning today, and we're actually using a transfer learning approach to solve this task, because we do not have any labels or tags on our dishes; that's the challenge for us. Instead, we're using the amazing Food-101 dataset, which is publicly available for download. It has 101 categories of different kinds of food, each with 1,000 images, and it turns out to be very close to the distribution of our food. So we think it's a very good dataset for first fine-tuning an Inception V3 model in a supervised way. After we fine-tune the supervised Inception V3, we use it as a feature extractor to get features from each of our images. Then, if the customer orders something like this, fried noodles for example, the model can calculate the feature similarity between this dish and all other dishes, and recommend the top 10 to the customer.

So let's look at the results. The left column is the food the customer ordered or liked before, and on the right are the dishes recommended based on that order. There are quite a lot of interesting cases here. If you look at the last row, the guy ordered something with two halved eggs, and all the recommended dishes also have halved eggs. We compared the metric, MRR (mean reciprocal rank), which we use to evaluate the performance of the model, and we saw a 50% improvement compared with the non-image recommendation model. Some other examples: this guy ordered some donuts, and all the recommended items are donuts with different flavors; for this cake, all the recommendations are cakes with the same texture but different flavors. So this is very interesting, and it turned out to work quite well in reality.

We also visualized the whole food gallery in TensorBoard using the t-SNE projection: we project the extracted features down to 3D and visualize them. This is really cool if you try it in TensorBoard yourself. You can see all your data in 3D, and all the similar foods, like Starbucks coffee, are clustered together, so we can see how many clusters we have in our food. Unfortunately, I cannot demo it here because I'd need access to my server, which I cannot do now; otherwise I'd show you the amazing 3D visualization, where you can rotate and zoom in to see more.
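Going back to the recommendation step itself, here is a minimal sketch of the feature-extraction and similarity ranking. In the talk the backbone is first fine-tuned on Food-101; to keep the sketch self-contained it loads plain ImageNet weights, and the 2048-dimensional pooled output and cosine similarity are standard choices, assumed rather than confirmed.

```python
# A minimal sketch: Inception V3 as a feature extractor, then rank dishes
# by cosine similarity. ImageNet weights stand in for the Food-101
# fine-tuned model described in the talk (assumption).
import numpy as np
from tensorflow import keras

extractor = keras.applications.InceptionV3(
    weights="imagenet", include_top=False, pooling="avg")  # -> 2048-d vector

def embed(images):
    """images: (n, 299, 299, 3) float array -> (n, 2048) feature matrix."""
    x = keras.applications.inception_v3.preprocess_input(images.copy())
    return extractor.predict(x)

def top_k_similar(query_vec, all_vecs, k=10):
    """Cosine similarity between one ordered dish and every other dish."""
    q = query_vec / np.linalg.norm(query_vec)
    m = all_vecs / np.linalg.norm(all_vecs, axis=1, keepdims=True)
    scores = m @ q
    return np.argsort(-scores)[:k]  # indices of the k most similar dishes
```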
So the next one is food tagging using CNN and word embeddings. This actually follows from what we discussed with the recommendation, because we do not have any tags for the 8 million dishes in our database, and that's a problem. That's why we started this project, to tag each dish. Eventually our CEO actually has a goal to tag everything in our company; that will be very hard, but food is an easy place to start. We want to tag all the food in our database so that we can do item-level analysis and dish-level recommendation. The way we're doing this is: we outsourced to a vendor to manually tag around 30,000 dishes, based on the image as well as the name and description of the food. Then we use that to train our model, and once the model performs well, we scale up to predict for all 8 million dishes.

But there are some challenges with this. Each item may have multiple tags: you can describe one item as beverage, sweet, ice, coffee, and so on. And manual labels are messy and duplicated if you ask humans to do the labeling. This is the difference between a real-world dataset and Kaggle: you spend 50% of your time arguing with yourself about whether your data makes sense, before you can even start on your model. Each person has a different opinion on the tags; they may have a different understanding of a tag, or even invent their own tags. So you end up with a lot of duplicate tags in the dataset, and also a lot of tags that appear so rarely that you cannot use them, because the model cannot learn anything from them. Some examples: I don't know how to pronounce the name of this one, but it looks like a pork soup, and the tags are spicy, contains pork, main course, and soup. Another one is not so well tagged, because the labeler called it a sweet dessert, but honestly I think it's a salty soup that contains pork. So some of the tags are actually not correct. Eventually we ended up with 80-plus tags from the 30,000 food images, and we computed a correlation matrix to see the correlation between the tags. We noticed some of them are very correlated: cake and bread, or cake and milk and dairy, are effectively duplicates, and there's a lot of that. So we spent at least 50% of the time cleaning up the data, merging some of the tags and removing some of the rare ones, before we could start the modeling.

For the word embeddings, we're trying to encode each word from the name and description; here the words come from the name and description, not from the tags. Before we can do this, we have to do a lot of preprocessing, such as misspelling correction ('black pepper chicken' gets spelled in many different ways) and a huge number of regular expressions, which you can see on the right: converting to lowercase, removing numbers, removing things like a person's name or a restaurant's name, and only keeping the words whose frequency is above a threshold. Eventually we used fastText word vectors, which come from Facebook. It's one of the latest word vector models, which Facebook claims to be better; if you're familiar with GloVe and word2vec, this is another kind of word vector model. It effectively has a larger vocabulary, because for words it doesn't have a vector for, it can estimate one from the subwords, the character n-grams of the word. We downloaded the Indonesian word vectors, which have 300,000 words including both Bahasa and English. This is really good for us, because our food names and descriptions contain both English and Bahasa. The dimension is 300 for each word, and the vectors are already pre-trained, so you can use them directly as the initialization weights for your word embedding.
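Here is a minimal sketch of that last step: loading pre-trained fastText vectors and using them to initialize a Keras Embedding layer. The file name and toy vocabulary are assumptions, and note that the plain .vec text file only supports whole-word lookup; the .bin model would be needed for the subword n-gram estimation mentioned above.

```python
# A minimal sketch of fastText-initialized word embeddings in Keras.
# File name and vocabulary are illustrative assumptions.
import numpy as np
from gensim.models import KeyedVectors
from tensorflow.keras import layers

EMB_DIM = 300
# Pre-trained Indonesian fastText vectors in .vec text format (file name
# assumed; .vec gives whole-word lookup only, no subword estimation).
vectors = KeyedVectors.load_word2vec_format("cc.id.300.vec")

vocab = ["mie", "goreng", "ayam", "black", "pepper"]  # toy vocabulary
word_index = {w: i + 1 for i, w in enumerate(vocab)}  # index 0 = padding

# Rows of this matrix become the initial weights of the Embedding layer.
emb_matrix = np.zeros((len(vocab) + 1, EMB_DIM), dtype="float32")
for word, idx in word_index.items():
    if word in vectors:
        emb_matrix[idx] = vectors[word]

embedding = layers.Embedding(
    input_dim=len(vocab) + 1,
    output_dim=EMB_DIM,
    weights=[emb_matrix],   # fastText initialization
    trainable=True)         # allow fine-tuning on the food corpus
```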
So this is the model. It's nothing fancy: the deep learning model has an image encoder and a text encoder. We encode the image with Inception V3; again, this is the V3 that was pre-trained on the food dataset previously, and we use it to encode the image into a fixed-length vector. For the name and description, we concatenate both into one sentence, then pad it to a fixed length of 20, so every item has 20 words; if it's not long enough, we pad with zeros. We embed each word with fastText as initialization, then go through an attention layer (I'll talk a bit more about attention in a moment) and a batch normalization layer. We have one more input, the item price, for which we use one dimension. Eventually we concatenate all the vectors, go through a fully connected layer, and the output is the 33 labels we have as our target. The loss function is binary cross-entropy, because we're doing multi-label classification: it's not a softmax; each label is a binary classification task in itself. For those of you who are careful and want to check the dimension matching: the input is 20 words, which after embedding is 20 by 300; the attention squeezes that down to a single 300-dimensional vector; batch norm does not change the dimension; and finally we concatenate that with the image vector and the one price dimension.

So what is attention? Attention is getting a lot of attention recently, and there are several types: dot-product attention, additive attention, and attention scores predicted by a neural network. We're using the simplest and most commonly used one, the dot product. Imagine these are the 20 word embeddings of your input, so the dimensions are 20 by 300, one row per word embedding. You take the dot product with an attention matrix, which is 300 by 1; those are your trainable weights. You get logits of dimension 20, and these logits are unbounded, so they can be infinitely large or small. Then you pass them through a softmax to squash the values into probabilities, which sum to one, with each value between zero and one. Once you have this probability distribution over the scores, you just take the weighted sum of the original word embeddings, and that's your final output. What it does is: after the dot product, the attention tells you, okay, maybe this word is more important, I want to put more weight on it, and this word is not important, I'll give it a very small weight. So eventually your output is a weighted sum of the original input words from the name and description of the item, based on the importance the model learned.
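Here is a minimal sketch of that dot-product attention pooling as a small Keras layer, following the shapes from the talk (20 words by 300 dimensions, a 300-by-1 trainable weight); the layer name and initializer are my own choices.

```python
# A minimal sketch of the dot-product attention pooling described above.
import tensorflow as tf
from tensorflow.keras import layers

class DotProductAttention(layers.Layer):
    """Squeeze (batch, 20, 300) word embeddings into (batch, 300)."""

    def build(self, input_shape):
        # The 300-by-1 trainable attention weights from the talk.
        self.w = self.add_weight(
            name="attention_w", shape=(input_shape[-1], 1),
            initializer="glorot_uniform", trainable=True)

    def call(self, embeddings):
        # (batch, 20, 300) @ (300, 1) -> unbounded logits, shape (batch, 20, 1)
        logits = tf.matmul(embeddings, self.w)
        # Softmax over the 20 words -> probabilities summing to one.
        probs = tf.nn.softmax(logits, axis=1)
        # Weighted sum of the original word embeddings -> (batch, 300).
        return tf.reduce_sum(probs * embeddings, axis=1)

# Usage: pooled = DotProductAttention()(embedded_words)
# where embedded_words has shape (batch, 20, 300).
```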
The metric we're using here is the mean column-wise ROC AUC, which is just the mean of the individual AUCs for each of the 33 labels. The reason we're using AUC is that we don't have to manually set a threshold for each label, because AUC is threshold-free, and it's also very robust to skewed labels: if some label has a lot of zeros and very few ones, AUC still makes sense. This is the log loss plot for the validation dataset; nothing looks obviously wrong, it converges quite well, and after 10 or 15 epochs we get an AUC of around 0.95. And this is the individual AUC for each of the labels. We see that pizza, beverage, burger, and coffee are very easy to predict, because the shapes of those items are very easy to recognize. However, some of the more abstract or ambiguous labels are really hard to predict, like 'hot'. What do you mean by hot? It could be that you think the temperature is hot, or you think the food is spicy, or maybe some guy thinks the food is sexy, I don't know. So this one can be confusing, and people will find it hard to label correctly.

This is also a lesson we learned from this project: it's very important to give people clear label definitions to work with. Otherwise they will not give accurate or consistent labels for the food, because you have a bunch of labelers, each of them giving a different opinion, and then it turns out your model will not be able to learn optimally. So what we can improve in the future is to remove some of the labels that don't make sense, and maybe use only labels that people can clearly understand and label well. So yeah, that's it. Thank you very much.