Now for the thank-yous: I'd like to thank Microsoft for hosting us, and thank Traveloka for coming to speak with us today. We have Depp, who does data science at Traveloka, focused on NVS, which is natural language, computer vision and speech. And then we have Yixuan, who works on recommender systems. So without further ado, please welcome them.

Hi, hi. Can you hear me at the back, or...? Okay, okay, good. I actually wanted to start by asking how many people have heard of Traveloka before. That may be a bit of a loaded question, given that we were invited here to present. But how many people in this audience...? Okay, here we go — can you hear me now? Yes? Okay. So, the question: how many of the audience have heard of Traveloka before? Oh, wow, okay. People from Traveloka, put your hands down, please.

Okay, so in that case I think it's always useful to explain a little about who we are, because we don't really explain too much about ourselves. We do online travel at Traveloka. You may know us best for flights and hotels, but we actually do a lot more than that. This slide may not be the easiest thing to read, but it gives a sense of what we do: flights, hotels, and a range of other products. From a scale perspective, one number I can share is that we've had more than 40 million app downloads across the region.

In terms of offices, the only exception to offices versus operations is India: we only have a research presence in India, so we don't actually offer our platform or app as a service there. Otherwise it's Singapore, Malaysia, Vietnam, the Philippines, Thailand, Indonesia, and recently Australia. Okay — corporate HR did ask me to make sure I say good things about Traveloka, so: we love working there. That's the background.

And a little bit about machine learning at Traveloka. Before we get there, I thought I'd just introduce how we're structured. Naturally, you can think of us as comprised of central business units, so our products are owned by business units — for example, flights and hotels are products owned by a BU. We have central engineering teams who back this up, we've got corporate functions like HR, et cetera, and then we have the data team. What the data team wants to do is enable the company to make better decisions: generate actionable insight, create and nurture a culture of data-driven decision making, and essentially create long-term data capabilities and assets for the business.
Now, I hope some of you are noticing that we don't have goals like "build the data lake" or "make an AI team". That's probably not a good mission or vision to have; doing it for the sake of doing it is probably a foolhardy endeavour. So we're focused on enabling business impact. What we do own to get there is things like the data lake and our core data platform, which makes sure that any machine learning models we deploy to production have the features they need in time. We've got our analytics teams, and we have AI and ML — I try not to use the word AI as much as I can; we have a running joke that if you're writing code it's machine learning, otherwise it's AI. And of course we have A/B testing, which is a critical component of pretty much any machine learning effort — in fact, any company can and should have A/B testing.

Okay, so within the data team, we spoke about machine learning. At Traveloka we try to segment it into two very big, broad buckets: machine learning for humans and machine learning for machines. ML for humans is where we generate our insights. We have a lot of models — for example, for complex metrics we use statistical measures to come up with numbers that are valuable to the business — but this is also where we house our experimentation unit as well as our data models. Machine learning for machines is what we think of as data or machine learning products. So you've got insights and you've got models, but there are also certain components of the platform which we want to automate or augment, and that's where we leverage machine learning to take them to the next level. We also do R&D. I should say that when it comes to R&D at Traveloka we're very specific: we're not a blue-sky research organisation. I think we're not at a stage where we feel that's truly valuable for us, but we're certainly keen to innovate, so we do a lot of R&D in alignment with what the business needs us to do.

Okay, so that's us — some of us. I do apologise to the women in the room; I'm so sorry for such a monochromatic, non-diverse picture. We definitely have some wonderful women ML engineers and analysts at Traveloka; this is just some of us. You may see Ishwara here as well, who's with us today. These are a couple of our teams: the NVS team, which is NLP, Vision and Speech, and our recommendation systems team, busy off doing some AI, right?

Okay, so the theme of the talk is NLP at Traveloka, and how we make it work in a non-English environment. So what do we use NLP at Traveloka for — where do we use it? The usual suspects: you've got chatbots — that's supposed to represent a chatbot, by the way — so yes, we've got chat technologies, and Ishwara is going to demo that, sorry. But we also have review moderation. If you think of the platform and how people engage with us, we've got a pretty active community, and they're very vocal about their feedback and how they want to express their approval or disapproval of the places they stay at on the platform. At the same time, we need to make it a sane and safe place for everyone to engage with, so review moderation comes into play. And in terms of the load that we get, if we needed to do it completely manually, it would cost us a lot of money to maintain the same level of quality of customer service.
So this is where we leverage machine learning and AI to help ease that. Customer support: one of the core values Traveloka really tries to espouse is top-notch customer service, so when people engage with our platform we try to make sure we're delivering the best experience we can, and we use machine learning to make sure customer support is scalable. Search: I'm probably not going to go into too much detail about what search is; I think we've all pretty much been spoiled by Google-quality search these days. Knowledge engineering: here we're basically trying to understand the relationships between entities, or things, and there's certainly a component of understanding how language plays a part in that. So that's where we use NLP at Traveloka. The other thing that goes without saying is that Indonesia is our base market, and Bahasa is a language spoken throughout this region — it's very widely used.

So what's the challenge? Why is NLP in a non-English environment even a factor? It's interesting to look a little at how machine learning has evolved and transformed over the last few years; most of you probably encountered machine learning after this whole deep learning hype cycle. Interestingly enough, the technology that powers deep learning at its core is quite old — it's certainly been around for decades. What actually happened is that computing power and data caught up, and that really caused an explosion: we leveraged the hardware and data at our disposal to make a huge jump, and that caused a huge hype cycle. What's gone under the radar is a lot of the ancillary domains. Initially it was all in the vision space, but domains like natural language actually rely on a lot of — I wouldn't say artisanal endeavour, but certainly a lot of careful scientific research — to power the core fundamental components.
So you're talking about things like dependency parsing — you're saying, well, this is a sentence, so how do you understand the parts of speech that comprise the sentence and how they relate? And things like named entity recognition — how do you understand from this particular sentence that this is a person, this is a place, this is an organisation? Tools like that will be around for as long as there are names. But of course, when it comes to non-English languages, they're not as useful or as rigorously trained, certainly in our experience. That's probably a function of the fact that labelled datasets, or datasets for research, in non-English languages are not as available — certainly for Southeast Asian languages. There are exceptions: Mandarin, Japanese, and certainly some of the European languages. But in general, for Bahasa, we don't find as much data. So we're missing the key ingredient here: if you think back to what really powered this hype cycle, this AI cycle, this deep learning wave — it's data, and we're missing that ingredient.

The other thing, as a function of that, is baseline datasets. You do some work, but it's hard to tell how good you are, because there's no agreed-upon baseline which says, okay, if you get a perplexity of 40 for your specific dataset and your language model, you're doing well or not. So you've done some work, but you don't always know how good it is until you actually test it out.

And this is where Traveloka has a bit of an edge. You can think of high-quality NLP in three buckets — again, this is meant to be a generalisation, so it's not highly accurate — but essentially: if you get your pre-processing right for your specific language, if you've got a good set of word embeddings, and if you're comfortable implementing and tuning your chosen deep learning architecture or framework, I think you're in good hands. That's a good starting point, and it carries over to your chosen language.
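As a small illustration of that first bucket — getting pre-processing right for noisy, user-generated Bahasa text — here is a minimal, generic clean-up sketch. It is not Traveloka's actual pipeline; the rules and placeholder tokens are assumptions chosen for illustration only.

```python
import re

def normalize(text: str) -> str:
    """Very rough clean-up for noisy user-generated text (chat logs, reviews)."""
    text = text.lower()
    text = re.sub(r"https?://\S+", " <url> ", text)   # mask links
    text = re.sub(r"\d+", " <num> ", text)            # mask numbers
    text = re.sub(r"(.)\1{2,}", r"\1\1", text)        # "murahhh" -> "murahh"
    text = re.sub(r"[^\w\s<>]", " ", text)            # drop stray punctuation
    return re.sub(r"\s+", " ", text).strip()

print(normalize("Hotelnya murahhh bangetttt!!! cek https://example.com ya, cuma 150000"))
```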
Is that better? No? Okay, so let's move on. How many of you know what language models are? Okay, so all of you. Language models are basically a really simple concept to understand. If it's an n-gram model, you're saying: given a specific sequence of words, try to predict the most likely next word. So, "my name is ..." — and depending on the corpus it might be "John", if it's an English corpus. That's what a language model is.

ULMFiT essentially says: here is a framework where you can train and tune your language model in three stages. First, you train your core model on a general corpus — say your Wikipedia and news article datasets. Then you specialise your language model — to cater, say, to customer support in the travel industry — and this is probably a good time to introduce that domain corpus and fine-tune on it. Finally, you do a last step, which is fine-tuning for the downstream task, say classification.

Now let me just move on. I wasn't sure what the audience composition was going to be, so I didn't want to make it too technical; at the same time, I do feel there are phenomenal resources out there that help you understand this. What I was hoping to talk about is the challenges which you may not have access to in the wider literature. So if some of these terms are a bit confusing, or you're not as aware of them, I would heartily recommend reading up — these resources will give you much more clarity than hearing me talk about them for the next five minutes. But broadly speaking, the AWD-LSTM is a specific form of deep learning — a recurrent neural network architecture which helps you handle sequences — and this particular framework helps you tune it; it's one of the secret sauces that is part of ULMFiT. The big takeaway is that it helps you train your language model better and faster, so that you actually reach convergence.

Okay, so let's talk a little bit about the data, the experiments that we conducted, and some of the results that we obtained. This is what we used. We had Wikipedia data in Indonesian — that's a pretty set pattern, it's pretty much ready to use, and you can get much of this information easily. What we did internally is that the data scientists within the team went out to all our business stakeholders and said: give me every single piece of text that we've ever had. It was a hunt — a treasure hunt. It was absolutely not something that was sitting ready: speak to our PMs, go and dig deep into our repositories, of which there are many, go and dig deep into our data buckets, of which there are many, because when you've been around for seven years things change. In the end we ended up with something like 99 million tokens. Okay, I need to speak into the mic — alright, I'll try to remember that.

What I do want to reflect on here is how messy the real data is compared to the Wikipedia corpus. I want you to bring your eyes to this particular number: in terms of unique words in our corpus that occurred at least five times, the number goes up to 1.9 million, and we take all of them. What actually comprises the dataset is things like chat logs, reviews, help requests — people type in emails saying, yeah, I need a refund, or, you know, I'm having a lot of trouble. They're passionate whenever they type in a wall of text, which means they misspell a lot.
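To make two of the ideas above concrete — what a language model predicts, and what a minimum-frequency vocabulary cutoff does to a noisy corpus — here is a minimal, illustrative sketch. This is not Traveloka's code; the toy corpus and the cutoff value are made up (the talk uses cutoffs of 5 and 10 on a far larger corpus).

```python
from collections import Counter, defaultdict

corpus = [
    "saya mau refund tiket",
    "saya mau ganti jadwal tiket",
    "saya butuh refund hotel",
]
tokens = [tok for line in corpus for tok in line.split()]

# Vocabulary with a minimum-frequency cutoff (2 here only because the toy corpus is tiny).
counts = Counter(tokens)
vocab = {w for w, c in counts.items() if c >= 2}
print(f"{len(counts)} unique words, {len(vocab)} kept after the frequency filter")

# A bigram language model: given the previous word, predict the most likely next word.
bigrams = defaultdict(Counter)
for line in corpus:
    words = line.split()
    for prev, nxt in zip(words, words[1:]):
        bigrams[prev][nxt] += 1

print(bigrams["saya"].most_common(1))   # most likely word after "saya"
print(bigrams["mau"].most_common(1))    # most likely word after "mau"
```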
So I guess the message is that it's very important to keep a handle on how messy real data is. I'd love it if somebody could tell me what a word that's about 118 characters long means — I think I have a pretty decent vocabulary, but I'd love to learn more; likewise this one, I mean, I don't know. Sadly, it's something you have to spend a lot of time managing. If I had to quantify the amount of time we spent on this work, hands down most of it went into cleaning.

Okay, so then what did we do? We wanted to understand how to build a language model which would help us take our NLP offerings in Bahasa to the next level. To do that we chose ULMFiT, and the decision was not purely technical — I think it was partly that the data scientists working on this just wanted to try it out. We could have gone with ELMo or BERT, which operate in the same space, but we chose ULMFiT. So we said, let's have a look and make sure we try both things, let's validate. I think there's certainly a crisis of reproducible research in the community, so wherever possible it's always a good idea to try to replicate results. So we said: let's get a pre-trained word embedding off the shelf — that's always a good idea, except when it isn't — and then let's also try to train our own embedding from scratch; we've got enough data, why not give it a shot. So we did that. Then we said, let's see the impact of accounting for aberrations in our data, and we picked the nice, arbitrary even splits — minimum frequencies of 5 and 10 — and took those for the corpora. Then we split out our final validation sets; these had all the mess — no filters, no frequency filters at all. So we ran these experiments, and these are our results.

We were quite excited when we saw this. The column you need to pay attention to is "perp", which is not short for perpetrator but perplexity, a measure we often use to evaluate the performance of language models. At a high level, perplexity is rooted in information entropy: it asks, roughly, how well the probabilities output by the language model cover the meaningful structure of your overall linguistic vocabulary. Essentially, low numbers are good, high numbers are bad.

This is where things get very interesting, for the keen-eyed among you. Let me briefly describe the structure: we've got the Traveloka corpus results over there, the Wikipedia Indonesian results over here, and ignore this one, I'll come to it later. If you look at the columns, we've also categorised by pre-trained versus our own word embeddings, by the minimum frequency of 5 or 10, and by each validation set. What you see is that our lowest perplexity is about 14.01. If you then compare the perplexity of the models on the Wikipedia dataset, it's a significant difference. For most of you who are asking — yes, it's probably worth it. But here's the thing: the state-of-the-art perplexity numbers for English are around 40, last I checked, for Wikipedia corpora. So the fact that we're getting a number around 14 is really intriguing — the scientist in us was worried we'd overfit, and at the same time it's, wow, a lot better than what we were expecting.
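Since perplexity is the metric being compared here, a small sketch of how it is typically computed from a model's per-token probabilities may help; the probabilities below are invented for illustration.

```python
import math

def perplexity(token_probs):
    """Perplexity = exp of the average negative log-likelihood per token.
    `token_probs` are the probabilities the language model assigned to the
    tokens that actually occurred."""
    nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(nll)

# A model that puts ~25% probability on each correct next word ...
print(perplexity([0.25, 0.25, 0.25, 0.25]))   # ~4.0, "as confused as a 4-way choice"

# ... versus a sharper model: lower perplexity is better.
print(perplexity([0.9, 0.8, 0.7, 0.9]))       # ~1.2
```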
So what could we do with it? To test that out, we also went down the character-level approach, where instead of training word embeddings we train character embeddings. The biggest difference: if you think of a vector embedding in the word space, a word is represented by a single vector; in the character-level space, the vector embedding of a word is composed of the individual vector representations of its characters, so it's much more robust to misspellings and out-of-vocabulary aberrations. Here we did see a bit of a bump, which is very interesting, because the scientist in us was screaming, ah, we'll overfit, let's not do it. But we said, you know, we have what we have, we've spent some time on it, let's actually test it in a real-world use case. We trained it on a very interesting email classification task, and we beat some of the top vendors in this case — our results were hands down superior, by about 7 to 8%, to people we've engaged professionally to do the exact same task. So I guess that's another lesson: whatever you're building, try and see if it actually solves your problem before making a final decision. It's often not easy. There are certainly cases where you say, no, no, no, this raises all the red flags, and at the same time there are plenty of times where you say, okay, I can live with this.

I don't want to take too much of your time, because this next part is actually some of the interesting stuff: managing inference latency. Yes, sure — sorry, do you mind if we just finish first? I will definitely answer. Okay.

So, managing inference latency. When it comes to machine learning in production, the biggest jump that I find a lot of people struggle to make is this: you've trained a model, you've got your notebook, and it's wonderful, it's beautiful — but now you need to productionise it. You don't really think about performance when you're doing the research, but when someone types in a search query, there's a lot of research showing that the latency at which you can return a result has a direct impact on customer satisfaction with that data product. To give you an example: for our search platforms, in terms of our intent detection SLAs, we operate at around a hundred milliseconds. So when somebody sends a query and we need to understand what the query means, or which product it's referring to, we have to give that result back within a hundred milliseconds. Now, deep learning is notorious for taking a lot of resources. We use PyTorch, so this is something a bit specific to PyTorch, but essentially what we found is that just-in-time compilation, which came as part of the 1.0 release, has a real benefit. There's certainly a trade-off to think about: you've done a lot of benchmarks and you say you want to chuck in a 24-core CPU instance, deploy your Kubernetes container over there, and that gives you a certain latency; or you can do it on a GPU in the cloud, if you're on cloud. But there's a trade-off to be made, because one of them costs a lot more than the other. So if you need to do low-latency inference in the deep learning space, it's critical to optimise for it, and that's often not something widely covered in the literature. JIT compilation basically takes your Python model code and compiles it down so that you're serving it as optimised C++ at very high speed.
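As a rough illustration of what this looks like in PyTorch, here is a minimal TorchScript tracing sketch with a toy model — not our production classifier; the module, sizes, and file name are made up for illustration.

```python
import torch
import torch.nn as nn

class TinyIntentClassifier(nn.Module):
    """Toy stand-in for an intent-detection model."""
    def __init__(self, vocab_size=1000, emb_dim=64, n_classes=10):
        super().__init__()
        self.emb = nn.Embedding(vocab_size, emb_dim)
        self.rnn = nn.LSTM(emb_dim, 128, batch_first=True)
        self.out = nn.Linear(128, n_classes)

    def forward(self, token_ids):
        x = self.emb(token_ids)
        _, (h, _) = self.rnn(x)
        return self.out(h[-1])

model = TinyIntentClassifier().eval()
example = torch.randint(0, 1000, (1, 12))   # one query of 12 token ids

# Trace the model into TorchScript so it can be served without the Python interpreter.
traced = torch.jit.trace(model, example)
traced.save("intent_classifier.pt")         # loadable from C++ or Python

with torch.no_grad():
    print(traced(example).shape)            # torch.Size([1, 10])
```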
These are some of the numbers we got. On our MacBook we were seeing a huge difference — a significant gain in latency for our model serving. On our GPU server it's not as profound, but given that most GPU inference is already running in parallel in C++ anyway, it's really only the final wrapper code that can be compiled, so that's not that surprising.

The other thing, which is pretty new and highly experimental but something I think we should talk about a lot more, is model pruning. Essentially, if you look at most deep learning models after they've been trained, they're often quite sparse. What that means is that, if the weights are sparse, the chances are a particular neuron or a particular layer or a particular section isn't really contributing as much to the decision for your particular dataset as it could be — in which case, why not prune it? Think of it a bit like compressing your trained deep learning model. There are various approaches to doing it. We were using PyTorch, which is why we didn't go with the Google implementation, but Intel Nervana have developed Distiller, and they have a pretty interesting codebase. That's another point, right? When you're doing this kind of machine learning in production you're often constrained by time and resources — you say, well, I don't have six months to do this, I can't build it from scratch, is there a tool out there that I can grab and work with? The reason I'm marking this as highly experimental is that we've tried it out, it's worked for us in one very specific use case, and I just wanted to give people a taste of the avenues we have for chasing performance. For our best trade-off, I think we got a 30% reduction in model memory size, which resulted in about a 2.3% drop in performance; inference time was more or less static.

Okay, so the other challenge when it comes to productionising machine learning is that you have to think about model standards. Oftentimes, in fast-moving tech companies, you release products which are quite new, not just to your company but conceptually, so sometimes you don't actually have data on how people are going to use your product. Your pre-production model performs at a certain metric — I'm really happy with the precision, it's beautiful, it's fantastic — and then you launch it and people use it in a very different way. I mean, we've seen people type in math equations, and you're like, why would you do that? But they do, and then the teams are screaming about customer satisfaction. Things like that happen, and you have to account for them. The message is: figure out a way, once you've deployed a specific product, to loop back and iterate very quickly. What we're trying to do at Traveloka is basically adopt a reproducible-research practice, and we think there are four things you need for every machine learning model in order to be able to draw a line back, reproduce, and iterate.

The first is the data: what's your training data, what's your validation set — either snapshot it, or, if you think your data stack allows it to be reproduced exactly, reference it, but store it. The second piece is the model itself: here we recommend storing both your source code, naturally, and the artifact as well, because there's often a discrepancy — you've got the code, you've got the data, you've trained it, but somehow what you stored in the storage of your choice and what you put in your Docker container differ, and you want to be able to isolate issues like that, so store both. The third thing is hyperparameters: models are obviously sensitive to hyperparameters — even things like random seeds have been shown to have a significant impact on loss convergence — so store those as well. And finally, you've got your data, your model, and your hyperparameters, but do you know what result that actually got you? So make sure you're versioning the results too.
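As a sketch of that four-part checklist (data, model, hyperparameters, results), here is one minimal way to version an experiment run. The file layout and helper names are invented for illustration; this is not the tooling described in the talk.

```python
import hashlib
import json
import shutil
import time
from pathlib import Path

def log_experiment(data_path, model_artifact, source_rev, hyperparams, metrics,
                   out_dir="experiments"):
    """Store the four things needed to reproduce a model run:
    1) a fingerprint of the data, 2) the model artifact + code revision,
    3) the hyperparameters, 4) the resulting metrics."""
    run_dir = Path(out_dir) / time.strftime("%Y%m%d-%H%M%S")
    run_dir.mkdir(parents=True)

    data_hash = hashlib.sha256(Path(data_path).read_bytes()).hexdigest()
    shutil.copy(model_artifact, run_dir / "model.pt")

    manifest = {
        "data_path": str(data_path),
        "data_sha256": data_hash,
        "source_revision": source_rev,   # e.g. a git commit hash
        "hyperparams": hyperparams,      # includes the random seed
        "metrics": metrics,
    }
    (run_dir / "manifest.json").write_text(json.dumps(manifest, indent=2))
    return run_dir

# Example usage:
# log_experiment("train.csv", "model.pt", "ab12cd3",
#                {"lr": 1e-3, "seed": 42}, {"perplexity": 14.01})
```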
So these are the practices we're trying to adopt. It's a really tough problem, by the way, so if anyone has any solutions please come and speak to me, but we're trying to get there. In the interest of time I'll skim through this. The other thing is training time: we got lucky, but it still took us about 10 days for our Traveloka corpus, even on a V100 GPU machine, and 6 days for the other corpus. It would have been easy for this to take a few weeks — in fact, some of our early iterations did, just because you train for a week, try things out, and go again. So when you're thinking about your training pipeline, you have something that's slow-moving and yet needs to be deployed very quickly; there's a dichotomy there. It takes a long time to train, but the data powering your algorithm, and the conditions under which your algorithm is successful, will keep changing. So invest in the effort: don't underestimate the amount of effort and time it will take you to go from a notebook to a full-blown, streamlined feedback loop.

Okay, so again, the key takeaways. Think about performance and inference — these are challenges you'll hit whenever you're putting an algorithm into production. Don't optimise prematurely: hands up, we've done absolutely no hyperparameter tuning on the AWD-LSTM, we've just taken the values that were in the paper and in the repo, and we haven't had to so far. If we need to, we'll get there. I just feel it's something you should do only if you genuinely feel it adds value or if something isn't solving your problem. Finally, I think this is a step which people underestimate, and something people take a lot more lightly when working with image or NLP data, somehow, just because you don't have the classic EDA lifecycle of the more quantitative, feature-driven machine learning lifecycle: check your data. I guess what I mean is, it was a learning process, but our first iteration took a few months longer because the gentleman doing the experiment didn't check the token frequencies. So check your data. I guess those are really my three key takeaways. I hope you enjoyed the talk; I'll be more than happy to take any questions — I'll go back to that slide.

Please introduce yourself and where you're from, and then your question.

Hi everyone, I'm Anu from Gojek, a data scientist at Gojek. My question is: can you name the pre-trained word embedding which you tried? And the second question: did you actually try training the pre-trained word embedding on your Traveloka corpus?

Yeah, so that's a pretty good question. Essentially, we're using word2vec as our word embeddings, so we literally took the word2vec embeddings that were available off the shelf from the official release.
When I say pre-trained versus non-pre-trained, just to clarify: the biggest difference in the experiment protocol is that we either started with those pre-trained embeddings, or we went completely vanilla and said, let's start from scratch and build our own word2vec on our corpora.

Which one performed better — the pre-trained ones?

What we found is that not using the pre-trained embeddings performed better. So "false" here means we did use the off-the-shelf word2vec, and here we didn't — and when we also sanitised a little for word frequency, that is what performed best for us.

And in the case of the vectors, did you use CNNs or LSTMs?

No, sorry — these are character-level embeddings, so not RNNs or LSTMs.

We'll take one more question and leave the rest for after the speakers, in the interest of time. Please introduce who you are.

Hi, I'm Raven. My question is: why do you actually need real-time prediction for this? From what I gather, you're generating a word embedding, so you could always pre-compute the inferences and then just look them up, because it's just one embedding at the moment, right?

Not quite. Basically, the way we use this infrastructure is that we plug a classifier onto the end of this entire pipeline, so the word embeddings go through the language model to the final classifier, and that's where the inefficiencies come into play. For example, we might use intent detection in search as a use case; but if you're thinking about natural language generation — here's a query, you understand what it is, but now you want to respond — in that kind of space, latency becomes critical. Sure, you can always try to fudge it and simulate "the robot is typing", but it's better to err on the side of how quickly you really need to respond.

Could I just ask one more to confirm? In that case, since in Indonesia GPUs are not generally available on most cloud platforms, and I guess network latency is one of the major issues, how do you get around that?
That's an interesting one, because when it comes to the stack we've invested in, obviously AWS and GCP mostly operate out of Singapore in this region. So when it comes to handling GPU inference, that's how we're doing it right now — we don't really have any GPUs in Indonesia or an Indonesian cluster, so we have to work with that, and that's why latency becomes such a focus. I guess we think of it as: this is the latency budget you have to play with from the customer's perspective, certain components keep taking chunks out of it, and finally at the end the model only gets whatever is left. So yeah, I guess the answer is there are no easy solutions; we try our best.

So, I'll be speaking about making dynamic, personalised recommendations: the algorithms and the data required to serve this kind of recommendation. Before I start — there's actually a link at the top that you can use to ask your questions; the audience can also go in and upvote questions you want answered later.

As mentioned, I'm currently a data scientist at Traveloka on the recommender systems team. Across my experience I've worked in public service, then in e-commerce, and now finally in travel at Traveloka, on problems in various domains such as language, recommendations, images, and fraud. My educational background is a Master of Science in Statistics, and the tools I'm familiar with are TensorFlow, Spark, and Kubernetes. At Traveloka we use the Google Cloud Platform a lot; it enables us to apply machine learning at scale on our data.

This is the agenda for today. First, why do we want to do recommendations; then some use cases of what recommendations can do; next, a simplified overview of the architecture required to serve recommendations; then I'll cover one recommendation algorithm, called Smart Adaptive Recommendations — it actually involves two models, a similarity model and an affinity model — and I'll explain how they're put together to derive personalised recommendations; and finally some implementation details before I conclude.

So why do we want to do recommendations? The primary reason is that recommendations assist in product discovery as well as decision making. In most internet companies we see observations that look like this: we have a lot of products, but the popular products form a small proportion of the product catalogue, and there's a very, very long tail of products that are relatively less popular. This is also known as the Pareto principle — the important few versus the trivial many. Recommendations help with the discovery of the products in the tail end: because these products are niche, they probably only fulfil the needs of a select few customers, and recommendations help surface them in the customer's app or website. Another reason recommendations are useful is that personalised recommendations promote customer satisfaction, especially when the products recommended are relevant to the user. And having real-time personalisation further encourages user interaction, because if users can see recommendations reacting to their browsing or their usage of the app, then they are encouraged to use the app more, to see what recommendations will be shown next.
So, use cases of recommendations. There are many; I'll cover the two that the model I'll be presenting actually touches on. First is item similarity recommendations. These are usually shown on the product details page of your application. This is what it looks like on Amazon: for example, if I'm browsing for a textbook on social network analysis, the recommendations show me other textbooks on the same topic. The purpose is to show users other alternatives they can consider, given that they're looking at this particular item. The next kind is personalised recommendations — this example is from Netflix. These usually appear on the home page or certain landing pages, they're usually based on inferring the user's preferences, and their purpose is to show users what they may like even though they may not be looking for it yet.

Next, a simple overview of what most recommender systems actually require. It usually consists of three steps across four components. First, we start with the user interactions — the data from users using our website. From there we select item candidates to narrow down the catalogue; this is important especially if you have a large catalogue, because you cannot score all the items in the limited amount of time available, so scoring only a subset of the product catalogue is very important. After we have the item candidates, the next step is scoring them; from that we derive a ranked list of items, take the top K, and display them to the user. At these various steps there are various models that can be used separately. For candidate selection, most of it is some form of user inference — inferring the user's preferences, maybe their activities, their spending power, their brand preferences — and we select candidates based on that. For scoring items, the obvious model is some form of relevance modelling, prediction, and scoring.

I'll be covering just one algorithm, called the Smart Adaptive Recommendations algorithm. I'd actually been doing this at my previous company as well; the concept is quite simple, and I didn't think there would be a name for it, but I saw it on the Microsoft website — I don't think they invented it, but they named it, so I'm going to use this name. This is the overview of the model architecture; it's very straightforward. We start with the one piece of data that's required, the transaction data — this is actually user browsing data, and only four fields are needed: the user ID, the item ID, the time of the interaction, and the event type. The event type will be something like page view, add to cart, checkout, and then purchase. From there we derive two separate models: first the item similarity matrix, and then the affinity matrix. We combine these two models to get a set of scores, and we take the top K items. Don't worry about the picture — I'll be showing it repeatedly in the next few slides.
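To make the earlier candidate-selection → scoring → top-K flow, and the four-field transaction record it consumes, a bit more concrete, here is a minimal skeleton. All names and the dummy scorer are made up for illustration; this is not Traveloka's implementation.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Interaction:
    user_id: str
    item_id: str
    timestamp: float       # unix seconds
    event_type: str        # "view" | "add_to_cart" | "checkout" | "purchase"

def recommend(user_id: str,
              candidates: List[str],
              score: Callable[[str, str], float],
              k: int = 10) -> List[str]:
    """Score a narrowed-down candidate set for one user and return the top K items."""
    ranked = sorted(candidates, key=lambda item: score(user_id, item), reverse=True)
    return ranked[:k]

# Example usage with a dummy scorer:
dummy_score = lambda user, item: hash((user, item)) % 100 / 100.0
print(recommend("u1", ["hotel_a", "hotel_b", "flight_c"], dummy_score, k=2))
```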
So, the characteristics of this model, or this algorithm: it's actually quite straightforward, based on only two concepts — item similarity, and the user's affinity to similar items. In terms of the type of recommendation model, it's based on collaborative filtering with implicit feedback. Collaborative filtering means it's based on users' interactions, as opposed to content-based filtering, which is based on item attributes or user attributes. Implicit feedback means we're not asking users to explicitly rate items; rather, we infer their preferences based on how often they look at certain items. This algorithm is able to serve real-time personalisation — I'll show you how that works later — and another benefit is that it works for new users as long as they have at least one click, one interaction, on the website. Unfortunately it does not solve the cold-start problem for new items, but that's not a major problem, because for personalised recommendations we have limited screen real estate anyway — we can probably show 10 or 20 items at most, so we don't have to score all the items. Finally, this algorithm is able to serve both kinds of recommendations I mentioned earlier.

I'll touch on each of the models separately, starting with the similarity model, which sits at this part of the architecture. How do we get the similarity model from the transaction data? The first thing we need to do is construct the item co-occurrence matrix — you can think of it as a form of market basket analysis. For a pair of items, we're trying to find the number of times they are interacted with together. You can count based on checkouts — the number of carts containing the two items together — or, in my case, I count the number of distinct users that have interacted with both item i and item j within a predefined period of one week. This matrix is a square matrix where each dimension corresponds to every item in the catalogue, and the values inside are the number of distinct users that interacted with both items.

So how do we derive the co-occurrence matrix from the interaction data? We use three columns — the user, the item, and the time — and we do a self-join by user, so we get duplicated columns for the item and the time. If we take the difference between the times, we get the duration between that user's interactions with the two items. We can then filter the interactions based on that duration; this is important because we only want to count interactions that are close together, say within one week, rather than interactions that are months apart. After filtering, we group by item i and item j and count distinct users. What everybody should know is that to do this you need a very efficient query engine to support the large volume of data. And how does this correspond to the co-occurrence matrix? Item i corresponds to the rows, item j corresponds to the columns, and the count goes inside the cells of the square matrix.
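A minimal pandas version of that self-join and distinct-user count is sketched below, purely to illustrate the shape of the query; at production scale this would run on a distributed query engine rather than pandas, and the column names and toy data are made up.

```python
import pandas as pd

# User browsing data: one row per interaction.
logs = pd.DataFrame({
    "user_id": ["u1", "u1", "u1", "u2", "u2"],
    "item_id": ["hotel_a", "hotel_b", "hotel_c", "hotel_a", "hotel_b"],
    "ts": pd.to_datetime(["2019-06-01", "2019-06-02", "2019-08-01",
                          "2019-06-05", "2019-06-06"]),
})

# Self-join by user to get every pair of items the same user touched.
pairs = logs.merge(logs, on="user_id", suffixes=("_i", "_j"))
pairs = pairs[pairs["item_id_i"] < pairs["item_id_j"]]   # keep each unordered pair once

# Only count pairs whose interactions happened within one week of each other.
close = pairs[(pairs["ts_j"] - pairs["ts_i"]).abs() <= pd.Timedelta(days=7)]

cooc = (close.groupby(["item_id_i", "item_id_j"])["user_id"]
             .nunique()                                   # distinct users per item pair
             .rename("n_users")
             .reset_index())
print(cooc)
```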
After we get the co-occurrence matrix, the next step is usually some form of matrix factorisation, as shown here. The interesting thing about matrix factorisation is that we get latent factors, which are a compressed form — you can think of the latent factors as providing a form of embedding for the items. For example, if you look at latent factor U and item i, item i is represented by an embedding of three dimensions; similarly, under latent factor V, item j is represented by an embedding of three dimensions. We'll use those embeddings later to compute the similarity. In terms of matrix factorisation there are a few obvious algorithms: alternating least squares, stochastic gradient descent if memory is a problem, and finally the GloVe model. GloVe stands for global vectors — word embeddings based on local context; if you look at the paper, you'll see it's actually doing a form of matrix factorisation on a co-occurrence matrix. The concept is similar, so we can apply the same optimisation algorithm, with the same loss as in the GloVe paper, to derive embeddings for the items. After we get the embeddings, getting the similarity matrix is very simple: just cosine similarity on the embeddings, which is a very common approach for getting similarity from embeddings. The whole assumption of this model, when calculating similarity, is that items that are similar will tend to be interacted with close together in time by many, many users. This item similarity matrix can be pre-computed in batch on a schedule, and with it alone we can already serve item similarity recommendations: given an item, we fetch the scores, sort by them, and show the top items in the item similarity use case.

Next, the affinity model is also derived from the same transaction data. How do we do that? This step is relatively straightforward — it's just a bunch of computation over the user interaction data. We calculate two values: a time decay based on the time of the interaction, and an event weight depending on the type of event — of course we want to weigh events like checkouts and purchases higher than page views. Given these two columns, we combine them to get the affinity score, so this takes into account the type as well as the recency of the interaction. You can see the formula here: W represents the weight of the event, and the exponent is a time decay with a certain half-life. The half-life is a hyperparameter you have to define depending on what you want; in general something like 24 hours or 7 days can work well. From this you get an affinity vector that looks like this. The assumption here is that users have preferences based on the items they recently interacted with. If you think about it, the affinity vector on its own is almost just a recommendation showing the last few items, so it's not very useful by itself, but we use the affinity vector to derive the candidate items that we'll score for the user. What's important is that, for personalised recommendations to be done in real time, this affinity vector has to be computed in real time.

So how do we put the two models together to finally get the personalised recommendation scores? It's actually quite mathematical — just linear algebra. Given the affinity vector and the similarity matrix, you do a dot product and you get a set of user-item relevancy scores; given those scores, you just sort in descending order, take the top 10 or top 100, and then you can return your recommendations to the user.
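Here is a small numpy sketch of that affinity computation and the dot-product scoring. The event weights, the half-life, and the base-2 exponential decay are assumptions chosen for illustration, not the exact production formula; the general shape is affinity(u, i) = Σ over interactions of w_event · 2^(−Δt / half-life).

```python
import numpy as np

ITEMS = ["hotel_a", "hotel_b", "flight_c"]
EVENT_WEIGHT = {"view": 1.0, "add_to_cart": 2.0, "checkout": 3.0, "purchase": 4.0}
HALF_LIFE_HOURS = 24.0

def affinity_vector(events, now_h):
    """events: list of (item_id, event_type, time_in_hours).
    affinity(item) = sum over events of weight * 2 ** (-(now - t) / half_life)."""
    aff = np.zeros(len(ITEMS))
    for item, etype, t in events:
        decay = 2.0 ** (-(now_h - t) / HALF_LIFE_HOURS)
        aff[ITEMS.index(item)] += EVENT_WEIGHT[etype] * decay
    return aff

# Pre-computed (item x item) similarity matrix, e.g. cosine similarity of the embeddings.
similarity = np.array([[1.0, 0.8, 0.1],
                       [0.8, 1.0, 0.2],
                       [0.1, 0.2, 1.0]])

user_events = [("hotel_a", "view", 40.0), ("hotel_b", "add_to_cart", 47.0)]
aff = affinity_vector(user_events, now_h=48.0)

scores = aff @ similarity                 # user-item relevancy scores
top_k = np.argsort(scores)[::-1][:2]      # sort descending, take top K
print([ITEMS[i] for i in top_k], scores[top_k])
```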
The intuition is that an item gets a high score for a user if it has high similarity to many items the user has recently interacted with. How do you get that? Given this formula, for the score R to be high, the item has to have high similarity with many items, and those items also have to have been recently interacted with, so the affinity score is high — it's the combination of the two that gives the final relevancy score.

Next, let me cover some implementation details as well as improvements. I've been saying that the affinity vector needs to be computed in real time, and that actually poses some very important engineering challenges. First of all, we need accessible, live tracking of the user interaction logs. What I mean by that is that, in order for the affinity vector to be computed in real time, we first need to be able to track the user's interactions in real time — that means once the user has clicked on something on the web page, it must be inserted into the database, and not only that, we must be able to retrieve the information from that database with essentially no latency, immediately. This poses a very strong engineering challenge. And then next, of course, the computation of the