Hello everyone. Today I will talk to you about what we do at TrustYou to make it possible for travelers all over the world to find their perfect hotel.

First, something about myself. I'm from Italy, and I have a background in both computer science and mathematics, so I am, let's say, both a computer scientist and a mathematician, or maybe neither. I then did my PhD in Karlsruhe, Germany; the topic of my thesis was essentially the development of algorithms for network analysis. I defended my thesis last year in December, and since February I have been working as a data scientist at TrustYou, with a focus on natural language processing, algorithms, and machine learning.

So what is the problem that the meta-review team at TrustYou, the team I'm part of, is working on? For every hotel in the world, we want to provide a summary of traveler reviews. Why would we want to do this? Why would we want to provide a summary?
Well, I guess most of you, whenever you book a hotel, first look at reviews from other travelers. If you think about it, it's quite amazing that we can do this, because twenty or thirty years ago there was no way of doing it. When booking a hotel, people could basically only pick one based on how it looked from the outside, or on a recommendation from friends who had already been to the city. What we have now is, for a city we want to go to, essentially millions of friends who can tell us their experience with the hotels in that city.

On the other hand, the problem is that we cannot really read millions of reviews whenever we want to decide which hotel to go to. What most people do is read a few reviews and then make their decision based on that. But different travelers have different expectations, and they look for different things in a hotel, so the experience in the few reviews we read from other travelers may not really reflect our own expectations. So what we want to do is provide a summary containing all the relevant information from all the reviews we can find online for each hotel, and make it short, so that people can read it and immediately get all the important information about the hotel.

Here are a few examples of the information we can show about a hotel. We can be interested, for example, in the hotel building: is it modern, is it clean? Or in the view the hotel has, or whether it's good for partying. We can also show details such as "solo travelers complain about the TV."
So this is also the kind of detail we can provide. I wanted to show how it looks on our website. Can you see it? No, you can't. Okay, maybe I'll show it later, but let's go on.

This information is not only provided on our website; we also provide it to other websites. You probably know Kayak, and you might have seen this on Kayak already: Kayak can show you information about hotels, for example the amenities, the vibe, the location. This is all data we provide to Kayak. And not only Kayak, but also Google. Whenever you google for a hotel, you might have seen this already: this is also data we provide. For example, here we show the score given by different types of travelers. This part shown by Google is also data we provide, and it's a shorter version of our meta-review: a summary, for example, of the rooms, where guests like the rooms but notice that maintenance could be improved, along with information about the location, the service, and so on.

So how do we actually build our meta-review? Every week we crawl reviews from many different sources on the internet. We have about 620,000 hotels in our database, and we crawl quite a lot of reviews: around three million new reviews per week. That's quite a lot of data. We store this data in a Hadoop cluster, and the first step in analyzing it is the semantic analysis.
For each sentence in each review, we map each part of the sentence to categories: for example, this part of the sentence refers to the bathroom, or this other part refers to the breakfast of the hotel. In addition to categorizing these parts of sentences, we also compute a sentiment for them: was the sentence positive, was it negative? After this, we aggregate the data, apply machine learning algorithms to it, and in the end we generate the text that you saw before. We show this text on our website, but we also provide it to Google, Kayak, and actually more: Hotels.com, HolidayCheck, and several other websites.

This is just to give you an overview of the technologies we use to build the meta-review. Python, of course: almost all of the code we write is written in Python, which is why I'm presenting this here. We use Luigi for our pipelines, Hadoop and Spark for processing the large amounts of data we have, and Postgres and MongoDB as databases. What you see on the right are some of the libraries we use for machine learning.

Now that I've given you an overview of what we do at TrustYou and what the meta-review team does, I would like to talk about a specific problem my team has been working on in the last months: hotel classification. The questions we are trying to answer here are, for example: which are the most romantic hotels in town? Say you want to go on a very romantic weekend and are looking for the best hotel for it. Or you are looking for a hotel that is appropriate for a family holiday. Here are some more examples: which hotels have the best casinos, the best lake view, which are the best ones if you want to go to a golf course, or which have the best sea view. There are many, many questions we can consider.

Our solution to this problem is composed of the parts I'm going to describe. First of all, we represent the reviews as vectors; that's the first thing we need to do in order to then be able to apply machine learning algorithms to them for classification. In particular, we consider two ways of representing reviews as vectors: one is TF-IDF, which I'm going to present soon, and the other is doc2vec embeddings, which I'll also tell you a little about. After this, as I said, we apply machine learning algorithms, and we also combine the review content with geographical data, whenever it makes sense, to improve our classifications.

Now I'm going to talk about TF-IDF. This is the only slide with formulas, so don't be scared; I'll give you an example afterwards.
It's actually quite a simple method. The idea behind TF-IDF, which stands for term frequency–inverse document frequency, is to reflect the importance of a term t for a document d in a corpus, that is, a collection of documents.

We can introduce the term frequency, which for a term t in a document d is simply the number of occurrences of t in d divided by the total number of words in d. We might then assume that a term is important for the document if its frequency is high: if a term appears very often in a document, it should be quite important. But this is not always the case. If we consider hotel reviews, words such as "hotel" are going to appear very often in reviews, so they might not be relevant for the specific review we are considering. That's why there is a second quantity, the inverse document frequency, which gives a higher score to terms that do not appear very often in the other documents of the corpus, i.e. that are more specific to the document we are considering. It is defined as the logarithm of N, the number of documents in our corpus, divided by the number of documents in which the term t appears. This will be high if the term appears only in a few documents. In the end, the TF-IDF score is simply computed as the product of the term frequency and the inverse document frequency.

To give you an example, let's say our document D is "I hope this talk is not too boring", and our corpus is composed of three documents: "I hope this talk is not too boring", "This talk is just as boring as filing tax returns", and "Tax returns are not that boring". Let's compute the TF-IDF score of "boring". The term frequency is simply one divided by eight, because there are eight words in our document and "boring" appears only once. But the inverse document frequency is the logarithm of three (we have three documents) divided by three (because "boring" appears in all of them), and the logarithm of one is zero. So the TF-IDF score of "boring" is also zero. This makes sense if you think about it, because "boring" is not specific to document D: it appears everywhere. If we consider "hope" instead, the term frequency is the same as for "boring", since it also appears once, but the inverse document frequency is higher, because "hope" appears only in this specific document: we get the logarithm of three divided by one. So in the end the TF-IDF score is also higher.

Now you might be asking yourselves why we want to compute the
TF-IDF score of the terms in a document. The idea is that if we compute the TF-IDF score for each term in the corpus, then we can represent each document as a vector of the TF-IDF scores of the terms present in the corpus. To continue the same example as before, document D, "I hope this talk is not too boring", can be represented as the vector you see here. A lot of terms have a score of zero: for "boring" we saw it already, and the other terms on the right do not appear in D at all (they are in the other documents of the corpus, but not in D), so their term frequency is zero.

The idea behind this is that the TF-IDF vectors of similar documents will also be similar to each other. Say you have a corpus composed of recipes for different foods, and you consider the TF-IDF vectors of two chocolate cake recipes: those vectors will be close to each other, because words such as "chocolate", "butter", or whatever other ingredients are in a chocolate cake will have very high TF-IDF scores in both recipes.

In our case, of course, we're talking about reviews, so each document is basically the set of all reviews for a certain hotel. The idea is that if we have a training set, i.e. a set of hotels for which we know that they are of a certain category, for example family hotels, then we can use machine learning algorithms to classify all the other hotels and say whether they are also family hotels or not. Creating training sets is quite important, and we actually spent quite some time on this, because if you don't have reliable training sets, you can be pretty sure your algorithm won't work well.
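As an aside, the TF-IDF computation from the example above can be sketched in a few lines of Python. This is a toy illustration, not the production code described in the talk; in practice one would rather use a library implementation such as scikit-learn's `TfidfVectorizer`, which adds smoothing and normalization on top of the basic formula.

```python
import math

def tf(term, doc):
    """Term frequency: occurrences of `term` divided by the number of words in `doc`."""
    words = doc.lower().split()
    return words.count(term) / len(words)

def idf(term, corpus):
    """Inverse document frequency: log(N / number of documents containing `term`)."""
    containing = sum(1 for doc in corpus if term in doc.lower().split())
    return math.log(len(corpus) / containing)

def tf_idf(term, doc, corpus):
    return tf(term, doc) * idf(term, corpus)

corpus = [
    "I hope this talk is not too boring",
    "This talk is just as boring as filing tax returns",
    "Tax returns are not that boring",
]
d = corpus[0]

print(tf_idf("boring", d, corpus))  # 0.0, since "boring" appears in every document
print(tf_idf("hope", d, corpus))    # (1/8) * log(3), roughly 0.137

# Representing document d as a vector over the whole corpus vocabulary:
vocab = sorted({w for doc in corpus for w in doc.lower().split()})
vector_d = [tf_idf(t, d, corpus) for t in vocab]
```

Note that terms absent from d, as well as terms appearing in every document, both contribute zeros to the vector, exactly as in the slide's example.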
So we build them based on review content; we also consider the amenities these hotels have, and when it makes sense we also use geographical information. For this we use OpenStreetMap, which is actually quite an amazing project: it contains geographical information for many, many categories, so for us it was very helpful. Examples of the information it contains are coordinates of coastlines, highways, ski lifts, all kinds of tourist attractions, and also golf courses and casinos; really a lot of information.

I will now also tell you something about word2vec, which is, let's say, the baseline for another technique we use. The idea here is quite different from TF-IDF: we say that words are similar when they appear in similar contexts. So we also take the context of words into account, and not just their frequency. And what is the context of a word? If you consider the sentence here, the context of "fox" is the set of words preceding and succeeding "fox". In this case we have a window of two, and therefore "quick", "brown", "jumps", and "over" are the context of "fox". The basic idea is that synonyms, like "intelligent" and "smart", will appear in similar contexts. To create a word2vec model, we train a neural network on the words and their contexts, and the hidden layer of this neural network is then used to represent each word as a vector, because that's in the end what we want to do: represent words as vectors. Words with similar contexts will then result in similar vectors.

This is a quite famous plot; you might have seen it already. The website where I found it mentioned that it's illegal to talk about word2vec without showing it.
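As a small illustration (a hypothetical helper, not from the talk's codebase), the window-of-two context extraction just described could look like this:

```python
def context(words, index, window=2):
    """Return the words within `window` positions before and after words[index]."""
    left = words[max(0, index - window):index]
    right = words[index + 1:index + 1 + window]
    return left + right

sentence = "the quick brown fox jumps over the lazy dog".split()
print(context(sentence, sentence.index("fox")))
# ['quick', 'brown', 'jumps', 'over']
```

These (word, context) pairs are what a word2vec-style model is trained on; words at the edges of a sentence simply get a smaller context.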
So here's the plot. What you can see here, which is quite nice, is that word2vec captures relations between words. In particular, the distance between the vectors of pairs of similar words is also similar: the distance between "king" and "queen", for example, is about the same as the distance between "man" and "woman", and we can write this nice equation: king minus man plus woman equals queen, which actually makes sense.

In the end we don't want to represent just words as vectors; we want to represent reviews, so documents. For that we use doc2vec, which, let's say, combines the vectors of all the words in a document into one final vector.

Okay, so to give you an idea of how our classification pipeline works: the first thing, as I said, is to transform reviews into vectors. We do this with either TF-IDF or doc2vec embeddings, and since we have reviews in many different languages, we do this separately for each language. Then, again separately for each language, we use a classifier called gradient boosting. I'm not going to talk about it now because we don't really have enough time, but it's basically an ensemble of decision trees; you can take a look at the Wikipedia article, and there are many resources online. In the end, we combine the predictions of the per-language classifiers into one final classifier, giving weights based on the number of reviews we have in the different languages. So if for a hotel we have 90% of the reviews written in German and only a few reviews in Italian, we give a higher weight to the German classifier.

Sounds like everything should work fine, right? This makes sense. And yeah, it did work.
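That weighting step can be sketched as follows; the weighted-average form and all the numbers here are illustrative assumptions, not the exact production formula:

```python
def combine(lang_probs, lang_counts):
    """Weighted average of per-language classifier probabilities,
    each language weighted by its share of the hotel's reviews."""
    total = sum(lang_counts.values())
    return sum(p * lang_counts[lang] / total for lang, p in lang_probs.items())

# Hypothetical hotel: 90 German reviews, 10 Italian ones.
probs = {"de": 0.9, "it": 0.2}   # per-language P(hotel belongs to the category)
counts = {"de": 90, "it": 10}
print(combine(probs, counts))    # 0.9 * 0.9 + 0.2 * 0.1 = 0.83
```

The German classifier dominates the final score here simply because most of the hotel's reviews are in German, which is the behaviour described in the talk.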
It does work quite well in most cases, but now I would like to tell you about a few problematic cases we ran into; they're also kind of funny.

One problem we had when classifying golf hotels was that some hotels that were actually not close to any golf course got positively classified. What was quite weird was that only the German classifier, the classifier for the German language, was really confident about these hotels being golf hotels; for the other languages this was not the case. We were really confused about what was happening and why only Germans were talking about a golf course. What we then realized is that the German word for the English "gulf" is "Golf". The reviews were referring to the Gulf of Naples, so they were not really referring to a golf course, but there were still a lot of mentions of the word "Golf", of course.

Another problem was that in a small town outside Atlantic City, there were a lot of mentions of casino-related terms, actually a lot of positive mentions, although there was no casino close to the hotel.
So we were a bit confused, and it turned out that in this town, which is a bit far from Atlantic City, people were really happy about being far away from the casinos and being in a quiet place, and that's why there were all these positive mentions of "casino".

These are of course quite complicated problems, and there's still room for improvement, but the solutions we found so far are, first, to also use geographic data. This helped quite a lot with problems such as the casinos and golf courses, because we can use this information to make sure that we don't classify hotels that are not close to the amenity we're considering. In addition, we perform quite an extensive cross-validation in order to make sure we pick parameters that guarantee a high precision. As ideas for future work, we could consider disambiguation techniques; this is actually an idea a colleague gave me quite recently, so it's something we might look into. We could also think of combining TF-IDF with word embeddings: what we are doing right now is, for each category, just picking whichever of TF-IDF and doc2vec embeddings works best, but there are also ways of combining both, so this might lead to even better results.

This is basically what I wanted to present to you. If you have any questions, feel free to ask, or you can contact me; this is my email address. You can also take a look at our website, where you can find a little bit about what the engineering department at TrustYou is doing, and also something about job openings, if you're interested. That's basically what I wanted to tell you. Thank you.

Q: Hi, thank you for the nice talk. Do you have any intuition on when TF-IDF works best and when doc2vec works best?

A: Yeah, okay. My intuition, and something that has been shown in practice, is that doc2vec doesn't work very well when your document is short, so in those cases it might be that TF-IDF works better. This is something we still have to investigate a bit more, though. What we did was mostly comparing the two methods and seeing which one performed best; why exactly this happens is maybe something we can look into further. But this is certainly one of the reasons: the length of the document influences this quite a bit.

Q: Hi, first of all, thank you, I think it was a great talk. I was wondering whether you considered that many of those reviews might be fake, and whether that's a problem you need to tackle.

A: Yeah, this is actually a very good question. In general, at TrustYou we only consider reviews from verified sources: for example, reviews from Booking.com, where you know for sure that one can only post a review if this person has really stayed at the hotel. We also use reviews from Google. So we don't use sources where there's no way of really identifying the person. But apart from this, whether a review is fake or not is not something we are checking yet. It's something we have talked about, though, so we certainly plan to consider it in the future.

Q: Thanks for your talk, it was really good. Do you have any recommendations for tools for labeling data? I struggle with labeling data; do you have any recommendations for tools or applications that help with that?

A: Okay, what do you mean exactly by labeling?

Q: So say you've got web pages and you want to label the data you find in the HTML. Do you have any tools that help with that?
A: I'm not sure.

Q: So for building a training data set, you've got some data, you start labeling it, and you've got lots of it to do, right? Are there any tools?

A: We didn't really use any tool in our case. We mostly considered review content information, the frequency of terms, and we looked at amenities and things that are specific to our problem, like, for example, the vicinity to lakes for lake hotels. So we didn't use any specific tool; keep looking.

Q: Thank you for the great talk. You mentioned that example with German, where "Golf" means two things. Do you know how word2vec behaves when you have words like that with different meanings that fall on the same token? Like, what does the German "Golf" vector look like?

A: Yeah, that's a good question. I'm not sure exactly; that's a difficult question.

Q: Do you use something like Google Translate for the different languages, or what do you do?

A: No, no, we don't. We really consider the classifiers separately, so we don't translate any text: we have separate classifiers for German and separate classifiers for English. We don't use Google Translate.

Q: Hello, thank you for the talk. I wonder about the data pre-processing: is there some data cleansing that you do, to clean up the data set and perhaps identify names of places, things like that?

A: I can't give you too many details about this, because it's actually done by a different team. There's another team that performs the semantic analysis and also takes care of cleaning the data, for example removing stop words, tokenization, lemmatization, all these kinds of things. Sorry, your question: you asked what we do exactly?

Q: Yeah, I was wondering about the general ideas, if you could talk a little about that.

A: I can only give you these general ideas, because this is something a different team does. I don't know, maybe, Stefan, do you want to add something on this? It's okay; maybe we can take this offline then. Other people here from my company can also give you more details.

Q: Okay, thank you.

Okay folks, we are out of time, so let's thank our speaker again.