Hello everyone. It's super nice to have you here today. This is the first time I've talked to this many people, so I'm kind of nervous and anxious. Thank you for being here for me. This is an amazing experience. Thank you very much.

So today I'm going to talk about recommendation engines and how to build a simple recommendation algorithm. But first, if you don't mind, I would like to give you some context about me and also break the ice.

I come from Brazil, with an S actually. I know that you know Brazil with a Z, but we write it with an S. We have more than 8 million square kilometers of territory and we border 10 countries. This is huge. But I come from the very south of Brazil. I'll show you. This is Rio Grande do Sul.

So a little bit about Brazil. Our language is Portuguese. I don't know if I have another Brazilian here. Oh, nice. Nice to see you. The Portuguese landed a bit more than 500 years ago. We have 30 years of a weak democracy and, I don't know if you've heard, but we keep having coups d'état all the time. We have unlimited natural wealth. And "Hi, how are you?" is "Oi, tudo bem?"

Well, the very south of Brazil is Rio Grande do Sul. We like churrasco and chimarrão. We don't have much samba, nor the warm weather, actually. We border Argentina and Uruguay. We are full of plains, so we use the land a lot for agriculture and cattle breeding. Our temperature can vary from 28 to 10 degrees in the same day. And the most traditional thing we have in Rio Grande do Sul is erva-mate. The drink is called chimarrão. So this thing in the lady's hand is the chimarrão. You can also see a very huge piece of meat being cooked by a fire that is in the ground. I don't know if you have heard about chimarrão, and I don't know why, maybe it's because it has caffeine, but some footballers have been drinking it, like the English footballers and Messi. He's Argentinian, right?
You know that Brazilians and Argentinians have some rivalry. And this is Ronaldinho also drinking chimarrão. He's from Porto Alegre as well. And the curious thing is that it's a hot beverage and we drink it on the beach in the hot weather. It doesn't matter. We like it.

So this is me. I'm a developer. I'm an economist. I'm doing a master's degree in statistics. I'm a data enthusiast. I'm a cat person. And I am addicted to travel. This was me on my first trip, talking about how to deal with the frustration of not having your hypothesis proven as a data scientist. I don't know if I have data scientists here. But sometimes we spend days trying to prove a hypothesis, and it doesn't turn out like we want. But the thing is that even if the hypothesis is not accepted, we can still gather information from it. So yeah, this was in the human data science.

Okay, let's go. What about recommendation systems? The key words for recommendation systems are revenue and customer engagement. What happens is that we as customers are overloaded with information. We have a lot of items out there, a lot of movies to watch, a lot of, I don't know, videos. And sometimes we don't know what to watch or what to buy, or what is best for us. So we have a lot of information. This is a very new, personalized way of selling, buying, watching and getting to know things. And, well, McNee said that a recommender helps groups of users, or users, to select items from a crowded item or information space.

Well, Amazon, YouTube, Netflix. I mean, there are a lot of others like Udemy, Google, oh my God, almost every website and every e-commerce site uses it. But just to bring some examples: Amazon, in one quarter, had almost $13 billion of revenue. This was the first quarter after they implemented the recommendation engine, and it was 30% more than the same quarter of the previous year.
So for YouTube, more than 70% of user consumption comes from recommendations. So when you are watching a video and there is another video coming up next, or you go through the feed and there are some videos recommended for you, that is where most of the consumption on YouTube comes from. And for Netflix, 75% of consumption comes from recommendations. So we can see that recommendations are a very big deal.

So when we need to recommend something, we might be looking for answers to two problems: prediction and rating. There is an approach to recommendation that is used a lot and that has no artificial intelligence: showing the most sold or the most clicked items. It can easily be done with a query in the database, showing the user which items have sold the most and so on. But what we want to show are things that the customers are likely to love, likely to buy, likely to watch. So we don't just need to predict the rating for an item, but also whether they might like it or not. Okay. A little bug there.

But how do you get this data, right? Data is the key, and we have two kinds of data. I'm saying two kinds because that's how I separate them. I don't know, maybe there are more kinds that we haven't talked about, but we can put everything in these two categories.

First, implicit data. It's about tracking. It's about the data that the customer didn't give to us directly, unlike their name or address: where did they click? What kind of movie do they like? We can collect this kind of data, right? Big data.

And we have explicit data. This is more difficult to get because it depends on an action of the user, like answering a survey or rating items. I don't know, I think I have never rated an item. I don't know about you, but it's not common. I mean, thanks to the people who rate, because I'm always looking at the ratings, but yeah, I have never.

So these are the basic models of recommendation systems with AI, okay?
So we have the content-based, the collaborative and the hybrid solutions. I mean, companies like Amazon, Netflix and YouTube are using hybrid solutions that are based a lot on content and a lot on collaboration, and that have some secrets in the recipe that we don't know.

So, content-based. The recommendations are based on the description of the item, or the synopsis, or the genre, or even the author. There is an article on the Internet about an author who launched a book in, like, 2010, and the book didn't sell very well. That's okay. In 2015, another author launched another book whose content was very similar to the one launched in 2010. And what happened was that a lot of people started buying the 2010 book, and the author was like, oh, what's happening? Tracing it back, they saw that it had started to sell more because of this new book that was very similar to it. So it happens.

So content-based recommender systems are born from the idea of using the content of each item for recommendation purposes. This avoids the cold start problem. Wait for it, I'm going to talk about it. And content representations open up the option of using different approaches like NLP or text processing techniques and semantic information. There is a whole world of tools to analyze this data.

Another one is the collaborative model, and it's memory-based. Recommendations in this case are based on social interaction between users and on ratings provided by other users. This collaborative model is called collaborative filtering, and it's divided into user-based and item-based.

So let's see this problem. It's Saturday night. I am at home. I open my favorite streaming app and I don't know what to watch. But I know my friend Marta likes thrillers and dramas. So do I. Maybe it's a good idea to ask her for a movie recommendation. And this is the truth, okay? Marta is my friend.
I didn't know what to watch and I was like, oh my God, I'm always like this. I don't know if you believe in this kind of thing, but I'm a Gemini and I can't make decisions. So yeah, I was like, oh my God, what am I going to watch?

So I'm showing you user-based filtering. This on top is my friend Henry. He likes orange, grapes, raspberry and banana. Marta likes grapes, and Dani likes grapes and banana. Maybe Marta could like orange or banana or raspberry and she has just never tasted them. Maybe I can recommend them to her. But looking at this picture, I can see that Dani is more similar to Henry because she likes more fruits. Maybe Marta just likes grapes. And since Dani likes more fruits and is more similar to Henry, maybe I can recommend her oranges and raspberry.

Well, but what about item-based filtering? Again, Henry likes the same fruits, and so do Marta and Dani, but if I go and check the items, I see that more people like banana. Maybe I can recommend a banana to Marta. And maybe she'll like it.

So collaborative filtering has been shown in the market to be more accurate than content-based. It's easy to implement, but sometimes it's not a good idea to implement it. We have to keep this in mind. When is it not a good idea? When we don't have enough knowledge about the item and the user. If we don't have enough ratings, I don't know anything about my user. I can't do the math to see which user or item this user is most similar to. So how can I recommend something? And also when the item has not been rated enough: it's a new item and I don't have any information about it.

So, a problem: cold start. Cold start, as I mentioned before and will now explain, is the expression we use when we don't have much information about the user and would like to recommend something.
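The fruit example from the slide can be sketched in a few lines of Python. This is only an illustration, not the code from the talk: the function names are made up, "similarity" is assumed to be simply the number of fruits two people share, and the item-based part just counts how many other people like each fruit, as described above.

```python
from collections import Counter

# Who likes what, exactly as in the fruit example from the talk.
likes = {
    "Henry": {"orange", "grapes", "raspberry", "banana"},
    "Marta": {"grapes"},
    "Dani": {"grapes", "banana"},
}

def most_similar_user(likes, target):
    """Assumption: similarity = number of fruits both people like."""
    others = [user for user in likes if user != target]
    return max(others, key=lambda user: len(likes[target] & likes[user]))

def user_based(likes, target):
    """Recommend what the most similar person likes and the target has not tried."""
    neighbour = most_similar_user(likes, target)
    return sorted(likes[neighbour] - likes[target])

def item_based(likes, target):
    """Recommend the untried fruit that the largest number of other people like."""
    counts = Counter(
        fruit
        for user, fruits in likes.items()
        if user != target
        for fruit in fruits
        if fruit not in likes[target]
    )
    return counts.most_common(1)[0][0] if counts else None

print(user_based(likes, "Dani"))   # ['orange', 'raspberry']
print(item_based(likes, "Marta"))  # banana
```

With so few people, ties are easy to hit (Marta shares exactly one fruit with both Henry and Dani), which is a small preview of why sparse data is a problem for this approach.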
One thing that helps with the cold start: I don't know if you have Netflix, or there are other sites like this, like Pinterest. You go there for the first time, you register, and the website asks you to select the items or genres that you like most. This helps the algorithm recommend things to you and avoid the cold start. Sparsity is the problem we have when we don't have many ratings, or when there are new items that we have no information about. So collaborative filtering is not a good idea when we don't have any information, because it is very much based on similarity calculations.

So let's build our own recommendation algorithm. These are the basic steps to build the recommendation system. First we need to calculate: we choose how to do the math to find the similarity coefficient between users. Then we have to predict: we find the predicted score for the movies that the user hasn't watched. And then tell: we let the user know what has been predicted for them. So: calculate, predict, recommend.

So let's go back to my problem. It's Saturday night. I want to watch a movie. I don't know what to do. So maybe I will ask Marta, but maybe I can try another friend? The thing is, I have this database. And as I said, it's true. My friends gave me these ratings. It's good because now we know how similar we are and how to get recommendations. So this is our database. Let me see if I can have a pointer here. Okay. So The Wolf of Wall Street, Dani hasn't watched it. Cool Runnings, I didn't, nor Baby Driver. Dani didn't watch Cool Runnings either. Henry didn't watch Baby Driver. And Otavio didn't watch The Lord of the Rings.

So let's do the calculation to see the similarity between us. What I have done here is plot a two-dimensional graph of The Wolf of Wall Street and The Devil Wears Prada, and I'm looking at how the data is spread out.
So here I can see that Marta gave a 3 to The Devil Wears Prada and a 3 to The Wolf of Wall Street. I gave 4 and 4, and Philip gave 3 to The Devil Wears Prada and 4.5 to The Wolf of Wall Street. When I look at this graph, I try to understand who is closer to me, because I'm trying to get the similarity here. When we talk about similarity in data, it's very much about where we are close together, which data points we share, and things like this. When I want to recommend something to someone, I really want that someone to be very similar to me, so I know that this person is going to like it.

Thinking about it, I can try to measure the distance. From Philip, I'm at a distance of 0.5. And from Otavio, well, this is a triangle. So let's get the hypotenuse using Pythagoras. Do you remember that from school? We take the length of this cathetus and this one, square them, sum them up, take the square root, and we get the hypotenuse. Doing this math, I know that I'm at a distance of 0.721 from Otavio. So I'm more distant from Otavio than from Philip.

This is the Euclidean distance. It is also the math used for k-nearest neighbors, a widely used algorithm in machine learning. I could have just shown this formula, but I thought that if I showed you the triangle and Pythagoras, it would be easier, because I don't think people like symbols, characters, numbers and exponents all together. So this is just Pythagoras, summing up over all the dimensions we have. I showed you a two-dimensional graph, but if I have a lot of movies, I will have a lot of dimensions. So this is how we are going to make the similarity calculation.

So if I run the function, similarity, get similar, sorry, I did this, I will show you the code. And it shows that, regarding all movies, I'm more similar to Otavio than to anyone else. So let's predict.
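The distance calculation described above can be sketched in Python as follows. This is a minimal sketch, not the code shown in the talk: the ratings are made up (chosen so that the two distances come out as 0.5 and 0.721, like on the slide), the names `euclidean_distance`, `similarity` and `rank_by_similarity` are hypothetical, and turning a distance into a similarity with 1 / (1 + distance) is one common choice that is assumed here.

```python
import math

# Hypothetical ratings on a 1-5 scale, NOT the table from the talk.
ratings = {
    "me": {"Movie A": 4.0, "Movie B": 4.0},
    "friend_1": {"Movie A": 4.0, "Movie B": 4.5},
    "friend_2": {"Movie A": 3.4, "Movie B": 3.6, "Movie C": 5.0},
}

def euclidean_distance(a, b):
    """Pythagoras over every movie that both users have rated."""
    shared = [movie for movie in a if movie in b]
    if not shared:
        return None  # nothing in common, so no distance can be computed
    return math.sqrt(sum((a[movie] - b[movie]) ** 2 for movie in shared))

def similarity(a, b):
    """Assumed conversion: distance 0 gives 1.0, larger distances approach 0."""
    distance = euclidean_distance(a, b)
    if distance is None:
        return 0.0
    return 1.0 / (1.0 + distance)

def rank_by_similarity(ratings, target):
    """Other users sorted from most to least similar to the target."""
    scores = [
        (similarity(ratings[target], user_ratings), user)
        for user, user_ratings in ratings.items()
        if user != target
    ]
    return sorted(scores, reverse=True)

print(round(euclidean_distance(ratings["me"], ratings["friend_1"]), 3))  # 0.5
print(round(euclidean_distance(ratings["me"], ratings["friend_2"]), 3))  # 0.721
print(rank_by_similarity(ratings, "me")[0][1])  # friend_1
```

Only the movies both people have rated enter the sum, so a missing rating does not count as a zero. With more movies the same function works unchanged, because each extra movie is just one more dimension in the sum.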
The task is that I have to predict what my rating is going to be for Cool Runnings and for Baby Driver, because maybe that's the movie I'm going to see tonight, actually, on Saturday. Here we can see that some ratings are missing: four of my friends' ratings, plus mine. So what if someone like Otavio, who is the most similar to me, as we saw here, hadn't rated movies? Let's suppose that we have like 50 movies, not just five. And what if one person loves a movie that everyone else has rated low? A solution to this problem is to use the weighted average.

So let's go back to Saturday night. What movie should I watch? Pro tip: it's up to us to choose the recommendation threshold. We have this heuristic method, which is to take my average rating and only recommend if the prediction is higher than my average rating.

So, recommendations for me. Here is the code I wrote for the recommendation. It measures the Euclidean distance and multiplies the ratings by the similarity to get the weighted average. So what happens here is, I have put the similarity here, sorry, the predictions we have made. This person hasn't rated Cool Runnings, so it's blank. And I have multiplied these to get the weighted rating. And in the end, we get a prediction for Cool Runnings. For example, for me, I would rate Cool Runnings as 3.6. At least the algorithm says that. And the same thing for Baby Driver: I would rate it as 3.93.

So here, let me see. This is the recommendation function, and this is the output. As we have seen, it would recommend Baby Driver more than Cool Runnings. What does that mean? That the algorithm may recommend Baby Driver to me. So I should tell my users what has been predicted. Here, doing the prediction for everyone, I have seen that I would be recommending Cool Runnings like this, and Baby Driver like this, for Dani. These are the predictions.
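The prediction and threshold steps described above could look roughly like this in Python. It is a sketch under assumptions, not the talk's actual code: the ratings are invented, the function names are hypothetical, and the similarity is again assumed to be 1 / (1 + Euclidean distance). The predicted rating is the similarity-weighted average of the ratings from people who have seen the movie, and a movie is recommended only if that prediction is above the user's own average rating.

```python
import math

# Hypothetical ratings, NOT the table from the talk.
ratings = {
    "me": {"Movie A": 4.0, "Movie B": 4.0},
    "friend_1": {"Movie A": 4.0, "Movie B": 4.0, "Movie C": 5.0},
    "friend_2": {"Movie A": 4.0, "Movie B": 1.0, "Movie C": 2.0, "Movie D": 3.0},
}

def similarity(a, b):
    """Assumed similarity: 1 / (1 + Euclidean distance over shared movies)."""
    shared = [movie for movie in a if movie in b]
    if not shared:
        return 0.0
    distance = math.sqrt(sum((a[movie] - b[movie]) ** 2 for movie in shared))
    return 1.0 / (1.0 + distance)

def predict_rating(ratings, target, movie):
    """Weighted average of other users' ratings, weighted by similarity."""
    weighted_total = 0.0
    similarity_total = 0.0
    for user, user_ratings in ratings.items():
        if user == target or movie not in user_ratings:
            continue  # a blank rating simply does not take part
        weight = similarity(ratings[target], user_ratings)
        weighted_total += weight * user_ratings[movie]
        similarity_total += weight
    if similarity_total == 0:
        return None  # nobody comparable has rated this movie
    return weighted_total / similarity_total

def recommend(ratings, target):
    """Unseen movies whose predicted rating beats the target's average rating."""
    seen = ratings[target]
    threshold = sum(seen.values()) / len(seen)
    unseen = {m for user_ratings in ratings.values() for m in user_ratings if m not in seen}
    picks = []
    for movie in unseen:
        predicted = predict_rating(ratings, target, movie)
        if predicted is not None and predicted > threshold:
            picks.append((movie, predicted))
    return sorted(picks, key=lambda pair: pair[1], reverse=True)

for movie, predicted in recommend(ratings, "me"):
    print(movie, round(predicted, 2))  # Movie C 4.4
```

In this made-up data, friend_1 agrees with "me" completely (similarity 1.0) and friend_2 much less (similarity 0.25), so friend_1's 5 for Movie C dominates the prediction. Movie D is only rated by friend_2, gets a prediction of 3.0, and stays below the threshold of 4.0.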
And on Saturday night, I'm going to see Baby Driver. So problem solved. As we have seen here, my average rating is 3, so both of these can be recommended. Maybe in my Netflix or my streaming app, these two are going to appear like, watch it now. Cool Runnings and Baby Driver are likely to be among my good movies.

So the code is here if you want to check it and see how to make this algorithm: tiny.cc error python 2019. And thank you very much. I would like to hear feedback from you. And I'm here if you have any questions. Thank you.

Maybe one question here.

Thanks for the talk. I'm curious, at your company, what products or items are you recommending?

Nice. So, I'm a data enthusiast. My company is ThoughtWorks. We don't have much work on recommendation. We have been doing some inference work; for example, for a big media company we did gender prediction. So when a user enters the website, are they male or female? What age? Things like this. But recommendation, we haven't done it yet in a project.

Thanks.

Welcome.