Very happy to be here. This is my fifth time in Barcelona and it's by far my favorite city in Europe. So today I am going to be talking about the process of building a recommendation engine, specifically the early part of the process: I don't know what I'm doing, and I need to figure out what the hell I need to do to actually make this happen. And to add more complexity to that, I decided to turn it into a product so I could actually make money and retire early. So, I am Brian Sam-Bodden. Again, Full Stack Fest, what a wonderful show. I'm bsbodden on Twitter. I've been writing code since about 1986 on whatever I could find and get my hands on. I am from Panama, as you can tell by the British accent. And I live in Arizona, the state of acceptance and tolerance. I'm the CEO of a small company and we do Ruby, Rails, Java and whatever can keep the lights on. So this talk is a glimpse into the ML world, into the machine learning world, and what I've gone through to actually get anywhere. One of the funny things that happened at the beginning: I took an AI class in 1994 and I thought that after graduation, that's what I was going to be doing, AI. Everything was going to be AI. Fast forward a couple of years, I was doing tax software. And then after that, accounting software. And then I thought, this is not what I signed up for. So I've been trying for the last few years to bring that back and actually enjoy programming and have my brain really engage in what we do. So this talk is also a backgrounder on recommendation engines, and a bit of a primer on classification, specifically text classification. Classification really is the concept behind most of the really cool AI that's happening right now. Whether it's supervised versus unsupervised learning, or deep learning versus shallow learning, it is kind of at the core of everything. So the subject is really broad, and deep too. Just for this presentation, I probably read close to 40 PhD dissertations.
And it made me realize that I've forgotten 90% of all the math that I need to survive as a computer scientist. So that's on my agenda, to go back and relearn a lot of the math. This talk is not about everything that you could possibly need to know about a recommendation engine. It's also not a bloodbath of math, because otherwise you guys would be asleep. And there's not a lot of code, and there's a reason behind that. So let's get started by talking about what a recommendation engine is. Most of you interact with one pretty much every day. Most of you have probably built a simple one. Maybe a few of you have built a really complex one. But they are a fact of life. If you have a system that sells or promotes or, you know, displays any kind of information, the amount of information nowadays is growing at a pace that we cannot keep up with. And searching doesn't cut it anymore. Half of the time I spend 20 minutes trying to craft the search query on Google to find what I need. So now we need to actually invert the equation and have the systems be smart enough, and know you well enough, to push stuff in your direction. So in 2004, I read this book on the recommendation of a friend of mine. This book is called The Paradox of Choice. The premise of the book is that there are so many choices that we are completely overwhelmed. And the side effect of that feeling of being overwhelmed is that we do nothing. How many times have you been looking at the TV guide for a full hour, where you could have watched a whole movie, at least a short action American movie with very little dialogue, but you spend an hour searching for a movie? And at the end of that search, you go like, nah, you know, I don't have an hour to waste on a movie. But you spent an hour looking for it. That happens to me all the time. And it's an exercise in really knowing what's on the TV guide, but never watching anything. So a few things happen.
First of all, the Americans are really good at this, the illusion of choice and freedom. By having all these choices, we feel really free. But what happens is that we end up doing nothing, or we end up being overwhelmed, and then our happiness levels go down. Now, when that is what happens at a website, and you are the owner of that website, that means sales are going in the direction of that happiness. They're going down, too. As I mentioned, freedom and commitment: we feel that we're fighting those two forces. When we have to spend so much time crafting the idea of what we want, it's really hard to then get to the finish line and be like, okay, let's do it now. How many of you have put a book that you spent hours searching for in a shopping cart and abandoned that shopping cart? Show of hands? Exactly. That happens to me all the time. Sometimes it's because I'm a cheapskate. But sometimes it's because I feel that the effort that I put into looking for something went into the search and not the actual reward of getting it. It's like the destination versus the travel. I spent the energy in the travel and did not get the reward at the end. Also, according to this book, our brains create little, you know, you can think of mini-rule engines of how you go about behavior when you purchase something. So, for example, one of the things that I typically do, and this is what they call second order decisions, is that every time I go to a technical bookstore, I put a book in the shopping cart, let's say for a new language, Elixir. Everybody seems to be getting that one now. And what I do right away is I spend time searching for all the web frameworks built with that new language so I can pick the two books and buy them together. Because typically Amazon or some website will bundle them and give you a discount. So I'm like, I'm taking the whole ride together, I'm getting the language and the framework.
And what happens is that I end up putting those two books in there and then I start questioning my decision. And it stays in the shopping cart until the book's out of print. I've gotten those notices. It's like, you have a shopping cart that has a book from, you know, 2005. You probably want to, you know, get rid of it. And then the missed opportunities problem: every time that we spend so much energy in that journey, we question what other avenues we could have taken. If you go to a bakery and there's hundreds of cakes, you're likely to basically walk away because of the number of choices. I will eat the cake, I'm telling you that. But sometimes you say, well, you know, I haven't had milhojas in 20 years, but I really like Key lime pie. So then you basically end up in this paradox of choices. And if you take that into the shopping cart world, that means an abandoned shopping cart. So recommendation engines are built to combat this paradox of choice. They also help us with the exponential explosion of information. And they're now coupled with growing user expectations. Users want more. They demand that the systems know them. For example, I want Netflix to really figure out what my taste in movies is, which it seems not to have right. And they are probably at the leading edge of AI research in recommendations. So a recommendation engine, let's define it so we can actually move on to see how one of these things is put together. It is a system that predicts the level of interest that a user might have in an item. And an item could be pretty much anything. Let's think of it as a product on an e-commerce site, which is typically what we are talking about. But it could be an article on a news site. It could be another human being on some kind of dating site. There's a lot of different things that, as long as we can classify them, we can compare them.
And if we can compare them, then we can do certain things to be able to recommend them to you. So again, an item is anything: a book, a movie, a travel destination, articles, recipes, clothing. One of the great examples of how the paradox of choice is being combated in the US right now, and probably in a lot of different places, are the places that will basically tailor an outfit, a fashion persona, to yourself, and send you an outfit every month. My wife loves this place. I started trying it. They just don't seem to get this part of my body to kind of work correctly with their outfits. But other than that, it's amazing. They send you something every month. And whatever you return, those become basically un-likes in your profile. So now they can mine that information and, first of all, figure out what other users are like you in terms of body type, style, demographics. And they can use that to recommend different ensembles for people to wear. So it's a pretty interesting world that we're living in. One of the most common and famous recommendation systems, one of the first ones, is Amazon's. I love that I have a laser. I haven't had one of these in a long time, and I'm really excited about using it. So watch your eyes. So Amazon is one of the first e-tailers that actually started doing recommendations, and doing them fairly well. But they started with a very simple system. And at one point, that system did not scale. And I will talk about some of the things that they did to actually make it scale. And there's different types of e-commerce environments. You have places where you have hundreds of thousands of items and maybe thousands of customers. Then you have the ones where you have millions of customers and hundreds of items. And then you have the worst case scenario, well, not for your bank account, where you have millions of items and millions of customers.
And to be able to compute those recommendations becomes a pretty expensive task, computationally speaking. Another place that's actually at the forefront of recommendations is Netflix. Netflix uses information about the users and their behavior, and other users and their behavior, and also about the items themselves. Combining those two concepts is something that's becoming a trend. Early on, there were recommendation systems that did one or the other: either use the collective intelligence of the masses, or try to figure out who you are and what you want, try to figure out what is this item about, what is this book really about, and can we match those two? iTunes has basically kind of dropped the ball in the last probably five years. And it's weird because, you know, Siri and all the other systems have advanced so much, but the iTunes recommendations seem to have been left behind. Maybe that department is not really well staffed nowadays. Zappos is another place that actually uses very simple recommendations. As I played with a lot of e-commerce sites to learn about this, what I noticed about them is that they just showed me different color variations of the item that I was looking at, even though I have purchased things before. So I could be wrong, it could have been a fluke, but that's what happened. And there's no way I can actually wear those shoes. I'm not hip enough or young enough anymore. So Netflix, in 2009, they ran the Netflix Prize, and they offered $1 million to anybody that could beat them at the recommendation game by 10%. And they made machine learning sexy again. Of course, now everybody's a data scientist. I've seen people that, you know, have two years of programming and they're a data scientist. I'm a wannabe data scientist, but I can tell you, I'm not anywhere near having enough knowledge to call myself that yet. So, getting back into recommendation engines.
Now there's a couple of ways to do this. And I kind of beat around the bush mentioning the two ways, but I'm going to formally define them now. The first recommendation approach is the wisdom of the masses. It's called collaborative filtering. And it relies on collecting a lot of data about the actions of your users. So for example, if I'm on your website and I like something, that's an explicit way to collect the data. But let's say that I did a search, and I clicked on an item, and I browsed it for 10 seconds. They might gauge my interest maybe on how much I scrolled through the page, how long the page was still active, and say, well, maybe we should give a weight value to Brian's likelihood of enjoying this item. Now on the other side, you have content-based approaches. And the content-based approaches are where the modeling and the learning come into place. Some items are very rich in information. For example, a book. There's the possibility that you buy a book because of the title and the cover; the usual saying is, do not judge a book by its cover. Titles are very misleading. Then you open it, read the first chapter, and you're like, this is not what I thought I was going to be reading. But books have a lot of content. So if we can mine that content, or just part of that content, let's say the summary of the book and the table of contents, we probably have enough information to match it to things that you browsed or purchased before, without you having to actually open the book and read through a page or two. And that's the idea with some of the content-based recommendations: to really analyze what makes the item the item, match that to your interests, and see if you can actually create a list that you can push to the user that way. Obviously, somewhere in the middle is the sweet spot nowadays.
It's where you have a hybrid environment, where you concentrate on mixing all the data that you get from your users and the content of the actual items. There's also another way to classify recommendation engines, which is based on the algorithms that they use. Memory-based are the ones that rely on statistics and computations that work on pretty much the whole matrix of users and items. And you have model-based, which is where the machine learning comes into place, where data mining comes into place, where data enrichment comes into place. And again, mixing those two, the hybrid approach, is what seems to work the best for most websites and systems. So let's talk about machine learning. Machine learning is a very broad subject. Some parts of it are easy to digest. Some parts of it are really complex. So having a roadmap of basically how to go about machine learning (can you say that three times fast?) is actually hard. So what I did first was basically find an environment that fostered the understanding of what I was doing. I realized that writing code, and I wrote a bunch of code in Java, in Ruby, and I learned some Python to do some of this stuff, left me overwhelmed, completely overwhelmed. So first, I needed to find where to start. And I started with classification, which is basically supervised learning and predicting what bucket a certain item belongs in. Regression is also fairly easy to understand, at least at the surface level. And then it gets really complex. Feature selection, to me now, that's the hardest topic in recommendation engines. And we'll get into that. And then you have things like anomaly detection, finding what data point doesn't belong in the set.
And grouping: how can you, without knowing the makeup of the items, have them cluster themselves in a way that you can say, well, I know that I have three piles in here, now I can maybe investigate what makes those three piles different from each other. So, learning machine learning. My first recommendation is that you get the big picture first. Whatever you're doing, it's kind of like writing a software application. You basically write stubs for everything, you return simple things like strings, and you make sure that you have a front-to-back and return trip and that everything works. And then you can concentrate on tweaking the algorithms, and tweaking the parameters of the algorithms, and maybe having multiple algorithms compete and vote to see who gets the best result. All that stuff is going to be necessary to actually fine-tune what you do with machine learning. But if you try to do it from the beginning, it is really frustrating. When you write code, write really small code samples around a specific algorithm, with a known payload and a way for you to test the output. And focus on data first. This is one of the biggest mistakes that I made trying to basically get into this world. If you don't understand the data, if you don't manage the data, if you don't know where to get the data and how to clean the data, you're going to fail miserably. It doesn't matter how amazing your system is; knowing how to get the right data, and to get it in the right form, is the first step to success. So in terms of frameworks, as a programmer, obviously, it sounds like we need a recommendation engine to recommend machine learning frameworks. This is only a small sample of what's out there. There's hundreds of machine learning platforms, frameworks, standalone algorithms that you can actually use. But like I said, first focus on learning and understanding the whole process. One that's becoming the hot commodity is TensorFlow.
And also, if you're learning and you're a Pythonista, I would say start with scikit-learn and move to TensorFlow. If you're in the Java world, down here I put basically the collection of things that are Java related. There's some .NET stuff and also, obviously, Scala, Ruby, and I believe this is Python too, but there's a few things out there that can help you basically get started. So my approach was to pick something that made it easy for me to grasp the whole process. And for that, I picked this product called RapidMiner. RapidMiner is going to look like an IDE that a lot of you ran away from in the past. Those Java developers that became Ruby developers probably remember Eclipse. So when I first opened RapidMiner, I'm like, oh no, I had like acid reflux, and I started sweating, and I'm like, no, no, I don't want to go back. But it's a magnificent product. It is amazing. And it actually has taught me more about machine learning than writing a lot of code, which sounds counterintuitive, but it's the way that it worked for me. So RapidMiner, again, is an Eclipse-based application that has an amazing set of operators to do basically everything, starting from classification. Even right before this presentation, I found a recommendation extension that I could use to test some of the systems that I was building in parallel. So give it a chance. Obviously, you need Java installed on your machine. So if you can live with the virus, then this is coming along for the ride. Just kidding. The Java VM is an amazing platform, and I stand behind it 100%. Okay, so let's talk about one of the first types of recommendation engines, which is the ones that rely on collaborative filtering, sometimes known as the click-based recommendation engines. And again, the idea is to collect large amounts of user feedback, both explicitly and implicitly, and infer preferences for the specific users that way, so you can recommend items.
So, feedback. This is a topic that is actually dicey. You can get into a lot of privacy issues, creepiness. I, for example, when a system recommends an item to me, I want to know why. You know, some people are okay with the mystery. It's like, how did they know that I like, you know, panda bear outfits and swords? That's a really strange combination there, but that happens in my brain. But I'd like to be presented an explanation of why the item was recommended to me. So explicitly, you can like something, you can rate it, you can write a review. Sentiment analysis is also another hot topic nowadays. Somebody was talking about NLP, natural language processing. When you have sentiment analysis, natural language processing typically comes into play to figure out if it's a good review or a bad review. And how do you tell sarcasm? How do you tell basically colloquialisms from the area that the user who wrote the review is from? Those things are hard problems. Sometimes I can't even tell when other human beings are being sarcastic, you know, and we have this supercomputer in our skull. Imagine trying to do this with a collection of tags and things like that, or just words and tokens. Explicitly, users tend to present feedback that represents the ideal way that they see themselves. While implicitly, users just do. And when you get implicit feedback, you're learning the true nature of that user. Implicit feedback, at the highest level, could be purchasing something. That means I really like it. I really need it. Maybe I really need it but I don't like it; that's a whole different topic. Searching for something: you can, based on a query, figure out the likelihood that items that match that query are to be liked by the user. Browsing: how long do you spend on a page? Did you scroll through the page? Did you read the text on the page? Obviously, all of us have been trained by end user agreements to scroll really fast and click OK.
So can you detect that? Do you know if the person actually read it? Those are the problems that at the UI front we're actually facing. And then, of course, taking into account positive and negative feedback. So here's an example of a review on Amazon that I forgot that I'd done. In the US, we have Halloween, which is kind of a light version of Dia de los Muertos in Mexico. In Spain, I don't know what you have. I know that you guys like to throw tomatoes at high speed at each other. That sounds a little more painful than Halloween. So here's a couple of reviews. This one, for example: from the text, it would be pretty hard for a machine to tell whether the review is good or bad. So sentiment analysis would have to come into place. But Amazon is smart enough to basically bundle that with a star rating and also a direct sentiment expression. So in here, you can see that when I bought my Halloween Fidel Castro outfit, it did not go well. The neighborhood in Arizona did not like that I was wearing this, but that's a whole different story over beers tonight. So you can see that I have a title, I have a star rating, I have a sentiment. Here's, for example, a good review. Somebody that I knew wrote a book. And of course, as a good friend, I said, I love the book. I read the first chapter. So that's another problem. We lie. We lie a lot. We lie to ourselves. We lie to computers. We lie around the world. If you could really have a counter, like maybe a bell that goes ding every time you say a lie or a mild lie in a day, you'd be impressed. And as somebody that has a couple of kids, they lie to me all the time. And I said, well, you know, I cannot get on my high horse. It's like, I just lied to you about something else like five minutes ago. So it's human nature to aggrandize, to paint a picture of ourselves that is the ideal picture, and to basically provide information that is not very accurate for machines to deal with.
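Implicit signals like the ones just described (purchases, searches, browse time, scrolling) are often collapsed into a single numeric preference score per user and item. Here's a minimal Python sketch of that idea; the event types and weights are made-up assumptions for illustration, not values from any real system:

```python
# Hypothetical weights for implicit feedback events. Real systems would
# tune these (or learn them), and would also handle negative signals
# like returns or quick bounces.
EVENT_WEIGHTS = {
    "purchase": 5.0,          # strongest signal of interest
    "add_to_cart": 3.0,
    "click_from_search": 2.0,
    "long_browse": 1.5,       # e.g. stayed on the page and scrolled
    "short_browse": 0.5,
}

def implicit_score(events):
    """Aggregate one user's events on one item into a single score."""
    return sum(EVENT_WEIGHTS.get(kind, 0.0) for kind in events)

# A user searched, landed on the item, lingered, and added it to the cart.
events = ["click_from_search", "long_browse", "add_to_cart"]
print(implicit_score(events))  # 2.0 + 1.5 + 3.0 = 6.5
```

Scores like these can then sit alongside explicit star ratings in the utility matrix discussed next.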
So, to deal with this type of filtering, you start with a matrix. This matrix is called a utility matrix. And it's a matrix of your users and your items. At the intersection of those, you'll basically have either the star rating or the comments or whatever the user has provided for that specific item. So you basically want to use the ratings of other users to find similar users, or similar items. So there's two approaches. And they scale differently based on your environment. If you have a lot of users and a lot of items, or very few items and a lot of users, that determines which approach you will use. But again, you want to figure out items that are very similar to the items that you liked before, and recommend those items. Or you want to figure out what other users like you have liked before, and recommend those items. So in the utility matrix, you have user-item ratings. Let's say that these are star ratings. And the question marks are the users that did not provide a rating for a specific item. Now, this matrix that I'm putting here as an example is very densely populated with ratings. What really happens is that you have a lot of question marks and very few numbers. So it is a problematic environment to basically find those ratings. Of those two styles, let's go with the first one, which is a user-based recommender. That means: find users similar to you, find what they like, and recommend those things, or things that are like those things. So again, it relies on finding similar users. And the similarity between the users is based on the items they have in common. So for example, if you liked a specific book, let's say a Ruby on Rails book, and you provided a three-star rating and I provided a four-star rating, that is a similarity between us. And the more things that we rated in common, the more similar we are. So you basically do the pairwise comparison of all your users.
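The utility matrix and the pairwise-comparison idea above can be sketched in a few lines of Python. The users, items, and ratings here are invented for illustration (missing ratings, the question marks, are simply absent keys):

```python
# A toy utility matrix: users on the rows, items on the columns.
# A missing rating is just an absent key, mirroring the sparse
# question-mark-filled matrix from the talk.
ratings = {
    "ana":   {"rails_book": 3, "elixir_book": 5},
    "brian": {"rails_book": 4, "ml_book": 2},
    "carla": {"elixir_book": 4, "ml_book": 5},
}

def depivot(matrix):
    """Flatten the matrix into (user, item, rating) rows, like the
    de-pivoted table that gets fed to a recommender."""
    return [(u, i, r) for u, items in matrix.items() for i, r in items.items()]

def common_items(matrix, u, v):
    """Items two users have both rated -- the basis for their similarity."""
    return set(matrix[u]) & set(matrix[v])

print(depivot(ratings))
print(common_items(ratings, "ana", "brian"))  # {'rails_book'}
```

With this shape in hand, comparing every pair of users is just iterating over `common_items` for each pair, which is exactly why the approach gets expensive as the user base grows.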
So again, if you have a lot of users, this becomes computationally expensive. But you can now have a similarity measure between users. And now you can look at the items that, for example, you liked or purchased that I never rated or have seen before. And you can use those as recommendations for me. So again, it relies on the items that we have in common to find similarities between the users. So in the first example, we're going to take that utility matrix and we're going to turn it on its head, so we can have a list of user IDs, products, and ratings. We de-pivot that table. And we have the ratings for the different items given by different users. And I get question marks for the missing items. And the goal is to predict those question marks. Some systems will basically do this in batch mode overnight. You're probably going to classify your users: the high-end purchasers, the ones that browse but never buy, the ones in between that buy something from time to time. And then you start with the high-value customers, figure out similarities between those and other customers, and that's how you start your recommendation list. So in this utility matrix, you can see that we actually use the rows to find similar users. Our row-wise comparison is what we use to find the similar users. Now, to find those similar users, we need to have a measure of similarity. So how do we calculate similarity between two users, or two items? It could be users or items; we need a mathematical way to do that. One of the simplest algorithms is called k-nearest neighbors, or KNN. KNN is kind of a centroid type of approach where, if we are looking to figure out what users are similar to this user, we use the K factor, that's where the K part of the KNN algorithm comes from, the k nearest neighbors. So we're going to use a factor of three nearest neighbors.
So in here, you can see that as I draw a circle around those three nearest neighbors, I'm going to basically find these three users, and then I can take the average of what they like and use that to compare them to me or to make recommendations. Now, you can see that the classification, or the similarity, will change based on how many neighbors you pick. So this is one of those places where, as I was building systems with some of these technologies, I had to grab that circle and go wide, then narrow. Sometimes there might be clusters, okay? And a small number will basically give you a pretty good answer. But as you open that circle, now you're grabbing two clusters of different types of users, and the comparisons become harder. So there's a lot of different aspects to this. You can have a weighted average of everything. You can have the users then do the k-nearest neighbors recursively to a certain depth, to basically average the different clusters. So it can get pretty hairy pretty fast. And that's why I like a platform like RapidMiner to explore this. So let's actually look at this in RapidMiner. What I did for this user-based collaborative filtering is to load some data from the database that looks like the matrix that I showed you before, the de-pivoted matrix. I find the users using nearest neighbors. And I use the neighbors' ratings to predict the ratings that the user in question would give the items that he has never rated before. So here's the prediction part of the equation. RapidMiner is a graphic environment. So you start again with the big picture. And you want to figure out how to have the pipeline of machine learning processes chained together correctly. And then you can go into the detail of each node and figure out the details of how to make it better. And it takes hours. I thought this was going to be easy.
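The neighbor-averaging idea above can be sketched in plain Python: measure similarity between users over their co-rated items, pick the k most similar users who rated the target item, and take a similarity-weighted average of their ratings. This is my own toy illustration with made-up data, not RapidMiner's implementation:

```python
import math

def cosine_sim(ru, rv):
    """Cosine similarity between two users' rating dicts, computed
    over the items they rated in common."""
    common = set(ru) & set(rv)
    if not common:
        return 0.0
    dot = sum(ru[i] * rv[i] for i in common)
    nu = math.sqrt(sum(ru[i] ** 2 for i in common))
    nv = math.sqrt(sum(rv[i] ** 2 for i in common))
    return dot / (nu * nv)

def predict(matrix, user, item, k=3):
    """Predict `user`'s rating for `item` as the similarity-weighted
    average over the k most similar users who rated the item."""
    neighbors = [
        (cosine_sim(matrix[user], other_ratings), other_ratings[item])
        for other, other_ratings in matrix.items()
        if other != user and item in other_ratings
    ]
    neighbors.sort(reverse=True)          # most similar first
    top = neighbors[:k]
    denom = sum(sim for sim, _ in top)
    if denom == 0:
        return None                       # no usable neighbors
    return sum(sim * r for sim, r in top) / denom

# Invented data: u1 never rated item "c"; u2 and u3 did.
demo = {
    "u1": {"a": 4, "b": 5},
    "u2": {"a": 4, "b": 5, "c": 3},
    "u3": {"a": 1, "b": 1, "c": 5},
}
print(predict(demo, "u1", "c", k=2))
```

Playing with `k` here reproduces the circle-widening effect from the slide: a larger k pulls in less similar users and dilutes the prediction.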
And it's painstakingly slow and difficult, because you've got to go into every node and change values and then really keep annotations of what you did. So the whole scientific method comes back. I have notebooks now where I write basically every observation and then figure out, you know, what my control group is and all that stuff. I feel like I'm doing clinical trials for something. So data science is not as sexy as they paint it to be. It's a lot of work. So as you can see, I'm reading data from the database. One way in machine learning to basically test whether something is working correctly is to get a big chunk of training data, where for example you have things already figured out, and to split those examples into training data and test data. So you train your algorithms with some training data. In this case, I'm going to use an algorithm that's based around the KNN algorithm and compares users in that utility matrix. As long as you provide the matrix in the right shape, it will do the right thing. And then I'm going to evaluate the predictions. I'm also going to test the performance, the recall quality and the accuracy of my classification system. And I'm also using the test data to predict ratings once I know that the system has been trained. Okay, so let me quickly go to RapidMiner. In RapidMiner, I have a repository. And in here, I'm going to open my user-based recommender. And this is what you saw on the slide. The thing that gets interesting: for example, if I click on the user KNN, you can see that there's some parameters in here. Hmm, that's not what I was expecting. Thank you. I was just about to freak out. When you're up here and the demo gods don't smile on you, you're like, oh no. Okay, going good. I have actually higher resolution. All right. So in here, if I click on that user KNN node, you can see that I have a K value of five.
And I tried to basically play with all these values a few times to figure out how to get the best results. Also notice that there's all kinds of choices on how to calculate the correlation. So when you have these two vectors in space, you can use Pearson correlation, or you can use cosine similarity, to figure out how similar two things are. The cosine similarity is one of the most used ones, and it relies on basically calculating the cosine between two multi-dimensional vectors. And we'll talk about that in a few minutes. Another thing to look at is this performance node. In the performance node, you can see that I have some performance metrics. So for example, here's the rating range that I'm going to use. And I also have a way to apply the model, so I can evaluate the predictions and also apply that same model to the test data. Notice in here that the model that was created is basically passed from the user KNN to this evaluate predictions node. Every node in RapidMiner has an output port that you can use to chain that model somewhere else to use it. You can also persist the models to disk. So now you have an already trained model that you can retrieve and actually play with. And this was invaluable to be able to understand some of these systems. So let me go ahead and run this so you can see what's going on. I have a small set of data; obviously I couldn't pick large sets of data for the demos, because otherwise we'd be here all night. As you can see in here, we have our example set loaded, which has the user IDs, the product IDs and the ratings. We also have the test data, and we have the performance of our system. That's the root mean square error, which is a measure of basically how close you are to success. So with RapidMiner you can really do some of this stuff in a very quick fashion. When you drop the components, some of the default values are the default values that the overall data science community would pick.
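That root mean square error is simple enough to compute by hand; here's a quick Python sketch with made-up predicted and actual ratings, just to show what the number RapidMiner reports actually measures:

```python
import math

def rmse(predicted, actual):
    """Root mean square error between predicted and true ratings.
    Lower is better; 0 means every prediction was exact."""
    errors = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(errors) / len(errors))

# Hypothetical predictions vs. the ratings the users actually gave.
print(rmse([3.5, 4.0, 2.0], [4, 4, 1]))  # about 0.645
```

Since ratings live on a known scale (say, 1 to 5 stars), the RMSE is directly interpretable: an RMSE of 0.645 means predictions are off by roughly two-thirds of a star on average.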
So you get a baseline for behavior, and then you can start tweaking things around. Alright, so one of the issues with this. You noticed that there were no empty values there anymore. All the question marks were replaced with ratings, but those ratings in the test data were predictions. Now we can use those ratings to find the highest rated items and recommend those to you. The problem with this is that it performs poorly if you don't have enough ratings, obviously. And like I mentioned before, that matrix is typically pretty sparse. There are only a few ratings; customers don't like to be told to do something. So you will have very few ratings, or sometimes they go to the extremes. Either I hate it, or I bought it already, so I might as well give it a five. People don't often take the time to think in the middle range. That's one of the problems with user-generated ratings. Sentiment analysis can help, but the problem there is that typically only the super happy or super angry customers leave text-based reviews. The other problem is that it's computationally expensive to calculate the pairwise comparisons between all users. And if a user changes, you have to recalculate. So for example, in my Amazon recommendations, there are still Java books that are out of print. I'm getting a Bruce Eckel book, which I remember I used to carry around in the 90s. I probably use it as a doorstop now, or to beat attackers with. But it's still coming up in recommendations. It's like, you might want to buy this. It's like, not really. All right, so that was one type of collaborative filtering, the older one. The other is the item-based recommender. And item-based recommenders are actually very similar. The big difference is that we're now finding similarities between the items rather than the users. In that matrix, we now do things column-wise. We use a column as a template to find all the columns that are similar to it.
And now we can know the similarity between items. So they're very similar approaches. In RapidMiner, it's pretty much the same. You just have a different component that gives you this implementation. Now, building a recommender component like that Item KNN, or that User KNN, is pretty extensive internally. So this is what I use at the high level to get an idea of what other things I need to do correctly. How to clean my data. How to get the data from the right places. And then maybe I can replace one of these components with my own implementation. So we are at minute 41, so let me move faster through what we have in here. We don't have much left, but I'm already at the borderline. So, the advantage of item-based filtering is that items do not tend to change. Users change. A book that was written is that book. Maybe opinions about the book will change over time; some books that we thought were great are not great anymore. Remember the J2EE patterns book? Yeah, I probably burned some millions of dollars of some employers with that book. But let's talk about that over beers. Smaller data set: dealing with a smaller data set is actually easier, it's faster to compute, and you don't have to recompute as often, because items do not tend to change. So, content-based recommendations are the ones that really excite me. These are the ones where you actually do the machine learning: you turn the problem into a machine learning problem and classify items according to their content. You also classify users according to their content. How do you get the content for the users? You have to profile them. There are a lot of different ways to do it. Demographics, for example: age, geographical location, things of that sort.
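The column-wise idea is the same cosine math as before, just applied to item columns instead of user rows. A toy example (the matrix values are invented):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity, guarding against all-zero columns.
    den = np.linalg.norm(a) * np.linalg.norm(b)
    return float(np.dot(a, b) / den) if den else 0.0

# Same shape as the utility matrix: rows are users, columns are items.
ratings = np.array([
    [5, 4, 1],
    [4, 5, 1],
    [1, 2, 5],
], dtype=float)

n_items = ratings.shape[1]
# Item-based filtering: compare *columns* of the matrix to each other.
item_sim = np.array([[cosine(ratings[:, i], ratings[:, j])
                      for j in range(n_items)] for i in range(n_items)])
print(np.round(item_sim, 2))
```

Here items 0 and 1 come out nearly identical (the same users liked both) while item 2 is far away; because this item-item matrix changes slowly, it can be precomputed and cached.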
And then also, if users provide that information and they rate items similarly, you use collaborative filtering to figure out what the tastes of those other users are and transfer those into the profile of the unknown user. So there are a lot of different ways to actually mix the two. A good example of a content-based recommender system is Last.fm, because it really analyzes the music deeply. They figure out beats per minute, the level of the drums versus the guitars, things like that. And they use that very fine-grained information to create multi-dimensional vectors to drive the recommendation process. Pandora is another big one, and IMDB. But again, the idea is that you create a predictive model out of the information that you have, filling in the blanks that you don't have, maybe with other users' information via collaborative filtering, and the profile of a user becomes a classifier. So I'm going to skip forward a little bit. We're going to skip the vector space model. Well, I'll give you the last slide in here. I wasn't going to go really deep into the vector space model, but you get a little bit of the math behind this stuff. Again, think about vectors in multi-dimensional space, and you want to figure out how similar they are. So two that are close together are very similar. These two, less so. These two, I don't know, and I'm doing the YMCA dance now. So: understand thy data. That is the first thing about building recommenders. Data cleansing is an arduous job. Sometimes I spend more time cleaning the data, figuring out where to get the data, and how to enrich the data than actually processing the data. And more data is almost always better. Peter Norvig, who is the head of AI at Google right now, will tell you that simple models trump complex models if you have more data. So whoever has the most data typically wins. That's why Google is still winning.
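As a rough sketch of that content-based idea, here is a toy profile-versus-item comparison. The feature names (tempo, drum level, guitar level), the track names, and all the numbers are hypothetical, not Last.fm's actual features:

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

# Hypothetical content features per track: (tempo, drum level, guitar level),
# each scaled to 0..1.
tracks = {
    "track_a": [0.9, 0.8, 0.7],
    "track_b": [0.85, 0.75, 0.8],
    "track_c": [0.2, 0.1, 0.9],
}

# The user profile is just the average feature vector of the items they liked.
liked = ["track_a"]
profile = [sum(xs) / len(liked) for xs in zip(*(tracks[t] for t in liked))]

# Recommend unseen items ranked by similarity to the profile vector.
recs = sorted((t for t in tracks if t not in liked),
              key=lambda t: cosine(profile, tracks[t]), reverse=True)
print(recs)  # → ['track_b', 'track_c']
```

The "profile becomes a classifier" point is visible here: scoring a new item is just comparing its content vector against the profile, with no need for other users' ratings.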
Again, at the bottom of all this, there are classification algorithms. And I'm going to move along again. In the machine learning world, there are a lot of different algorithms. For example, when we do deep learning, we're talking about neural networks. Shallow learning used to be like two or three layers. You guys remember perceptrons in school? That would be like a three-layer neural network. Deep learning now has hundreds of layers of artificial neurons in between. One of the hottest topics right now is random forests. You create a forest of decision trees, you create those randomly, and then have them vote to see what class a specific piece of text, or a specific disease description, say, fits into. Obviously, my timing was not ideal. But the fun part of this presentation, the fun part to me anyway, and hopefully to you too, is that I actually decided to use world knowledge to enhance my classification. What I did was use Wikipedia to create a corpus of knowledge about a concept, and then be able to grab a specific article and try to classify it with that world knowledge. The idea here is that you have something like Wikipedia, which is curated and updated sometimes by quote-unquote experts, although I've updated articles there and I'm no expert. But the idea is that now you have really good training data. And having that fine-grained semantic representation allows you to build better classifiers. Also, Wikipedia provides a built-in taxonomy. When you look at a Wikipedia page, you have the parent topic and the children topics, and you can really use that information to create features that are more relevant. One of the problems with most of the classification examples that I did is that I had thousands of features and some of them were irrelevant.
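To make the voting idea concrete, here is a deliberately tiny "forest" of one-level decision stumps. Real random forest implementations grow full trees over random feature subsets; treat this purely as an illustration of bootstrap sampling plus majority voting, with made-up word-count features:

```python
import random
from collections import Counter

# Toy data: each row is (features, label). Features are invented word counts,
# [count_of("java"), count_of("model"), count_of("training")].
data = [
    ([9, 1, 0], "java"), ([7, 0, 1], "java"), ([8, 2, 0], "java"),
    ([1, 6, 7], "ml"),   ([0, 8, 5], "ml"),   ([2, 7, 6], "ml"),
]

def train_stump(sample):
    # One "tree": pick a random feature, split at its mean over the sample,
    # and remember the majority label on each side of the split.
    f = random.randrange(len(sample[0][0]))
    thr = sum(x[f] for x, _ in sample) / len(sample)
    left = [y for x, y in sample if x[f] <= thr]
    right = [y for x, y in sample if x[f] > thr]
    maj = lambda ys: Counter(ys).most_common(1)[0][0] if ys else "ml"  # arbitrary fallback
    return lambda x: maj(left) if x[f] <= thr else maj(right)

random.seed(42)
# The "forest": stumps trained on bootstrap samples (drawn with replacement).
forest = [train_stump([random.choice(data) for _ in data]) for _ in range(25)]

def classify(x):
    # Each tree votes; the majority wins.
    votes = Counter(tree(x) for tree in forest)
    return votes.most_common(1)[0][0]

print(classify([8, 1, 1]))  # a Java-looking document
```

Individually the stumps are weak and noisy, but because each one is trained on a different random resample, their errors mostly cancel out in the vote, which is the whole point of the ensemble.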
So for example, when you process text, you tokenize and get words out of the text, you remove stop words, you use dictionaries to clean it up, but you still end up with a lot of words that do not provide any value and just add to the computational burden of the algorithms. One of the things that I did was create a crawler to go to Wikipedia for specific topics, get that information, and process those documents to create vectors in this hyperspace. Now, once I got the text out of those Wikipedia articles, I had 150 training examples just for this, and 18 test pages. So it's a small data set. When I did it with more data, it worked better. I trained the simplest classifier, a KNN classifier, and I classified a collection of web pages. The process, which is actually fairly simple in RapidMiner, really gave me an insight into how things work and should work. I stored that model so I could reuse it later. Then I stored multiple versions of the model and wrote a gigantic for loop to run the differently fine-tuned models against the same data programmatically, so I could really get more information about it. Now, the data set that I used for the example in the presentation came back with amazing recall: 100% accuracy for the one class, if an article was a Java-related article, and 99.4% if it was machine learning. So once I did this, I thought, man, you are a data scientist. You are good at this. You drag and drop things, you connect them, and it just worked amazingly. Wow, I'm a genius. I need to send my recipe to Google. Well, well, well. I ran the whole damn thing on a small sample set of data and noticed my Java articles were classified correctly. So this is the category that they actually are, and this is the prediction. And then it started failing around here. All my machine learning topics failed.
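The tokenize / stop-word / KNN pipeline described here fits in a few lines of Python. The two-class corpus below is a made-up stand-in for the Wikipedia training data (the real set had 150 pages):

```python
import math
from collections import Counter

STOP_WORDS = {"the", "a", "is", "of", "and", "to", "in", "for"}

def vectorize(text):
    # Tokenize, lowercase, drop stop words; the vector is a bag of word counts.
    words = [w for w in text.lower().split() if w not in STOP_WORDS]
    return Counter(words)

def cosine(a, b):
    shared = set(a) & set(b)
    dot = sum(a[w] * b[w] for w in shared)
    na = math.sqrt(sum(c * c for c in a.values()))
    nb = math.sqrt(sum(c * c for c in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Tiny stand-in corpus: (document text, class label).
training = [
    ("java virtual machine bytecode and the jvm garbage collector", "java"),
    ("java classes interfaces and the standard library", "java"),
    ("training a model with labeled data and features", "ml"),
    ("classifiers features and supervised model training", "ml"),
]

def knn_classify(text, k=3):
    # Rank training docs by cosine similarity, then let the top k vote.
    v = vectorize(text)
    scored = sorted(training, key=lambda ex: cosine(v, vectorize(ex[0])), reverse=True)
    return Counter(label for _, label in scored[:k]).most_common(1)[0][0]

print(knn_classify("the jvm and its bytecode", k=1))  # → java
```

This is also exactly where the overlap problem bites: a page about a Java machine learning library shares vocabulary with both classes, so its nearest neighbors straddle the boundary.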
The problem is that when you have a taxonomy, building that taxonomy requires a lot of curation. How far apart are those concepts? The concepts overlap sometimes. There are a lot of machine learning libraries in Java, and some of those articles in my training data set made those two groups overlap. And I found articles that had more text on the machine learning side than on the Java side, which made some of the terms in the Java world more important. So it screwed up my whole data set. And this is where I say: understand the data, understand your taxonomies or folksonomies, and that's how you will get by in machine learning. Otherwise, you're going to pull your hair out and scream at your computer. I did not do that. That's just a web search. I typed "bullet-riddled laptop". So again: overfitting. Classifier parameters need to be tweaked. Try different classifier algorithms. I tried a bunch of them. With this small data set and the training data that I provided, they're still not great. So there's some data engineering that needs to happen ahead of time. So again, why would you build one of these things? It's not really a product. It's a feature. So in your applications, really, don't build a recommendation engine, even after I told you I'm building one. I want to build one as a product. If you're a web developer and you have an e-commerce site... let's say you're Shopify. Shopify might be able to build their own and integrate it with the product. But if you are putting up your own e-commerce store, don't spend three years building the recommendation system and six months building the store. Okay? Find somebody to actually do this for you or to help you. So again: more data beats clever algorithms, but better data beats more data. And that's what I ran into. I need better data and I need a better classification of my environment. So once I went through all this, I came up with an architecture that I think makes sense for what I want to build. And I've built some pieces of this.
So I started building this content analyzer first, because this was the fun stuff to do. But then I realized: remember that circle in the middle of the machine learning squares that said feature engineering, or feature extraction and selection? That is the hardest part. So what I'm focusing on right now is building this taxonomy/folksonomy API that allows you to manually build that hierarchy or that list, and also assists you in building that list. I think this is where I lost the battle in this example, but this is where I would make a product like this useful to people. The rest of this stuff is easy; most of it is canned. And you can beat most of the commercial vendors. The three that I found that provide recommendation engines use the simplest algorithms, and most of them just take data from the system as user actions: user browsed this, user purchased this, user rated this. They do collaborative filtering and don't provide any content-based recommendations. So my idea is to mix the two, but also provide, or provide assistance with, the hardest part, which is the feature engineering. All right. So now go and learn machine learning. Fine, yes, this is me wearing the infamous Fidel Castro outfit. My wife begged me not to put this picture on the slides, but hey, in case something went wrong, I needed the self-deprecating humor to win your hearts. Or I'll buy you a beer later. The Pragmatic Programmers are one of our customers, and they agreed to let me use their data for some of the examples, but they also provided a coupon for books. If you use it, just don't tweet the coupon first, because otherwise people will abuse it, but use it. It's fullstackfest2016, 20% off until September 30th. Thank you.