My name is Ian Clarke. I'm the CEO of Uprizer Labs. I'm also the founder of the Freenet Project, an anonymous peer-to-peer network that I've spoken about here more times than I can count over the last 10 years. This has nothing to do with that. Let's talk about hacking desire: figuring out what people want. Why would anyone want to do that? Well, this is really fundamental to the internet in many ways, to getting the right information to the right people. It's essentially what companies from Google to Facebook are all based on: figuring out what people want and providing it to them. My background is in artificial intelligence, so I was very interested in taking that type of approach to this issue. I'm going to be talking about this broad area and some existing approaches to the problem, techniques that are generally called collaborative filtering or recommendation engines, which is the umbrella term for all of this. I'm going to talk about how they work, what's wrong with the way they work, and present a pretty neat approach that I've been working on for a while. What can you do with this? Well, we actually see a lot of examples of recommendation engines in use on the internet today. You can use them to figure out what kind of music people like. This is an interesting problem because people typically have a lot of trouble articulating exactly what their musical tastes are. The idea of an algorithm that can look at your behavior and essentially reverse engineer your music preferences is pretty attractive. I'm sure most, if not all, of you are familiar with Netflix, and that's one of the most prominent uses of collaborative filtering online: they use it to recommend movies. This approach can also be applied to search, for example, as an alternative to using something like PageRank to rank search results.
You could use a recommendation engine like the one I'm going to describe to rank search results in a way that is tailored to you specifically, based upon your past behavior. These techniques can be and are being used to target advertising, behavioral advertising. Some people see that as scary; I think it's a double-edged sword. On the positive side, I think most people would agree that if you're going to get advertised to, it's better that the ads are actually relevant to you. These techniques can also be and are being used for dating websites, finding people who are likely to form compatible matches. I've been doing quite a bit of work with a company called eHarmony, which most of you are familiar with, who have done a lot of research into this, and they're continuing to use this type of approach to figure out what pairs of people are likely to achieve long-lasting relationships and, in their case, hopefully get married, because that's basically their raison d'être. And also product recommendations, which is similar to advertising. What are the common approaches to this problem today? The simplest one is what I'll refer to as item-based collaborative filtering. A little bit of terminology: typically the people are called users, and the things being recommended, whether it's music or movies or whatever, are called items. An item-based collaborative filter basically looks for similar items, typically on the basis of "a lot of people who bought this thing also liked this other thing," and then if you express an interest in a certain item, it will recommend other items that it thinks are similar to it. This is very simple to implement. It's also easy to explain to end users; it's pretty transparent what's going on. On the negative side, it's quite naive. Really the only input to this algorithm is what item you are looking at right now, and that's all it knows about you when it makes its recommendations. There also tends to be a problem here of limited diversity.
For example, when this technique is applied to things like news stories, what it will tend to do is, let's say you have a news story about the war in Iraq, it will recommend a bunch of other news stories on the same or similar topics, and that's typically not what people want. People want some degree of diversity, especially if they're reading news or listening to music. A more sophisticated and more powerful approach is user-based collaborative filtering. In this approach you look for users that are similar, and the way you find similar users is you look at what one person rated and what another person rated on the same items, and you look for users where those ratings agree. For the statisticians among you: typically you'll use a Pearson correlation coefficient to determine the similarity between users. Then, once you've identified the users that are similar, you look for things that they liked that your candidate user hasn't seen yet, and that's what you recommend to them. On the upside, this can develop quite a nuanced view of what your interests are: if it can find 20 or 30 other people that are similar to you and then use those other people essentially as recommenders for you, that can be very, very effective. It's also quite easy for end users to understand: people like you liked these things. It's not hard to get your head around. There are some problems, though. Firstly, it requires a lot of data per user to accurately determine user similarity. If you and I have just been using Netflix for even a couple of weeks, we've maybe rated four or five movies each. How many of those are we both likely to have rated? Well, at most one or two, really.
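To make the user-similarity idea concrete, here is a minimal sketch of a Pearson correlation computed over the items two users have both rated. The users and ratings below are invented for illustration; a real implementation would also weight the correlation by how many items overlap.

```python
# A minimal sketch of user-based similarity: Pearson correlation over the
# items two users have both rated. The users and ratings are invented.
from math import sqrt

def pearson(ratings_a, ratings_b):
    common = set(ratings_a) & set(ratings_b)   # items both users rated
    n = len(common)
    if n < 2:
        return 0.0                             # too little overlap to judge
    mean_a = sum(ratings_a[i] for i in common) / n
    mean_b = sum(ratings_b[i] for i in common) / n
    cov   = sum((ratings_a[i] - mean_a) * (ratings_b[i] - mean_b) for i in common)
    var_a = sum((ratings_a[i] - mean_a) ** 2 for i in common)
    var_b = sum((ratings_b[i] - mean_b) ** 2 for i in common)
    if var_a == 0 or var_b == 0:
        return 0.0
    return cov / sqrt(var_a * var_b)           # ranges from -1 to +1

alice = {"Robocop": 5, "The Matrix": 5, "Notting Hill": 1}
bob   = {"Robocop": 4, "The Matrix": 5, "Notting Hill": 2}
print(pearson(alice, bob))                     # strongly positive: similar tastes
```

With only one or two co-rated items, the correlation is meaningless, which is exactly the sparse-data weakness described above.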
One or two ratings of the same movie: the fact that I gave Robocop a five and The Matrix a five-star rating and the fact that you did too. Okay, we're probably both geeks, but that's not a very nuanced perspective on how we are similar. So these things require a very dense data set to be effective, and in most situations you really don't have a dense data set, because most people are not Netflix; they're not Amazon, they're not Google, and they need to make do with a lot less data. Also, this approach can be hard to scale: a naive implementation, where you check every user against every other user, has O(n²) scalability, which, if you have millions of users, is a very, very bad thing. Typically, approaches which try to make this algorithm scale more effectively reduce the quality of its recommendations significantly. So our approach is different to both of these, although in some ways it shares properties with both of them. Our idea is that we want to build a mathematical model of your interests, what you like and what you don't like, and simultaneously build a mathematical model of the qualities that particular items have, and for any user and item we can combine these mathematical models and they will spit out a prediction of how much that user will like that thing. Obviously these models start out essentially being random, but what you do is take past data, "user A liked item B this much," feed it into the model, and make tiny tweaks to the model so that it produces the appropriate output on the training data. This technique is called gradient descent, and I'll talk a little more about it in a second. And then we use this model, which is hopefully now giving fairly good predictions on the training set, the data that it has seen, to produce predictions of how much a given user will like a given thing that they haven't seen yet.
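The training loop just described, where tiny tweaks bring the combined user and item models closer to reproducing past ratings, can be sketched roughly as follows. This is a toy, not the talk's actual implementation: the users, ratings, feature count, learning rate, and regularization value are all invented.

```python
# A toy sketch of the model described above: each user and each item gets a
# small vector of latent features, and the predicted rating is their product.
# Gradient descent nudges both vectors so predictions match past ratings.
# All names, ratings, and parameter values here are invented.
import random

random.seed(0)
N_FEATURES = 4            # the talk settles on 64; 4 is plenty for a toy
LR, REG = 0.01, 0.02      # learning rate and a little regularization

ratings = [("alice", "robocop", 5.0), ("alice", "notting_hill", 1.0),
           ("bob",   "robocop", 4.0), ("bob",   "the_matrix",  5.0),
           ("carol", "robocop", 1.0), ("carol", "notting_hill", 5.0)]

U = {u: [random.uniform(-0.1, 0.1) for _ in range(N_FEATURES)]
     for u in sorted({r[0] for r in ratings})}
V = {i: [random.uniform(-0.1, 0.1) for _ in range(N_FEATURES)]
     for i in sorted({r[1] for r in ratings})}

def predict(user, item):
    return sum(a * b for a, b in zip(U[user], V[item]))

for _ in range(3000):                  # many passes over the training data
    for user, item, r in ratings:
        err = r - predict(user, item)  # how wrong is the model right now?
        for f in range(N_FEATURES):    # tweak each feature slightly downhill
            uf, vf = U[user][f], V[item][f]
            U[user][f] += LR * (err * vf - REG * uf)
            V[item][f] += LR * (err * uf - REG * vf)

print(predict("alice", "robocop"))     # should now be near alice's actual 5.0
```

After training, calling `predict` on a user and an item the model has never seen paired together is exactly the prediction step described above.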
So this broad technique is something that's used quite a lot in artificial intelligence research. It's called gradient descent, and the way to think about it is that your current solution to a problem is like a ball on an uneven surface like this one here, and the lower the point on the surface, the better your solution is doing. So in the case of our application, the more accurately our algorithm is predicting user behavior, the lower this ball is going to be on this surface. In gradient descent you simply put the ball on the surface in an initially random position and it rolls downhill; what's going on is that at each stage you're essentially figuring out, what slight modification can I make to the properties of this mathematical model such that it will yield better results? And you do this millions of times (in the case of the data set I've been working with, 100 million ratings), and you repeatedly go through them over and over again, and for each rating you modify your model ever so slightly so that it gives output that's closer to what you're looking for, and eventually it starts to produce results that are pretty close to your training set. I'll wait. I don't know if we're supposed to evacuate; I'm guessing not. I assume it's going to stop any second now. Okay. At least this is also happening to all the other speakers; that makes me feel better. Okay, I'll try to continue and ignore it. So gradient descent has pretty diverse applications, as I was mentioning. It's used quite a lot in artificial intelligence: whenever you have a problem to solve, you're not quite sure how to solve it, and you'd like a computer to figure out how to solve it on your behalf, gradient descent is a very effective way to do that. This is not real. They haven't switched it off yet. Okay.
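The ball-rolling-downhill picture can be shown in miniature on a simple two-variable function whose lowest point we know in advance. The function, step size, and step count here are invented for illustration.

```python
# The ball-on-a-surface picture in miniature: gradient descent on the bowl
# f(x, y) = (x - 3)^2 + (y + 1)^2, whose lowest point is at (3, -1).
def gradient_descent(x, y, lr=0.1, steps=200):
    for _ in range(steps):
        dx = 2 * (x - 3)   # slope in the x direction
        dy = 2 * (y + 1)   # slope in the y direction
        x -= lr * dx       # roll a little way downhill
        y -= lr * dy
    return x, y

x, y = gradient_descent(10.0, 10.0)    # drop the ball at an arbitrary spot
print(x, y)                            # it ends up very near (3, -1)
```

The rating model's "surface" has one dimension per model parameter rather than two, but the step is the same: measure the slope, move a little downhill, repeat.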
To give an example of the diverse problems that gradient descent algorithms can be used to solve, here's one. In this case these guys simulated a simple humanoid figure existing in three-dimensional space, and they simulated basic rules of physics like gravity, friction, and the fact that two things can't be in the same place at the same time. They then used a neural network to control the humanoid figure. For those of you not familiar with it, a neural network is a technique used in artificial intelligence quite a bit: a network of neurons that is a loose approximation of the way the brain works, and you can use a gradient descent algorithm to train a neural network to do stuff that you want it to do. So in this particular simulation, the fitness of this thing, in other words the point at which the ball is at its lowest, is where the humanoid figure achieves its maximum jump height. So the gradient descent algorithm is going to try to figure out a way to make this guy jump as high as he possibly can, and you'll see, I'm going to play a video of this, that he actually does improve with time as the gradient descent algorithm does its thing. Okay, so here you can see the humanoid figure, and I found a little bit of a soundtrack that seemed to go with this video, so let's see if it works. It's getting better. Okay, so hopefully you could see that it started out pretty dumb, flailing around, but by the end of it he was getting some pretty serious air time. So this mathematical model that we want to create, that's going to represent what people want and what properties the items they want have: how are we actually going to do that in practice? Well, the approach is pretty simple. In this case I'm taking the example of a movie I quite like called Robocop, and what I've done here is I've broken down my movie preferences along essentially four dimensions.
You can see here that I quite like action movies, violent movies are okay, I really like sci-fi movies, and I do not like romance. So Robocop: it's got a decent amount of action, a lot of violence, it's definitely sci-fi, and there is no romance whatsoever. So how do we take my preferences and Robocop's qualities and turn them into an actual number? Well, it's pretty straightforward, really. We give each of my preferences a name: A, B, C, and D for the four preferences you saw there. We give each of the qualities of the movie their own names: E, F, G, and H. And the rating is, very simply, A, how much I like action movies, which could be a positive or a negative number, multiplied by E, which is how much action the movie actually has, which can also be a positive or negative number. I multiply each of these pairs together, and the sum of them gives me the rating. So, I can just flip back here... sorry, I lost my mouse pointer. Okay, so you can basically see: action multiplied by action leads to a high number; romance multiplied by romance, because they're both negative, also leads to a high number; so add them all up and I like Robocop. Okay, but there's a question. This thing is being a little bit slow. Okay. How do we actually figure out, for a given user, how much they like violence, how much they like this, and similarly for movies, how do we figure out what the properties of the movies are? We can't exactly go laboriously through each movie and assign these things; that would take a very long time. And we definitely can't go through every single user and ask them how much they like violence, how much they like this, how much they like that, because users probably can't be bothered to answer that many questions, and secondly, the answers they give probably won't be very good. So how do we assign numbers to each of these values such that we get accurate ratings?
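The multiply-each-pair-and-sum rating formula just described is a dot product. Here is a sketch with invented preference and quality numbers; in the talk's convention, positive means "likes" or "has" and negative means "dislikes" or "lacks".

```python
# The multiply-and-sum rating formula from the talk, with invented numbers:
# positive means "likes"/"has", negative means "dislikes"/"lacks".
my_prefs = {"action": 0.8, "violence": 0.3, "sci-fi": 0.9, "romance": -0.7}
robocop  = {"action": 0.6, "violence": 0.9, "sci-fi": 0.8, "romance": -0.9}

# pair up preference and quality, multiply, and sum: a dot product
rating = sum(my_prefs[f] * robocop[f] for f in my_prefs)
print(rating)   # all four products are positive, so a high rating
```

Note how the romance term helps rather than hurts: a negative preference times a negative quality is positive, which is exactly the "romance multiplied by romance" point above.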
And that's basically where the gradient descent algorithm comes in. Similarly, how do we decide what the important features are? I came up with violence, sci-fi, romance, and so on, but we don't really want to be in the business of trying to figure out what the important qualities of a movie are that will dictate whether or not somebody likes it. So what we do is: we don't need to figure that stuff out. We can just let the gradient descent algorithm tell us. How does that work? Well, essentially what we give the gradient descent algorithm is a bunch of movies, a bunch of users, and a set of ratings that match those users to those movies. And we basically tell the gradient descent algorithm: okay, you've got 64 features (that's what I typically use). They can mean whatever you want them to mean; you just need to assign numbers to each of those user features and each of those item features such that you are able to distill that user's interests down to something that will yield accurate predictions. And what happens is this thing will figure it out for itself. We probably won't even understand what the features it's figured out mean; they will just make sense for it. But later on I'll show some visualizations where you'll hopefully get a sense of what it figures out. So I've done this, and I'll show you what it looks like a little bit later. But does this approach work? Well, about two years ago, a year and a half ago, Netflix wanted to stimulate innovation and research in this field.
And so they did a very good thing: they took several years' worth of the data that they had collected, the ratings by users on items, and they released it onto the internet. 100 million ratings, half a million users, 17,000 movies. And they basically said to people: take this data, try to build a piece of software that can predict what users are going to rate, and we will test what you've built on a test set of data that we haven't given to you. If your algorithm performs better than everyone else's in a given year, they'll give you 50,000 dollars, and if you can achieve a score below 0.85-something, they'll give you a million dollars. Nobody has done that, and it's actually quite likely that nobody will be able to. But I'll talk a little more about why that is, and about the properties of the way they've chosen to measure the success of these types of algorithms. Basically, what they use to determine how successful this type of collaborative filtering algorithm is is a metric called root mean squared error, and that's pretty simple. You train the algorithm on a training set; you then give it a probe set, which is combinations of users and items that the algorithm has not seen, that it has not been trained on; you let the algorithm make a prediction for each given user and item; you measure the difference between what that user actually rated that thing and what was predicted; and then you take the mean difference between what was predicted and what the user actually did.
It's slightly more complicated than that, because they use a root mean squared error, and all that means is that they take the squared differences between what was predicted and what the users actually did, get the mean of that, and then take the square root of the whole thing. The reason they use this type of mean is that they want to more aggressively punish very, very bad predictions. So if you predict that a user is going to love a movie and they actually hate it, that's a very, very bad thing, and by using root mean squared error, this leads to a significantly higher error, so you're punished for really dumb predictions. And then, as I mentioned, it's very important to separate out the probe set, the data you're actually going to test the algorithm with, from the training set. If you look at the forums around the Netflix prize where people discuss it, about once a day you'll see somebody appear saying, oh my God, I just won the prize, and invariably the response is: are you testing your algorithm on the training set? Typically when you do that, you get root mean squared errors around 0.65 or 0.7, which would mean you'd just become a millionaire. But you have to separate out the data, because the algorithm will essentially cheat; there's a degree to which it will memorize the data that it sees, so you can't test it on your training set. So how do we do? The approach I just described scores 0.905 on the Netflix probe set. That's about 5% better than Netflix's own algorithm, which is approximately 0.95-something. But some algorithms actually do significantly better than that; they can get down to 0.864. How are they able to do that? What improvements can we make to get better predictions, and do we want to? I'll talk about that now.
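The root mean squared error computation just described looks like this; the predictions and actual ratings below are invented. The second call shows the punishment effect: the same total error concentrated in one bad miss scores worse than two small misses.

```python
# Root mean squared error as described: square the prediction errors,
# average them, take the square root. The ratings here are invented.
from math import sqrt

def rmse(predicted, actual):
    return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual))
                / len(actual))

actual = [5, 3, 1, 4]
print(rmse([4, 3, 2, 4], actual))  # two misses of one star each
print(rmse([5, 3, 3, 4], actual))  # one miss of two stars: punished harder
```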
In some cases they do it by exploiting a flaw in the way that Netflix measures success. To give an example, let's say your algorithm predicts things very, very well for high ratings: if a user is going to like a movie, your algorithm is very accurate at predicting that. But let's say your algorithm is really bad at predicting the stuff you don't like: if you're going to rate something a one or a two, maybe it predicts a three or whatever; it's just bad at predicting what you'll rate on movies you don't like. Let's take another algorithm which is merely okay at predicting movies that you like, and okay at predicting movies that you don't like. Well, the way that most collaborative filtering algorithms are used, the first algorithm would be much better, because the way most people use collaborative filtering algorithms is: you generate a bunch of predictions for a given user, you take the top 10 or top 20 predicted ratings, and you present those to the user. You never present movies or items with low predictions to the user, so it really isn't terribly important how accurately your algorithm predicts low ratings, but it's very important for high ratings. Now, the root mean squared error metric will actually prefer the second algorithm, even though in practice the second algorithm is going to do a worse job.
There are other things as well. One of the things that a lot of the approaches to the Netflix prize do is look at user mood: they'll look at other movies that you rated today and try to figure out, is this guy in a good mood, maybe he's drunk or something and he's rating everything pretty high, and then they'll raise their prediction for you because they've decided you're in a good mood. They will actually achieve a better root mean squared error that way, but in practice this has no bearing on the actual effectiveness of the collaborative filter, because the only thing that matters is your predicted ratings relative to each other. So this is another way you can do better on the Netflix root mean squared error metric without actually yielding a better result. Now, I said I'd talk about what the algorithm actually did. To create a visual representation of what this algorithm is doing, I set up a run of it and trained it using just three features for both users and items, which is not very much and really doesn't perform very well at all, but it has a nice property: if you have three floating point numbers associated with every movie, you can graph those things in three dimensions. And that's what I've done here. So each of these dots is a movie. I've also labeled some of them so you can get a feel for whether this algorithm's perception of a movie's qualities has any relationship to what the actual movie is about. In this case you can see that violent movies are red, and they're actually spread. There isn't really any obvious clustering with violent movies; you can see that they're spread throughout the field completely.
Movies about Vietnam (it just turns out there are a hell of a lot of movies about Vietnam, relatively) you can see in dark green are somewhat higher up, so there is some degree of clustering, but it's clear that there isn't a neat little violence cluster or a neat little Vietnam cluster. So it shows that the three features that the algorithm determined would be effective probably don't map directly to the types of classifications that we would apply to movies. The flaws in the root mean squared error metric I've already talked about, so I'll just skip this. But basically, root mean squared error is okay, but its applications to the real world are quite limited. One way in particular that root mean squared error falls down: let's say you want to use this type of approach to do targeted advertising. If you're trying to pitch somebody on an algorithm that will improve targeted advertising, the first question they will ask you is, what percentage increase in revenue will this yield? You cannot determine that from root mean squared error. So if you say, hey, the thing scores 0.905 on the Netflix data set, that's absolutely meaningless to advertisers. And that's true of many, many situations. It's a very academic metric, but academics seem to really like it despite its flaws. So what are the alternatives to root mean squared error that Netflix could, and probably should, but almost certainly won't use? The simplest one that I've come up with models the way you actually use these algorithms quite closely. What you do is rank the items for a given user by how much your algorithm predicts they will rate those items.
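This ranking approach, keeping the top-N predicted items and scoring them by their actual ratings in the held-out probe set, can be sketched as follows; the predictions and probe ratings are invented.

```python
# A sketch of the ranking-based metric being described: sort items by
# predicted rating, keep the top N, and report the average actual rating
# of those that appear in the held-out probe set. All data is invented.
def top_n_score(predictions, probe, n=20):
    ranked = sorted(predictions, key=predictions.get, reverse=True)[:n]
    actual = [probe[item] for item in ranked if item in probe]
    return sum(actual) / len(actual) if actual else None

predictions = {"a": 4.9, "b": 4.5, "c": 4.2, "d": 2.0, "e": 1.5}
probe       = {"a": 5, "c": 3, "e": 1}      # what users actually rated
print(top_n_score(predictions, probe, n=3))  # items a and c are in the top 3
```

Unlike RMSE, this number only rewards accuracy where it matters in practice: among the items you would actually show to the user.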
You set a threshold, so you have a cutoff at, say, the top 20 items, and then, for any of those items that are in your probe set (the data you've set aside and haven't trained the algorithm on), you look at the average actual rating of the items that find their way into this top-20 set. And that is a metric that you can translate directly into, if you're talking to advertisers for example, the percentage increase you're going to get if you use this type of approach to target your ads to your users. So where can we take this in the future? Actually, a lot of this I've developed further than what's described in this talk, so I'm talking about the future, but this is stuff that I'm doing right now. Can we incorporate more data? Are there ways that we can take more diverse data and use it to make better predictions for users? Well, the problem with most collaborative filtering algorithms is that the only form they accept data in is "user A liked item B this much." But there's a lot of information available to you about both users and items that is not in that form. For example, let's say you're Netflix and you're operating a website. There is a whole load of things that you can determine about users based on even the HTTP requests coming in. You can look at their user agent string and figure out: are they a Mac user, are they a Windows user, do they use Firefox, do they use Internet Explorer? All of this metadata potentially holds information that you could use to more effectively predict what this user is likely to be interested in. The user's geographic location and many other things are also available to you from the very first moment that user visits your website. Why can't you use this stuff? Well, actually, we did figure out a way to use this information. In the software that we wrote, you can tag both users and items.
So if you know a user is using Firefox, you can give them a Firefox tag. If you know a user is using Windows, you give them a Windows tag. If you know they're in Las Vegas, Nevada, you can give them a Las Vegas, Nevada tag. It will then look for correlations between these tags and actual user behavior. And you might ask, well, what does the fact that I use a Mac really say about me? It turns out that in some of the experiments we've done, for example one on an online news website, the algorithm discovered that Mac users are 20% less likely to click on a news story about religion relative to Windows users. Because Steve Jobs is their god, they don't believe in any other, I assume. And we found all sorts of crazy correlations like that. It turns out Yahoo Mail users are a lot more likely to click on anything relative to Gmail users; they cannot see a hyperlink without clicking it. And as a result, I do work with targeted advertising companies that send email (they claim they're not spammers because it's targeted, opt-in, blah, blah, blah), and what's amazing is that for almost all of them, approximately 80% of their users use Yahoo Mail. And they don't really know why; they think maybe there's something about the Yahoo Mail user demographic that makes them very gullible. So anyway, all of this metadata is potentially very useful, and we figured out a way to give this metadata to our algorithm so that it will identify correlations in user behavior using it. And that allows it to bootstrap much more quickly. This is critical for many applications, because as I said at the outset, one of the flaws specifically with user-based collaborative filters is that they require a very dense data set to be effective.
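A hypothetical sketch of the tag-correlation idea: tag each user with metadata, then compare behavior rates per tag. The events below are invented, not the talk's actual news-site data.

```python
# A hypothetical sketch of the tag-correlation idea: tag users with metadata,
# then compare click rates per tag. The events below are invented.
from collections import defaultdict

# (user's tags, did they click the story?)
events = [({"mac", "firefox"},     False),
          ({"windows", "ie"},      True),
          ({"windows", "firefox"}, True),
          ({"mac", "safari"},      False),
          ({"windows", "ie"},      True),
          ({"mac", "firefox"},     True)]

clicks, views = defaultdict(int), defaultdict(int)
for tags, clicked in events:
    for tag in tags:
        views[tag] += 1
        clicks[tag] += clicked     # True counts as 1

rates = {tag: clicks[tag] / views[tag] for tag in views}
print(rates["mac"], rates["windows"])   # mac users click far less often here
```

In the real system these correlations feed the model as a prior, which is what lets it make non-random predictions for a brand-new user from the first HTTP request.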
But even the approach that I've described, this gradient descent approach, still requires quite a bit of data about users, and depending on the scenario, you typically don't know a lot about most of your users. If you're operating a website and you want to use a collaborative filter to show people stuff they're going to like, and the guy who sold you the collaborative filtering algorithm says, yes, as soon as a user has spent 100 hours on your website we'll be able to start recommending stuff that's relevant to them, that's completely useless. So you really need a way to bootstrap your model of what a user's interests are, and feeding the algorithm any scrap of metadata you can get about that user is a way to do that. The feature model, this mathematical model, is pretty simple: you basically multiply things together and then add them all up. Could it be more complicated? Yeah, there probably are ways that you could throw some more complicated formulae in there, some trigonometric formulae, and maybe that would allow the model to more effectively represent what users are interested in. This is something on my to-do list; I haven't really done it yet, but it seems that if I could use a genetic algorithm to come up with a mathematical formula that I then train to predict user behavior, I may be able to get better results. One of the other things I've done, and then I think we'll probably have some time for questions, is about the circumstances in which the recommendation is taking place. I know that if I'm listening to music, for example, my tastes in the morning are probably different from the type of music I want to listen to at 10 o'clock at night. Is there any way we can take advantage of this?
Well, it turns out that you can, and there's a mechanism that we call circumstance tags, where you can basically say: give me some recommendations for this user; by the way, it's evening time where this user is. Or: give me a set of recommendations for this guy; by the way, he's looking at recommendations on this page of our website. So you can essentially describe the circumstances that the user is in at that moment, and the collaborative filtering algorithm will try to reverse engineer patterns in user behavior as they vary, perhaps throughout the day if you're using time of day as a circumstance tag, or perhaps as a user's behavior changes on different parts of your website, and you can use that information as well. Lastly, I've got some videos online. If you go to sensearray.com: a lot of the research I've described is part of the software, SenseArray, that I've been working on, so feel free to check it out, and there are email addresses there if you think of any questions while doing that. But since I think I've got a few minutes, if anyone has any questions, feel free to ask. Right, so the question was, if I can condense it: what if four features do not adequately represent the qualities of an item, from the perspective of whether a user may or may not be interested in it? And you're right. What I find is that the fewer features I give the algorithm, the less well it performs. I gave it three features to generate that crazy rotating cloud thing with all the dots in it, and that performed very badly: I think it was 0.96 RMSE, which is worse than Netflix's own algorithm. As you increase the number of features, the performance increases, because the system is able to figure out a more nuanced idea of what the user is interested in.
And what I find is that as I keep increasing the features, once you hit about 64, the increase in performance starts to level off on most data sets, which suggests that you can do a reasonably good job of representing what somebody is interested in in 64 floating point numbers. And that's pretty good, because 64 floating point numbers is not a lot of data. That gives you a lot of flexibility: if you wanted to, you could very easily share with other people, here is my block of 64 floating point numbers that represents my interests; go and give me stuff that I like. Of course, there are privacy implications to doing that, but from a technology point of view, it's pretty exciting. So, at the back: the question was about training time. That's absolutely right; the time required to train increases pretty dramatically as you increase the sophistication of the algorithm. When I first implemented this, it took about nine minutes for one training cycle on 100 million ratings, and it would typically start giving good output after maybe 40 or 50 training cycles, so within a day it's performing pretty well. That's on a small number of features. As you increase the number of features, there are several things that slow it down. Firstly, there are just more numbers it needs to multiply and add. Secondly, the gradient descent algorithm is operating in a space with higher dimensionality: the surface that I had a rotating picture of is two-dimensional, but imagine trying to do gradient descent in 64-dimensional space. It just takes longer to find the lowest point on a 64-dimensional surface. Yeah, so you're absolutely right.
It does slow down dramatically as you increase the number of features, but the good news is you really only have to train the collaborative filter once, so it doesn't really matter if it takes even a couple of days, and then after that, you really just kind of need to maintain it as you collect more data. So scalability and kind of CPU time tends not to be a bottleneck in this type of collaborative filter, although many collaborative filters are extremely CPU intensive, and that poses a problem. We've got two minutes left. Any more questions? One there? Okay, so the question was what about how people's interests change over time, and I'm not sure what Netflix does about that. Your experience would suggest they don't do anything. My approach to that is my algorithm will forget old data. So basically when it's decided it's using too much disk space, it'll start throwing away the oldest ratings, and there is an extent to which it will focus more on your more recent ratings, but I think there is a lot of scope to examine just how quickly do people's interests change. My approach to that is a pretty kind of brain-dead approach, but it seems to be relatively effective, but you're absolutely right, people's interests do change over time. A lot of collaborative filters don't acknowledge that fact. One minute left, I think. Okay, thank you very much.