OK. Hi. I am Sophie Watson, and I work at Red Hat, and today I want to tell you about learning to rank. So raise your hand if you've ever used a search engine and not found what you're looking for within the first few items returned. OK. Pretty good. Perhaps you've then gone on to amend the query that you originally searched for and the results get worse. Sometimes you might even find yourself having to click on that dreaded button to take you to the second page of Google search results. Well, we're going to be tackling that problem today. So contemporary search engines are often good at finding a whole load of documents or web pages that might answer your question or your query, but are often not quite as good at returning the ones that you really want. And search engines are everywhere. We're not just talking about Google and Bing here, so if you look at this slide behind me, you can see that there's a lot of money and a lot of happy customers riding on returning good search results. So to begin this talk, I will tell you about some non-machine learning methods that are currently used for search and have been used for a while. We'll talk about why they aren't so good and why we might need something else, something more general. We'll outline the problem of learning to rank properly and we'll look at how we summarise the queries that users make, and the documents, into feature vectors. We'll then look at three different machine learning algorithms, explain what they do and how they work, and then we'll go on to see them implemented, using XGBoost. So to begin with, let's set up the problem at hand. The first thing we're starting with is a query. So what I mean when I say a query is I'm thinking text search. So this might be something like a phrase like neural networks or it might be an explicit question like how do you implement neural networks in Python? In the movie search case, it could be something like Christmas films or something more specific like Christmas films starring Hugh Grant. Is everyone happy with what a query is? Anyone unhappy? Okay, we'll take that as a no. In addition to the query, we also have a set of documents. So a document here, I'm using that term quite generally. I'm assuming that it's probably got some sort of structure, so we might have a title, a subtitle, and then main text, but it might be something like a film title, a film blurb, a list of actors in the film. There are probably going to be use-case-specific items in each document. For example, if it's a web page, we would class the URL as part of the document. So the goal of learning to rank is, unsurprisingly, to rank documents. We want to take that query that the user has given and return to them a list of documents in order of relevance. So a standard method that you might have seen used, or you might have implemented yourself, is a full-text database search, for example Postgres full-text search. So the way this works is that the algorithm looks for matches between the query and the documents, so I've circled where they are within the document in this case, and then it ranks the documents based on those matches. So something nice about things like Postgres full-text search is that the user, well, the producer of the system, can state which parts of the document they feel are most important. So if you feel that for this specific query or class of queries the subtitle holds more weight than the title, you can encode that in your algorithm, and hence we'd get returned something that looks like this.
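To make that a bit more concrete, here is a minimal sketch of that kind of weighted full-text ranking in Postgres, driven from Python. The table and column names (films, title, blurb) and the connection string are invented for the example; setweight lets you mark the title as more important ('A') than the blurb ('B'), and ts_rank orders the matches.

    # A minimal sketch of weighted full-text ranking in Postgres, driven from
    # Python. The "films" table, its columns and the connection string are
    # hypothetical; setweight, to_tsvector, plainto_tsquery and ts_rank are
    # standard Postgres full-text search functions.
    import psycopg2

    conn = psycopg2.connect("dbname=example")
    query = "christmas films"

    sql = """
        SELECT title,
               ts_rank(
                   setweight(to_tsvector('english', title), 'A') ||
                   setweight(to_tsvector('english', blurb), 'B'),
                   plainto_tsquery('english', %s)
               ) AS score
        FROM films
        ORDER BY score DESC
        LIMIT 10;
    """

    with conn, conn.cursor() as cur:
        cur.execute(sql, (query,))
        for title, score in cur.fetchall():
            print(title, score)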
Another standard ranking algorithm that is used is PageRank. So PageRank works very well; it's essentially what Google uses, so we probably use it many times a day. There are some negatives with PageRank. One of those is that it often downweights articles or documents that have fewer links in them, fewer URLs. Another negative of PageRank is that the algorithm works using the knowledge of a crowd, right? So it grows as the crowd grows, and crowds can make bad decisions, and then other people can latch on to those bad decisions and the bad decisions get worse. Both of these two examples, PageRank and the Postgres full-text search, require some user input as well. So in general, you need to know the domain, you need to know what's happening. And we can't really use something like PageRank for movies, for example, because there's no link structure to follow. So, although they do work fine in their specific cases, we want something that's much more general. So that's why we're going to use machine learning techniques. So all of the machine learning techniques that we're going to talk about today begin the same way. We've got ourselves a query and a document, and what we want to do is capture the information overlap between the two of those. So the way that is done is to generate a feature vector. For every query-document pair, we want a feature vector. So this feature vector is a numeric vector, we've got numbers in it, and we need to figure out how we make it. So we can do this fairly automatically. Some of the terms in the feature vector are what we call query dependent, and they depend on both the query and the document itself. So they might be things such as how many words from the query appear in the title of the document, how many words from the query appear in the midsection of the document, how many words into the document do I need to count before I reach a word that's in the query. You can think of lots of examples of features that exist there. There's a common statistic known as IDF, or inverse document frequency, which is often used to summarise query dependent features. So there are lots of methods there that we can use to get those. Some of the elements in this feature vector are going to just depend on the document. They're going to be independent of the query. So they might be things like the length of the document, how many characters are in it, how many words are in it, how many paragraphs does it have, the kind of summary you get when you do a word count on a document in Word. And the last set just depend on the query. So those are things like how long is the query, how many letters are in it, how many words are in it. So you can see that for any query-document pair, we can generate a feature vector quite easily. Once we have our queries and our documents and we've generated our feature vectors, we're nearly ready to learn to rank. But sadly, learning to rank in the machine learning setting is a supervised machine learning method. So what we need is a human to go through and give us a score, and what that score is going to say is how relevant is this specific document for this query.
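Before moving on to the algorithms, here is a minimal sketch of how one query-document pair could be summarised into such a feature vector. The particular features, the document layout and the doc_freq table are all invented for illustration; the MSLR data we'll meet later has 136 features built along these lines.

    import math

    def make_features(query, doc, n_docs, doc_freq):
        """Build a small, hypothetical feature vector for one query-document pair.

        doc is a dict with 'title' and 'body' text; doc_freq maps a term to the
        number of documents in the corpus containing it, which gives us IDF.
        """
        q_terms = query.lower().split()
        title_terms = doc["title"].lower().split()
        body_terms = doc["body"].lower().split()

        # Query-dependent features: overlap between the query and the document.
        title_matches = sum(t in title_terms for t in q_terms)
        body_matches = sum(t in body_terms for t in q_terms)
        idf_sum = sum(math.log(n_docs / (1 + doc_freq.get(t, 0))) for t in q_terms)

        # Document-only features: independent of the query.
        doc_length = len(body_terms)

        # Query-only features: independent of the document.
        query_length = len(q_terms)

        return [title_matches, body_matches, idf_sum, doc_length, query_length]

    doc = {"title": "Christmas films starring Hugh Grant",
           "body": "A list of festive films to watch over the holidays"}
    print(make_features("christmas films", doc, n_docs=1000,
                        doc_freq={"christmas": 40, "films": 300}))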
So there are three types of machine learning algorithms that I want to talk about today. The first is point-wise ranking. So the idea with point-wise ranking is that we take our documents, we turn them into feature vectors, we know how to do that now, and then we go from the feature vectors to a score. So we are trying to predict a score given the feature vector. Once this has been repeated over many documents, many feature vectors, we're able to take those predicted scores and rank our documents from best to worst. Okay, so now we know how to rank. All we need is an algorithm that takes us from the feature vector to the predicted score. So I think we should begin with everybody's favourite machine learning algorithm, linear regression. So what we're going to do is fit a straight line, or rather, the feature vector is actually multi-dimensional, right? We saw it has lots of different components, so this is going to be like fitting a plane in space rather than just fitting a straight line. Once we've got that line that we fit using some training data, we can come along with a new feature vector and estimate a score for it, right? Well, why not take it one step further? Let's get rid of the straight lines and go to curved lines. So you can think of any type of regression you might want to do that's going to fit something, and we can use that to learn to rank. Okay, so we've already seen arguably two learning to rank algorithms: one is linear regression and one is parametric regression. If we've got two algorithms, how are we going to compare them? We need some sort of distance metric that's useful in the ranking case. So I'm assuming we've got ourselves a data set where a human has gone through and scored those documents. They've said this document is relevant, here's the score I give it; this document isn't, so it gets a worse score. And because we have those scores, what we can do is compute mean squared error on a testing set. So we can split our data into two parts. We might put 60% into the training set and put the remaining 40% into a testing set, and we can evaluate a metric on that. So what we would do is take the true score given by the human away from the predicted score, square it, and average those differences over the testing set. So that's quite nice, but if we think about the problem at hand, when you go on a search engine you never see those scores. You don't particularly care what score the human gave or what score the model predicted. All you really care about is the order in which the documents are returned, right? We want the best at the top, the worst somewhere, not at the top. So I want to talk about a distance metric that takes that into account. So what I've done is I have taken the documents that we had scores for and I've just replaced the scores with the predicted rank. So the document that we give the highest score to, we give that rank one. The document with the lowest, we've got six documents here, so that gets rank six. Now, because we're doing this in a training and testing environment, we do know the underlying truth, so we're able to list the true ranks; they're what the human gave in that labelling task. And all we're going to do now is count how many swaps it takes to go from having that top list of numbers, one to six, in order, to having the bottom list in order. So we're trying to go from what we were given to what we really wanted. So one, two, three. Okay, so that would give us a different distance metric, and here that is three.
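As a small sketch of both ideas, point-wise scoring with a regression model and then the swap-counting distance, here is what that might look like. The feature values, scores and the six-document ordering at the end are all made up:

    import numpy as np
    from sklearn.linear_model import LinearRegression

    # Hypothetical training data: one feature vector and one human-given score
    # per document (for some query).
    X_train = np.array([[3, 5, 1.2], [0, 2, 0.4], [1, 1, 0.9], [4, 6, 2.0]])
    y_train = np.array([3.0, 0.0, 1.0, 4.0])

    model = LinearRegression().fit(X_train, y_train)

    # Point-wise ranking: predict a score per document, then sort by it.
    X_new = np.array([[2, 4, 1.0], [0, 1, 0.2], [3, 5, 1.5]])
    predicted_scores = model.predict(X_new)
    ranking = np.argsort(-predicted_scores)   # document indices, best first

    def adjacent_swap_distance(predicted_ranks, true_ranks):
        """Count the adjacent swaps needed to turn the predicted ordering into
        the true ordering (a bubble-sort style distance)."""
        order = [true_ranks[i] for i in np.argsort(predicted_ranks)]
        swaps = 0
        for _ in range(len(order)):
            for j in range(len(order) - 1):
                if order[j] > order[j + 1]:
                    order[j], order[j + 1] = order[j + 1], order[j]
                    swaps += 1
        return swaps

    # Six documents whose predicted ranks are 1..6 but whose true ranks disagree:
    print(adjacent_swap_distance([1, 2, 3, 4, 5, 6], [2, 1, 4, 3, 6, 5]))  # 3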
So that's quite nice, but swapping the first two documents counted exactly the same as swapping the bottom two documents, and really what we care about is what's up at the top. We want to weight the fact that our first two documents are out of order much more heavily. That's much more of a problem, so we want it to have a stronger pull in our distance metric than the two down at the bottom. We don't really care about those; we're not going to look at them. So what we really want is a metric that combines those two methods. We want something that takes into account the score, the value, and also takes into account the location. And the metric that does that for us is Normalised Discounted Cumulative Gain, or NDCG. Pretty much all of the learning to rank competitions you might see on Kaggle use NDCG. So NDCG requires the predicted rank, the true rank and the true score, and it uses this equation here: each document contributes a weighting of its true score over the log of its predicted rank, something like score / log2(rank + 1), and we sum those contributions to compute our distance metric. Now, the log on the bottom, the contribution from the predicted rank on the bottom, is ensuring that we care more about what's happening up at the top and much less about what's happening at the bottom of our list. So we can just compute that for each of our documents, plug in a number, and we're nearly done. Now, what I told you then was actually just discounted cumulative gain, and that's useful if you only want to check this out once, but if you want to do it multiple times, or compare across queries, which we probably want to do, or across data sets, you're going to want to normalise it so that the results are comparable. So to normalise it, we divide by the optimal discounted cumulative gain. That is the discounted cumulative gain you'd get if your documents were returned in exactly the right order. So NDCG is going to take values between zero and one, and one is great. One means we returned the optimal list, and zero is terrible; it couldn't have got any worse. Another common metric that people use is NDCG at k, where k is an integer. So you truncate your list and only compute it on the first k items. That again allows for better comparison when you're in production and you have queries that return potentially different numbers of documents. So for one query you might get a million documents returned, but for another you might only get a thousand. In that case, you might want to just compute the NDCG at a thousand, and then those two are comparable.
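Here is a minimal sketch of DCG and NDCG at k, using the score over log2(rank + 1) form described above; note that other formulations exist, for example putting two to the power of the score, minus one, in the numerator. The relevance labels at the bottom are invented:

    import numpy as np

    def dcg_at_k(true_scores_in_predicted_order, k):
        """Discounted cumulative gain over the top k positions.

        true_scores_in_predicted_order[i] is the human-given relevance of the
        document our model ranked at position i (position 0 = top of the list).
        """
        scores = np.asarray(true_scores_in_predicted_order, dtype=float)[:k]
        discounts = np.log2(np.arange(2, scores.size + 2))   # log2(rank + 1)
        return float(np.sum(scores / discounts))

    def ndcg_at_k(true_scores_in_predicted_order, k):
        """Normalise by the DCG of the ideal (perfectly sorted) ordering."""
        ideal = sorted(true_scores_in_predicted_order, reverse=True)
        ideal_dcg = dcg_at_k(ideal, k)
        if ideal_dcg == 0.0:
            return 0.0
        return dcg_at_k(true_scores_in_predicted_order, k) / ideal_dcg

    # Hypothetical relevance labels (0-4), listed in the order the model returned them.
    print(ndcg_at_k([3, 2, 4, 0, 1], k=5))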
OK, so I motivated the introduction of NDCG, our metric, by saying that we don't just care about the predicted values, we also care about the ordering. But when we did point-wise ranking, all we were doing was predicting that score, predicting that rating. So pair-wise ranking puts in some effort to think about the ordering of items as well. The way in which pair-wise ranking works is that it takes in two documents at once, or two feature vectors. I'm afraid I might use the terms documents and feature vectors interchangeably now; what we're really passing in is feature vectors. So we pass in our two feature vectors, our two documents, and the model determines which one should be ranked more relevant and which one less relevant. It then uses that to update the parameters in the model to make it better, OK? So we're thinking about tuning our model here. So the first algorithm we want to talk about is RankNet. It's a feed-forward neural network, and essentially we pass in that pair of documents. It passes them through the neural network, which we've initialised with some weights, some parameters. It then looks at whether or not the documents that it spat out were ranked correctly. So once a document has been passed through the neural network, it has a score associated with it. We can look at which is better, which is worse. And what it tries to do is adjust the weights so that the document that was more relevant becomes even more relevant, and the document that was less relevant becomes even less relevant, OK? So we're changing those weights in the neural net so that we're pushing good things up and bad things down. When we come to use this in production, all we actually do is throw our data at it and it gives us our predictions. It is just a model; we can think of it as a black box, but it's good to know what's happening. So if we passed it the first two items here, it would aim to push them up and down because they are in the wrong order. And similarly, if we passed it the bottom two, it would aim to push those up and down as well. It would adjust the weights so that next time we pass that pair through the neural network, the results would be better. Now, that's OK, but it was pushing the things at the top of the list up and down by the same amount as the things at the bottom of the list, and we said we care more about what happens at the top of the list, not the bottom. You're never going to go to the last page of Google, I hope. So LambdaRank is again using that feed-forward neural network technology, but the way in which it pushes the documents, the pairs, once it passes them through, is done so that NDCG is taken into account. So this is what we had before, with our equal pushes, when we were using RankNet. And what we get now is that the documents at the top would be pushed more, so the weights would be updated so that they were pushed more, and the ones at the bottom pushed less.
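To give a feel for that, here is a bare-bones sketch of the pair-wise push, not the full RankNet training loop. For a pair where document i truly should outrank document j, a RankNet-style loss is log(1 + exp(-(s_i - s_j))) on the two model scores, and the LambdaRank idea is to scale the resulting gradient by how much NDCG would change if the two documents swapped places. The delta_ndcg values below are invented:

    import numpy as np

    def pairwise_push(s_i, s_j, delta_ndcg=1.0):
        """Gradient-style 'push' for a pair where document i is truly more
        relevant than document j, given their current model scores s_i and s_j.

        With delta_ndcg = 1.0 this is the plain RankNet-style push; the
        LambdaRank idea is to scale it by |change in NDCG if i and j swapped|,
        so mis-ordered pairs near the top of the list get pushed harder.
        """
        # Derivative of log(1 + exp(-(s_i - s_j))) with respect to s_i.
        lam = -1.0 / (1.0 + np.exp(s_i - s_j))
        return delta_ndcg * lam

    # A mis-ordered pair near the top of the list gets a large push...
    print(pairwise_push(s_i=0.2, s_j=0.9, delta_ndcg=0.35))
    # ...while the same mis-ordering far down the list barely moves anything.
    print(pairwise_push(s_i=0.2, s_j=0.9, delta_ndcg=0.02))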
Now, so those were our pair-wise ranking methods, and they were pretty good. They work quite well; we'll see that in a bit. But there's lots of talk about just pushing good things up and bad things down, and what you tend to end up with is that all of the good documents are at the top, but they're not very well ordered among themselves at the top. So we're going to go the whole hog and move from point-wise to pair-wise to list-wise. So the list-wise ranking method is going to take all of the documents into account at once when it updates the parameters, or the weights, of the model. The list-wise technique we're going to talk about is LambdaMART. And LambdaMART is a decision tree algorithm, so the idea with decision trees is that you start with all your data at the top, and you then split on some feature of the data. So we've got a feature vector, so we might say, hey, if the third element of the feature vector is less than 10, then you go left, otherwise you go right. Okay, so we split the documents, and this keeps on happening until eventually each document ends up at what we call a leaf node on its own, and a score is associated with that leaf. Those scores are then used to rank. So when we're training a model like this, what we're actually learning, using our training set, is what we should split on. And this example here shows just one decision tree, but in practice LambdaMART uses multiple decision trees. There are loads of ways we could split the data if you started with it all at the top. So, in parallel, it sends your data off to multiple decision trees. A specific document, or feature vector, will end up at the bottom of, say, these ten decision trees, and you combine the scores associated with the leaves that your document ended up in to get its score, and then we can do our ranking. Okay, so that's the end of the algorithms. We've seen point-wise, which was our linear regression. We've seen pair-wise, which was RankNet, and then we've seen our list-wise method, which is LambdaMART. And that's nice, hopefully you have some insight into how they work now and what they're doing, but what we really need to be able to do is use them. It's machine learning, so we can't just pen and paper it anymore. So, in order to do that, I'm going to need a data set. The data set I'm using is open source. It's the MSLR-WEB10K data set, which was curated by Microsoft for learning to rank, hence the LR. It's been used in many Kaggle competitions. So, the data itself: we've got 10,000 queries, and then there are multiple documents for each query. A human has gone through and given each of the documents a value from 0 to 4 for the specific query with which it's associated. So a 0 means it's not relevant at all, and a 4 means it's really relevant. So the data set is nice. It's already split into training, testing and validation sets, so there's not very much to do in order to get started, except that the data is in a bit of a funky format. So, this is what one row from the data set looks like. The first thing we've got is the relevance that's given by the human; that's that number between 0 and 4. Next is our query ID, which is written in the form qid, then a colon, then the query ID. And everything else is our feature vector. So, in this example, the feature vector is 136 features long; that's our summary of the query and the document that we saw earlier. And the kth feature is written as k, colon, then its value. So the first feature has value 2, the 133rd has value 7, and so on. So, I've written a bit of code that turns that data set into columnar data. We get rid of the colons and the indices before them, so our feature vector is literally just the list of feature values. And from there, it's really simple to learn to rank. So, if we're going to do this with linear regression, everyone's favourite machine learning technique, you can see that it's three lines. This code is in Python, and I'm using Apache Spark to do it. So, who here has used Apache Spark or has heard of it? OK, pretty good. If you haven't, when I first came across it, I was kind of shocked. It just does all the parallelism and sorts out sending your code off across cores and distributing it, all on its own. You don't have to tell it anything if you don't want to, so it really speeds up the computation. So, you can see here that we can define a linear model, and in order to make predictions we just call transform and we're good.
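That row layout is essentially the svmlight/libsvm format with a qid field, so one convenient way to get it into columns in Python, assuming the file is called train.txt, is scikit-learn's loader rather than hand-rolled string parsing:

    from sklearn.datasets import load_svmlight_file

    # Each row: <relevance 0-4> qid:<query id> 1:<feat 1> 2:<feat 2> ... 136:<feat 136>
    # load_svmlight_file understands this layout directly; query_id=True also
    # returns the qid column, which we need later to group documents by query.
    X, y, qid = load_svmlight_file("train.txt", query_id=True)

    print(X.shape)        # (n_documents, 136) sparse feature matrix
    print(y[:5])          # human relevance labels, 0-4
    print(qid[:5])        # which query each document belongs to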
So, in order to do the more complex methods, the pair-wise and the list-wise, we're going to use XGBoost. So, XGBoost is an open-source, optimised, distributed gradient boosting library, and when we're talking about updating the weights in the algorithms that we talked about earlier, that's actually done by a process called gradient boosting. So we don't really need to know much about it, but the first thing that we have to do if we want to implement one of these is set up our parameters. So, the first parameter, max_depth, is how deep we allow the model to go, so the number of splits we're going to allow in each decision tree; you can think of it a bit like the number of layers in a neural net. The eta parameter is the learning rate, the shrinkage, and it's there to help prevent overfitting. So, whenever you're using machine learning, you never want to overfit your models. If you think back to that linear regression example I showed earlier, if you overfit the linear regression it wouldn't be linear anymore; you could imagine a curve that interpolates all of the points in the training set, and that's going to give a very funky prediction, a very poor estimate of the score, on new data. The next thing we have there is the objective function. So, the objective function is what changes between the algorithms that we've looked at. If you want the RankNet-style pair-wise behaviour, you set that equal to rank:pairwise. If you want LambdaRank-style behaviour, where NDCG is taken into account, you set it equal to rank:ndcg. And there is also rank:map, which optimises mean average precision instead. So, watch out: the objective names don't line up one-to-one with the algorithm names we've been using. Once we've got those parameters set up, training your model is, again, completely simple. So, I showed you the linear regression example to show just how straightforward it is, and this is exactly the same. We can just go ahead and predict. And if you want to check out the code and see whether the ranking results are any good, you can search for that on GitHub, or type that address in directly, and you will end up at a Jupyter notebook that walks you through the methods that I have talked about today. You can implement all of these on that MSLR training set, and you can play around, make the notebook better and put in a pull request. So, something that's really nice about XGBoost is that it has some visualisation. What we're actually viewing here is one specific decision tree. You can see that at the top we split on feature 65. So if, for a particular feature vector, feature 65 was less than this value, which I can't quite read, then we'd go left, else we'd go right, and so on. Another thing that's nice is that you can look at the feature importance. So, for each of the features in our feature vector, you get a graph of how much impact it had on the scoring decision, and therefore the ranking decision. That then enables you to go back, look at what those features were and think: does this make sense? It's always nice to have a sanity check with machine learning. So, I can tell you here that the feature which was most important, you'll have to take my word for it, was how many words appear both in the title of the document and in the query. And the second one was to do with the IDF, the inverse document frequency that we talked about, for the first paragraph of the documents. So that sounds pretty sensible; we'd be happy with those results.
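As a hedged, end-to-end sketch of the XGBoost part: the file names, hyper-parameter values and number of boosting rounds below are placeholders, and you can swap the objective between rank:pairwise, rank:ndcg and rank:map to get the different behaviours discussed above:

    import itertools

    import xgboost as xgb
    from sklearn.datasets import load_svmlight_file

    # Placeholder file names for the MSLR train/test splits.
    X_train, y_train, qid_train = load_svmlight_file("train.txt", query_id=True)
    X_test, y_test, qid_test = load_svmlight_file("test.txt", query_id=True)

    def group_sizes(qids):
        # XGBoost's ranking objectives need to know how many consecutive rows
        # belong to each query; rows for a query are contiguous in the MSLR files.
        return [len(list(group)) for _, group in itertools.groupby(qids)]

    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtrain.set_group(group_sizes(qid_train))
    dtest = xgb.DMatrix(X_test, label=y_test)
    dtest.set_group(group_sizes(qid_test))

    params = {
        "objective": "rank:ndcg",   # or "rank:pairwise" / "rank:map"
        "eta": 0.1,                 # shrinkage / learning rate, guards against overfitting
        "max_depth": 6,             # depth of each individual tree
        "eval_metric": "ndcg@10",
    }

    model = xgb.train(params, dtrain, num_boost_round=100,
                      evals=[(dtest, "test")], verbose_eval=10)

    scores = model.predict(dtest)   # one score per document
    xgb.plot_importance(model)      # which features drove the ranking (needs matplotlib)

The predicted scores only mean something relative to other documents for the same query, so at serving time you sort each query's documents by its own scores to produce the ranked list.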
And finally, just a little graph to show that the algorithm improves, so the NDCG gets better, as you go from using linear regression to RankNet to LambdaRank and then to the decision tree method, LambdaMART. So, you remember that NDCG was a number between zero and one, and one is perfect: we've got everything in the right order. Okay, so we started by defining our problem of learning to rank, we looked at the existing methods and then we talked about how to develop a feature vector. Feature vectors aren't just useful for learning to rank; you'll find them everywhere in machine learning. We went on to look at our point-wise, pair-wise and list-wise methods, and finally we looked at some implementations in XGBoost. So if you want to get in touch, I am AXOFWATS on Twitter and I am Sophie at Red Hat. Thanks.

Hi, thanks for the talk, I found it very interesting. I have two questions, one is more technical and the other is more, well, I don't know how to characterise it. The first one is in terms of the document features: I didn't get how to extract them, because I don't know if you used some kind of word2vec or GloVe or some other embeddings, because the length of a query isn't supposed to correlate with anything. I mean, there are different queries which have the same length, and also queries which are equivalent to other queries of a different length, and, I mean, I don't know if that's clear, so I had that doubt.

I agree, yes, you certainly wouldn't think, and it's probably not the case, that the length of the query is going to affect your machine learning model, and certainly we could go back to that plot I showed where you saw how important all the features are, and we could have a look, and the length of the query probably didn't have much impact. However, I think what I wanted to show is that we want an extensive list of features, and it is machine learning, so sometimes you get a surprise about what can have an influence, but I agree, I don't think the length of the query has much impact. In general, I'm just throwing lots and lots of features into those feature vectors because it's not very expensive, but the ones you would focus on more, and you'd have more of, are the ones that depend on both the query and the document together.

Okay, thanks. The second one is a more generic one, which is that I find that ranking systems try to optimise for advertising, and that they can be optimised through clustering, through different profiles, so that the weights of the embeddings, I mean, the context, change and it's more personalised to the different profiles. So I don't know if there's some research in this area, because I find that Google and all those engines are trying to maximise advertising instead of maximising what we get from the information, and which information do we get in order to know more?

Yeah, I completely agree. So my last project at Red Hat was building a recommendation engine, and when I started on this project I thought, oh, there's a lot of similarities, because you're right, when things are returned to us in a list online now, it certainly takes into account those personalised recommendations. There is some literature out there, people have been doing work on it, and, yeah, it's something that we are thinking about looking at. So drop me a line and I can point you in some directions. No worries.

Hi, thanks for the talk. It was really great to see the different approaches. I have one question about how you got the labelled data, or something along those lines.
So you're saying that for these algorithms you can take a previously labelled data set, which is sometimes quite difficult to do, because it takes a lot of effort to have a human look at the query and the document and come up with some relevance score. Are there any approaches you can think of, or know about, where you can do this based on user interactions with search results? Of course, the tricky thing is, if you want to change your ranking algorithm, that's going to change the order of the search results into one that's never been presented before, so it's hard to know exactly how that will perform unless you actually go to production with it. So, yeah.

Yeah, absolutely, a great point. No one wants to hand-label 10,000 queries' worth of documents. Yeah, there is some work out there about how to rate them automatically. You could do things like looking at retention rates: when people have actually clicked on an item, how long have they spent on it, and you would use that to score it higher, and you could use that score as the truth, as the ground truth. There are positives and negatives. You get back into the crowd mentality, and something that isn't the best can continue to be upvoted, and so on. There is some research out there, but it's not something that we've looked into yet.

Looks like that's it. Thanks for turning up, guys.