Hello! Thank you for hanging out so late on a Saturday night when there are lots of other things you could be doing in Bernal or at this conference. I appreciate you turning up. So today I want to talk to you about learning to rank. Raise your hand if you've ever used a search engine and not found what you were looking for within the first few items returned. Okay. And then perhaps you go on and amend that query, and all of a sudden the results get even worse. Next thing you know, you find yourself clicking on that dreaded button that takes you to the second page of Google. Contemporary search engines are often good at finding all of the results for a given query, but less good at identifying which ones are the best and should therefore be returned first. And search engines are everywhere. We're not just talking about Google and Bing here; they're just a subset of the bigger class of search. Looking at this sample of places where you'd find search, you can see that good search functionality is going to lead to happy customers and better business revenue. So the task of ranking documents quickly and accurately is very important.

It was at a conference last year that I first started learning about learning to rank, and from there I've gone on to learn a little more. Today I want to introduce the topic to you. I'll tell you about the things I learned, and I'll point you to the resources you need to get started and implement these algorithms in practice. We'll look at the standard machine learning algorithms for learning to rank, and we'll look at implementing them in Python, specifically using the XGBoost module.

To begin with, we're going to outline the problem at hand, so we're all on the same page. The first thing we have is a query. I'm assuming that this query is text-based, so it's going to be the phrase or the words that you type into the search box. In the Netflix case it might be "Christmas films", or "Christmas films starring Hugh Grant". Everyone happy? Sweet. In addition to the query, we've also got a set of documents. Each document might be a film title, a blurb, an abstract, a list of actors; it might be a web page, or it might be a standard issue document. There's likely to be some sort of structure in that document, so I'm assuming it's got a title and a subtitle, and that it's probably comprised of sections. There's also likely to be some data-specific structure in there: if we're thinking about web pages, I'd probably count the URL as part of the document. So I'm using this term quite generally.

The goal of learning to rank is, unsurprisingly, to rank the set of documents. Any learning to rank algorithm aims to return, for a given query, the set of relevant documents in order from best to worst. An age-old method for ranking would be full-text search in a database. In this case your documents would probably live in a Postgres database, and as we mentioned, they've got this fixed structure: title, subtitle, paragraphs and so on. Postgres full-text search looks at where within the document the terms in the query appear (here I've circled them), and that is used to rank documents from best to worst. Something that's nice about Postgres full-text search is that the user can do some tuning. If you know, for example, that in your case the words in the query appearing in the subtitle is for some reason more important than them appearing in the title, you can indicate that.
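As a rough illustration of that kind of tuning (a minimal sketch only; the documents table and its columns here are hypothetical, not from the talk), Postgres lets you attach weights to different parts of a document with setweight and then rank with ts_rank:

```python
# A minimal sketch of weighted Postgres full-text search.
# The "documents" table and its columns are hypothetical.
weighted_search_sql = """
SELECT title,
       ts_rank(
           setweight(to_tsvector(subtitle), 'A') ||  -- subtitle weighted highest
           setweight(to_tsvector(title),    'B'),    -- title weighted lower
           plainto_tsquery('christmas films')
       ) AS score
  FROM documents
 ORDER BY score DESC;
"""
```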
But that tuning is going to have to be done on a query-by-query basis, and it's going to need some human intervention. So it's fine for some examples, but it's not really going to work on large databases or more complex queries. Another standard ranking algorithm is PageRank. PageRank is great for many things, but sadly it often downweights pages that have few hyperlinks pointing to them. The algorithm also uses the wisdom of crowds to learn and to make improvements, and I think we can all think of recent examples where the wisdom of crowds is not a good thing, where a crowd making a bad decision leads to more people making a bad decision. So what we really want is an algorithm that is general, that is robust, and that doesn't need any query-specific tuning. That's why we're going to need to think about machine learning techniques.

All of the machine learning techniques begin the same way: for a given query-document pair, they try to capture the overlap between the two. This is done by making a feature vector, which is a numeric vector of values that holds information about those two items. So how do we get this feature vector? Well, there are going to be some features that depend on both the query and the document. These are known as query-dependent features. They might be things such as how many words in the query appear in the document, or how far into the document we have to count before we hit a word from the query. There's a nice statistic known as IDF, or inverse document frequency, which is often used to measure these kinds of things: how often words appear in different parts of the document. We've then got some query-independent features, which depend only on the document. These are things like the length of the document, the number of times any user has accessed that document, or the number of hyperlinks in the document. For different applications, some or all of these features may be relevant. The final set of features are query-level features. These are things like the length of the query, and they depend just on the query itself. Actually deciding what to put in these feature vectors is a very hard problem, and it could be ten different talks. So for the purpose of this talk, I hope we can all just assume that we have a nice feature vector that holds this information and we are good to go (I'll show a toy sketch of building one in a moment).

Once we have our set of queries and documents and we've made our feature vectors, we're nearly ready to learn to rank. All that's missing is a score which captures the relevance of that document for that query. This score is going to be human-provided, so we're sadly going to need somebody to go through the set of potential training queries and documents that we have and rate how good each of these items is. So this is a supervised machine learning technique. There are some ways to compute those scores on the fly, but we're assuming that someone has gone through and given us a training set, saying hey, this document is good for this query, this one is not, and so on.

So the first class of ranking algorithms I want to talk about is pointwise ranking algorithms. They work by taking each document one by one and, using the feature vector for that query-document pair, predicting a relevance score: how relevant that document is for that query. Once this has been repeated over all the documents, we're able to order the documents by their predicted relevance from best to worst. Thus we have our ranking and we're done.
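Here's that toy sketch of assembling a feature vector. The specific features and their definitions are mine, for illustration only; they're not from the talk or any particular engine:

```python
import math

def feature_vector(query, document, corpus_size, doc_frequency):
    """Toy query-document feature vector. All features are illustrative."""
    query_terms = query.lower().split()
    doc_terms = document.lower().split()

    # Query-dependent features: how the query overlaps with the document.
    overlap = sum(term in doc_terms for term in query_terms)
    first_hit = min(
        (doc_terms.index(t) for t in query_terms if t in doc_terms),
        default=len(doc_terms),
    )
    # Sum of inverse document frequencies of the matched terms.
    idf_sum = sum(
        math.log(corpus_size / (1 + doc_frequency.get(t, 0)))
        for t in query_terms if t in doc_terms
    )

    # Query-independent features: depend only on the document.
    doc_length = len(doc_terms)

    # Query-level features: depend only on the query.
    query_length = len(query_terms)

    return [overlap, first_hit, idf_sum, doc_length, query_length]

# Example usage; the corpus statistics passed in are made up.
fv = feature_vector("christmas films",
                    "films to watch this christmas",
                    corpus_size=1000,
                    doc_frequency={"christmas": 40, "films": 200})
```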
So the missing piece of this puzzle is the model, which is a function that takes us from that feature vector to the score, the relevance. For the purpose of this first illustration I've simplified and assumed we've only got one feature, but you can imagine this being multi-dimensional. Now, we don't know what the best function, our model, will look like, but we don't have to. Machine learning is the process of automatically generating that function given heaps of examples. There are many machine learning techniques, and arguably the simplest involves learning a way to add and multiply these features together to give us a relevance score. You may remember linear regression from a statistics class, and that's all we're doing here. Linear regression in this case gives us a model, a way to turn our features into a relevance score. And we also have a way to evaluate the quality of the model: for the features where somebody has gone through and given us the true score, we can compare the prediction we get from the model to the truth that was given. So once we have our model, we can take a new feature vector, pass it into the model and get a score. There are lots of other ways we could have learnt this function: we could get a different model by learning a curve, and I'm sure you can think of your favourite regression type to give you a different model.

Given that there are a few possible models, how do we compare them? Well, since we're working with a training set where some human has given us the true score, we can just compute the difference between the truth and the prediction. That gives us a way to evaluate our model. That's fine, but it only really captures part of the picture, because what we care about isn't actually the scores. When you go on to Google you never see those scores returned to you, right? All we really care about is the order of the documents. So here we're going to look at an alternative distance metric we could use, one which takes into account the ordering of the documents rather than the score. To begin, we take the predicted scores and use them to rank the documents. I'm then going to replace the scores with the ranks: the item on the left becomes rank one because it was ranked the best, the next one down is two, the next one down is three, and so on. Now, we know the truth in this case because those scores have been given to us, so I put the true ranks underneath. This distance metric is just going to be how many permutations we need to make in order to get the true ranking in order. So we're going to go from the top line being in order to the bottom line being in order. One, two, three, right? So the distance under this metric is three.

We've now seen two different ways of evaluating the model: one looked at the scores and one just looked at the order of the returned documents. There are two things I want to call out here. Clearly, just looking at the scores doesn't answer the question we care about: we care about the ranking of the documents. But just looking at the ordering arguably isn't perfect either, because I care much more about the first two documents being out of order than I do about the last two. If we think of a search engine, when I put something into Google, I care much more about the first page being nicely ordered than I do about what's happening on page 50. I hope I never have to see it, so we don't mind so much.
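Here's a minimal sketch of the pointwise linear regression idea from a moment ago, using scikit-learn's LinearRegression. The training numbers are made up, with one feature per document to match the one-feature illustration:

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up training data: one feature vector per query-document pair,
# plus the human-labelled relevance score for each pair.
X = np.array([[5.0], [2.0], [9.0], [1.0]])  # one feature, as in the illustration
y = np.array([3, 1, 4, 0])                  # true relevance labels

model = LinearRegression().fit(X, y)

# Score some new documents for a query, then order them best to worst.
X_new = np.array([[4.0], [7.0], [0.5]])
scores = model.predict(X_new)
ranking = np.argsort(-scores)  # document indices, best first
```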
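And here's one way you might compute that permutation-count distance: counting the adjacent swaps (a bubble-sort count, which is the Kendall tau distance) needed to put the predicted ordering into the true order. The example reproduces the distance of three from above:

```python
def swap_distance(predicted_order, true_scores):
    """Count adjacent swaps needed to sort documents (given in predicted
    order) into their true best-to-worst order. Toy implementation."""
    # True relevance of each document, listed in the predicted order.
    scores = [true_scores[doc] for doc in predicted_order]
    swaps = 0
    for _ in range(len(scores)):
        for j in range(len(scores) - 1):
            if scores[j] < scores[j + 1]:  # out of order: better doc sits lower
                scores[j], scores[j + 1] = scores[j + 1], scores[j]
                swaps += 1
    return swaps

# Predicted order: doc "a" first, then "b", "c", "d"; true scores made up.
print(swap_distance(["a", "b", "c", "d"],
                    {"a": 1, "b": 3, "c": 0, "d": 2}))  # -> 3
```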
So the distance metric that is commonly used in learning to rank is called normalised discounted cumulative gain, and it takes all of this into account. NDCG, as I'm going to call it, is a value between 0 and 1, where 0 denotes that the list is in a terrible order. It couldn't be worse. That's a disaster; we don't want that. And 1 means it's perfectly ordered, thank you very much. It takes into account both the order of the documents and the score of the documents, and it penalises incorrect ordering near the top of the list more than it does near the bottom. In order to compute this metric, we need three things: the predicted rank, the true rank, and the true score. In general it takes the ranked list and associates a weight with each of the returned documents. The weights are returned as this ratio, and here's an example of how we would compute it: we sum it up and get a distance metric. Then, in order to normalise it, we divide by the ideal discounted cumulative gain, which is what you would have got if the list had been returned in perfect order. Normalising is really useful because it enables you to compare across different queries and across different data sets. Another thing that's really nice about NDCG is that you can truncate it and still get a good result. For example, we could say that what we really care about is the top 10 items being in order. That's reasonable. So we just compute it at 10: we look at the top 10 items and get that result. I'll show a little sketch of this computation in a moment.

Okay, so I motivated the use of NDCG by saying that we don't only care about the predicted scores, we also care about the order of the returned documents. But that pointwise linear regression method we were using before was just predicting a score, and that's not really all we care about. We care about the order as well. So, taking this idea one step further, the next class of ranking algorithms I want to talk to you about is pairwise ranking. These capitalise on the fact that we care about the order: they take in pairs of feature vectors and estimate which of the pair is more relevant.

The first pairwise ranking method we'll talk about is RankNet. If you care, it's a neural net, and it has many more parameters than the regression model we saw earlier. Hence it's more flexible, but that also means it's more complicated. During training it takes in pairs of feature vectors and learns the function so that the document which is more relevant is given an increased predicted score and the one that's less relevant is given a lower score. Now, because this model training pushes relevant documents up and poor documents down, RankNet is good at getting documents into the correct region of the list, but not necessarily at ordering them well within that region. So, for example, it's great at ensuring that the top document is in, say, the top 1% of items returned, but not necessarily right at the top. Furthermore, this method works by considering pairwise error. It asks: is document A more relevant than document B? If yes, we push A up and B down. And it doesn't take into account whereabouts in the ranking that pair is, so it would push both the top pairs and the bottom pairs by equal amounts. But earlier we said we care less about what's happening at the bottom of the list. So is there anything better we can do? Well yes, LambdaRank does better.
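Before we come to LambdaRank, here's that small sketch of the NDCG computation, using the common gain formula (2^relevance - 1) discounted by log2(position + 1); passing k gives the truncated NDCG@k:

```python
import math

def dcg(relevances, k=None):
    """Discounted cumulative gain of true relevance labels, given in the
    order the documents were returned."""
    relevances = relevances[:k] if k else relevances
    return sum(
        (2 ** rel - 1) / math.log2(pos + 2)  # pos is 0-based, so +2
        for pos, rel in enumerate(relevances)
    )

def ndcg(relevances, k=None):
    """Normalise by the DCG of the ideal (perfectly sorted) ordering."""
    ideal = dcg(sorted(relevances, reverse=True), k)
    return dcg(relevances, k) / ideal if ideal > 0 else 0.0

# Documents in predicted order, with made-up true labels; NDCG@3.
print(ndcg([3, 2, 3, 0, 1], k=3))
```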
So LambdaRank is doing the same thing as RankNet, but it scales the changes in the weights, the size of the push, in proportion to the change in NDCG that swapping the pair would produce. If we go back to this picture: RankNet was pushing all of these documents equally, whereas LambdaRank aims to give a smaller push to the ones down the bottom, because we don't care so much about those. And later we'll see that LambdaRank does much better than RankNet in practice.

So the final class of algorithms we're going to look at is listwise algorithms. We saw pointwise, where decisions were made on an item-by-item basis. We saw pairwise, which trains itself by taking in pairs of documents and making decisions. And, unsurprisingly, listwise is going to take in the whole list of documents at once. My favourite listwise algorithm is LambdaMART, and it uses decision trees to score, and therefore to rank, the documents. As a general introduction to decision trees: they're effectively a map that splits up the dataset at each node based on some feature. For example, we might send every document with a feature vector where feature one is less than 10 to the left, and everything where feature one is greater than 10 to the right. These splits continue until every document ends up on its own at a leaf node at the end. The algorithm itself is learning which features to split on. In this picture, we've seen an example using just one decision tree, but in practice the method actually uses an ensemble of trees. We have multiple trees, and each tree splits on a different collection of features. By combining them we get a model that is much more robust, a model that's greater than the sum of its parts.

So those are really the three main types of algorithm, and hopefully you've got some intuition as to what they're doing. Now I'll just point you to the relevant Python libraries and show you how simple it is to actually implement these in practice. In order to do that, we're going to need a dataset. The dataset we're going to use is the Microsoft Learning to Rank dataset, MSLR-WEB10K. I've selected it because it's open source, there are 10,000 queries in there, and many more documents. So there's loads of data to play with. Each document-query pair has been given a relevance value between zero and four; that's the score. Four denotes "really relevant" and zero denotes "no, we don't want that, thanks". We'll have a little look at the schema in detail in a moment. There are about 2.6 million rows of data in there, so there's enough for you to mess around and do something dangerous.

If we take a look at one of the rows, it looks like this. The first number here is the relevance; these have been labelled by a human. Next, we've got our query ID. The query ID in this case is 13, and it's preceded by "qid" and a colon in all cases. Everything after that is the feature vector, which is of the form: index of the feature, then a colon, then the feature value. So here, feature number one takes value two, and feature number 136 takes value zero. In order to use this dataset with the Python library straight out of the box, we want to clean the data up a bit: we want to remove the colons and everything before them, so we're left with just the raw feature vector itself and the raw query ID. I'll point you to a little bash script that does that later, so you can try it for yourself.
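In place of that bash script, here's a Python sketch of the same clean-up, assuming the standard layout of the MSLR rows (label, then qid:<id>, then index:value pairs):

```python
import numpy as np

def parse_line(line):
    """Parse one MSLR-style row: '2 qid:13 1:2 2:0 ... 136:0'."""
    parts = line.split()
    relevance = int(parts[0])
    qid = int(parts[1].split(":")[1])    # drop the 'qid:' prefix
    features = [float(p.split(":")[1])   # drop the 'index:' prefix
                for p in parts[2:]]
    return relevance, qid, np.array(features)

rel, qid, x = parse_line("2 qid:13 1:2 2:0 3:1 136:0")
```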
Now, the first algorithm we talked about was linear regression. It's always good to have a baseline to compare against when you're using more complicated machine learning techniques. You can run linear regression in Python in loads of different ways, but what I've done here is assume that we have our training set, which holds our data. We set up a linear regression, we tell it which features it's going to take in and use for prediction, and we've got the score values. Then you just fit it and call the predict function. Nice and simple.

All of the other algorithms are implemented in the XGBoost module. XGBoost is an open source, optimised, distributed gradient boosting library. Why do you care about that? Well, when we were talking earlier about the parameters of the model and learning the model, what's actually happening is that some weights are being updated in those models. XGBoost is fast and works nicely for this. The way XGBoost works is that you pass a parameter vector in. Max depth, for the algorithms that were neural networks (the ones that were sort of sideways), is the number of layers we have in our neural net, the number of circles we have going sideways; when we're talking about the decision trees, max depth controls how deep each tree can grow. Eta is the learning rate, and it helps prevent overfitting of the model. You never want your model to just regurgitate the training data set that it saw; you want it to be able to do better than that. So by tuning eta, you help ensure that you're not just overfitting the data. But the key important bit is the objective function, and that's where you set which algorithm you're using: RankNet by setting rank:map, LambdaRank by setting rank:pairwise, and LambdaMART by setting rank:ndcg. You can then go along and train a model using the XGBoost train function very simply, and prediction is equally straightforward. I have a little notebook that has some of this stuff in, so you can go ahead and try it out, mess around, and let me know if you hit any errors or if you like it.

Something nice about the XGBoost module is that it lets you visualise these decision trees. Here we see an example of one. You can see that it first split on feature 65: it said if the feature is less than some value, we'll send the document to the left, if not, to the right, and so on. It's always nice to be able to visualise what your model is actually doing. Another thing it does is tell you which features were most important. It's a nice sanity check to be able to say, you know, I would expect that the query terms matching strongly with the title of my document is really going to have a positive impact, and we'd like to see that. So this tells you how important all these features are. For the two features at the top, 134 and 108, you'll have to take my word for it, but 134 is the frequency of the terms in the query appearing in the title of the document. That sounds good. And 108 is the number of outlinks, or hyperlinks, in the document. Well, that sounds good as well, because we know that algorithms like PageRank work well in cases where there are such hyperlinks. And finally, we can have a look at the NDCG. That was our distance metric, and we said that one means your list is returned perfectly ordered and zero means it's ordered terribly. You can see that as we move through the different models we've looked at today, the algorithms give better and better search results.
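Pulling the XGBoost pieces together, here's a minimal end-to-end sketch. The toy data is made up; the essential ranking-specific step is set_group, which tells XGBoost how many consecutive rows belong to each query:

```python
import numpy as np
import xgboost as xgb

# Toy data: 6 documents across 2 queries (3 documents each), 4 features each.
X_train = np.random.rand(6, 4)
y_train = np.array([2, 0, 1, 3, 1, 0])  # relevance labels
dtrain = xgb.DMatrix(X_train, label=y_train)
dtrain.set_group([3, 3])                # documents per query, in order

params = {
    "objective": "rank:ndcg",   # or "rank:pairwise" / "rank:map"
    "max_depth": 6,             # how deep each tree may grow
    "eta": 0.1,                 # learning rate
    "eval_metric": "ndcg@10",   # evaluate on the top 10 results
}
model = xgb.train(params, dtrain, num_boost_round=10)

# Prediction is equally straightforward: higher scores rank higher.
X_test = np.random.rand(3, 4)
scores = model.predict(xgb.DMatrix(X_test))

# The visualisations mentioned above (these need matplotlib installed).
xgb.plot_tree(model, num_trees=0)
xgb.plot_importance(model)
```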
So I hope this has motivated you to start learning a bit more about learning to rank. We started by looking at some simple existing methods that don't use machine learning. We talked about feature vectors, which are present in all machine learning techniques. And then we went on to look at the different types of algorithms and how to implement them with XGBoost. Please get in touch: I'm Salfwatts on Twitter and GitHub, and I'm Sophie at Red Hat. And do reach out, because there's not enough women in tech. Thank you.