Hello, hello, OK. So, yes, hi, my name is Alan Woodward. I'm a committer on Lucene and I work for Elastic; you can see I've got my T-shirt, fully branded. And I'm going to talk about super-speedy scoring in Lucene 8. Those of you who were in the previous talk will already have a general overview of how this works, and I'll go into a bit more detail. So we start off at the lowest level of Lucene, how documents are actually collected. We have a class called a Scorer, and a Scorer has two jobs. The first job is to provide an iterator which will go and find and iterate over all the matching documents for your particular query. And then it will also give each of those documents a score. It's always in the same order: we're always going in terms of increasing document ID. And as we score everything, we have a priority queue which says, OK, I want the top 10 matching documents, so I have a priority queue of depth 10. As we go through each document, we'll generate a score for it. If it's going to be competitive, we insert it into the priority queue. If it's not competitive, we can ignore it. So, just as an example, let's say we've got document IDs from 0 to 7. As we go through each one, we'll say: here's document 0, we know this matches, we're going to give it a score, the score is 2.5. Fine. The next one gets a score of 1.6, then 2.5, 0.5, and so on as we go. And if we're going to collect, say, the top five documents, then our top documents here will be 0, because that's 2.5, and then 2, because that's also 2.5, and then 1, and then 7, and then 5. Now, if we were only going to collect the top three documents, you'll notice that we actually fill the priority queue immediately, and then everything else we're going through, we're scoring, and that's just a waste of CPU cycles. None of this information is going to get into our top n.
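The priority-queue collection described above can be sketched as follows. This is an illustrative Python sketch, not Lucene's actual Java code; the scores are the ones from the example, with later-document scores invented to match the stated top-five ordering.

```python
import heapq

def top_k(scored_docs, k):
    """Collect the top-k (docid, score) pairs.

    Ties on score are broken by lower docid, matching the secondary
    sort on increasing document ID described in the talk.
    """
    heap = []  # min-heap of (score, -docid): least competitive entry on top
    for docid, score in scored_docs:
        entry = (score, -docid)
        if len(heap) < k:
            heapq.heappush(heap, entry)
        elif entry > heap[0]:   # competitive: beats the current minimum
            heapq.heapreplace(heap, entry)
        # else: the scoring work was wasted -- this is the waste that
        # block-max skipping avoids
    return [(-neg, score) for score, neg in
            sorted(heap, key=lambda e: (-e[0], -e[1]))]

# Doc IDs 0..7; scores for docs 4-6 are assumptions for illustration
scores = [2.5, 1.6, 2.5, 0.5, 0.2, 1.1, 0.3, 1.6]
top_k(list(enumerate(scores)), 3)  # → [(0, 2.5), (2, 2.5), (1, 1.6)]
```

Note that with k=3, the final 1.6 (doc 7) ties the queue minimum but loses on document ID, exactly as described next.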
And particularly, even this one at the end here, 1.6, has the same score as this one here, but because it has a higher document ID, and our secondary sorting mechanism is always that the lower document ID wins, we're still going to ignore it. So this means that if we can break things up a bit, then we can skip over large chunks of the index. For a constant-score query, for example a match-all-documents query, we could actually just stop collecting after we fill the priority queue, which would just be the first 10, and then you could stop, you could shortcut. That's fine for a constant-score query, but obviously for queries that we're scoring using term weighting, we need a slightly more complicated mechanism. And the mechanism we use is that we chop things up into blocks. So here we've chopped it into four blocks, and then we record the maximum score for each block in the index. Again, when we're moving through, we start off here, we collect our documents: we've got 2.5, we have 1.6, we have 2.5, we have 0.5. We know that our minimum competitive score is 1.6. We get to the next block, and we know that the maximum possible score in this block is 1.1. That's not going to fit into the priority queue, so we can skip straight over. We come to the next one: the maximum score is 1.6. Again, we know that's not going to fit into the priority queue, so we can ignore it and keep moving. And in this way we can short-circuit: we don't have to calculate scores for any of these documents, which speeds things up. Now, we don't really want to store the actual score in the index, because, as Diego said in his question at the end of the previous talk, you might want to change your scoring mechanism.
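The block-skipping walk just described can be sketched like this (again an illustrative Python sketch, not Lucene code; the block boundaries and maxima follow the example above):

```python
import heapq

def collect_with_block_max(blocks, k):
    """blocks: list of (block_max_score, [(docid, score), ...]) in docid order.
    Returns the top-k hits plus how many documents were actually scored."""
    heap, scored = [], 0
    for block_max, docs in blocks:
        # Once the queue is full, its minimum is the score to beat.
        min_competitive = heap[0][0] if len(heap) == k else float("-inf")
        if block_max <= min_competitive:
            continue  # whole block skipped: nothing in it can be competitive
        for docid, score in docs:
            scored += 1
            entry = (score, -docid)
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry > heap[0]:
                heapq.heapreplace(heap, entry)
    hits = [(-neg, s) for s, neg in sorted(heap, key=lambda e: (-e[0], -e[1]))]
    return hits, scored
```

With the four example blocks and a queue of depth 3, only the first four documents ever get scored; the blocks with maxima 1.1 and 1.6 are skipped outright.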
So instead, what we can do is have a look at how we calculate these scores. The standard scoring in Lucene is obviously BM25, but for any scoring that's based on term weighting, you're going to be able to split the factors that contribute to the score into two parts. You have parts which are global for the index: things like the total term frequency, that's the number of times the term appears in the index as a whole, and the average document length, which is used in BM25. Those are fixed for the entire index; they don't depend on which document you're looking at. And then there are two per-document contributions: we have the document length and we have the frequency of the term. For document length, Lucene doesn't actually store the full length; it stores something called a norm, which is a kind of compressed version of that document length. Uwe mentioned earlier that the encoding for that has changed in Lucene 8, and the main reason it changed is so that, I think, document lengths from 0 to 10 are now completely accurate, and above that you start to lose accuracy, because we're only storing a long value in a single byte. But we have this pair of values, the norm and the frequency, and we call that an impact. Now, instead of storing a calculated score in the skip lists, we can store this impact, this pair of values. Obviously this comes with a trade-off, and the trade-off is that we need the similarity to behave in a certain way when this pair of values changes. So our first constraint is that if the term frequency increases, the score must increase. This is because when we calculate a maximum score, we're calculating it from the maximum term frequency, so you need to make sure those two correlate.
And then similarly with the norm: if the norm increases, then the score must decrease. We're saying that, generally speaking, longer documents are less useful than shorter documents in terms of the contribution of an individual term's score. So we're making that an official constraint on the similarity: if the norm increases, then the score must decrease. And if you have a similarity that adheres to these two constraints, then you can use this new block-max functionality. So let's have a look at some individual queries and see how this new scoring mechanism impacts them. A term query is the absolute basic query in Lucene, and all it does is provide those two methods on the scorer. You have the iterator, and the iterator basically just goes over the postings list for a term. So you have your index; your index is a big, long, alphabetically sorted list of terms, and against each of those terms you have a list of the documents in which that term appears. If you want to find which documents match a term, you look it up in the terms dictionary, you find the term you're looking for, and then you just read off the postings list, and that gives you your documents in order. So when we're dividing things up into blocks, we can say: OK, given that at the beginning of this block we know the maximum term frequency and the minimum norm (thank you), we can calculate what the maximum score for this particular block will be. As we're iterating, instead of saying give me the next document, you say give me the next document whose score is more than X. And underneath it can go: well, I've got to this block here, I can extract the impacts from this block, and I can calculate the maximum score that this block could possibly give us.
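The two monotonicity constraints, and how a block's upper bound falls out of its stored impacts, can be sketched with a simplified BM25-style formula. The k1, b, idf, and average-length values here are assumptions for illustration, and the impact here carries a raw length rather than Lucene's single-byte norm encoding:

```python
# Illustrative BM25-style scoring (parameters chosen for the sketch)
K1, B, AVG_LEN = 1.2, 0.75, 100.0

def bm25(freq, length, idf=2.0):
    """Score rises with freq and falls with document length."""
    return idf * freq * (K1 + 1) / (freq + K1 * (1 - B + B * length / AVG_LEN))

def block_max_score(impacts):
    """impacts: the (length, freq) pairs recorded for a block.
    Because the score is monotonic in both values, the block's upper
    bound is the best score over the recorded impacts."""
    return max(bm25(freq, length) for length, freq in impacts)
```

The `bm25` function satisfies both constraints (higher frequency gives a higher score, a larger length gives a lower score), which is exactly what makes the per-block upper bound safe.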
If it's not competitive, I can skip over it entirely and go and decode the next block. And as Uwe discussed earlier, and as Paul was talking about in his talk about an hour ago, we already have a data structure in Lucene called a skip list, which allows you to advance quickly over a postings list. So we can just encode this information into the skip list; it's already there. There aren't actually a lot of changes that need to be made to the data structure, or indeed to the indexing code. As well as not visiting all the documents and not scoring all the documents, this also means we don't have to decode all the documents. When you advance through a postings list, what normally happens is you get to the next block, it's compressed, so we need to decompress it and extract all the document IDs from it. But if we know that we don't have to read any of the data from this block, because the skip list is telling us there aren't any competitive scores in it, we can avoid all of that CPU work, and lots of that IO work as well. So we can actually get an incredible speed-up for term queries. In this particular case, we have a set of benchmarks which run every night against a Wikipedia corpus with a fairly standard query set, and the term query in particular was running at a nice 33 queries per second before, which is perfectly reasonable. It jumped up overnight to 1,150 queries per second, which is a fairly significant increase, I think we can all agree on that one. So that's pretty cool.
For constant-score queries, as I was saying earlier, if we're not interested in counting the total number of hits, we can shortcut: because we know that everything's got the same score, and because the secondary sort order is by increasing document ID, the first 10 documents that match are going to be the top 10 scoring results, so we can just return them straight away. For match-all-docs query I don't have proper numbers, because match-all-docs query isn't part of the standard Lucene benchmarks. But Elasticsearch has its own set of benchmarks, and we noticed there that match-all-docs query had actually slowed down after we implemented all this block-max scoring. It turned out this was because it was using a bulk scorer which didn't take any of this into account, and once we replaced that bulk scorer with the standard scorer for the case where you don't need to collect all documents, match-all-docs query sped up by, I think, something like 4,000%. So that's another useful one. Wildcard and fuzzy queries tend to get rewritten to constant-score queries, because when you're expanding a wildcard or a fuzzy query out to all the different terms that could possibly match, it's not obvious how you would otherwise combine the term frequencies of all those matches. So we tend to just replace it with a constant-score query and use any other terms that you have in the query to provide the term weighting. So wildcard query, fuzzy query, prefix query, the general automaton queries, can all use this mechanism as well. Prefix query, for example, jumped from 15 to 35 queries per second. That's not as big a jump as the term query, obviously.
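The constant-score shortcut is simple enough to show directly: with a uniform score and docid tie-breaking, the first k matching documents are already the top k, so iteration can stop there. A minimal Python sketch:

```python
from itertools import islice

def top_k_constant_score(matching_doc_ids, k):
    """With a constant score and increasing-docid tie-breaking, the first
    k matches are already the top k -- no scoring, no further iteration."""
    return list(islice(matching_doc_ids, k))
```

Because `islice` stops pulling from the iterator after k items, a match-all query over millions of documents only ever touches the first k.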
The reason for that is that for prefix queries, for these queries which are expanded into multiple terms, about half the query time is spent actually doing that expansion rather than collecting documents, so these big improvements at collection time don't have as significant an impact for these sorts of queries. Then, on top of the term query, the next important one is obviously the boolean query. So how are we going to combine all these different terms together? Boolean queries are scored by just summing up the scores of all the matching clauses on a particular document, so the maximum score is just the sum of the maximum scores of its sub-clauses. Again, we don't get as impressive a speed-up here, because by its very nature that's a less accurate upper bound: you're just summing up the maximum scores of your sub-clauses, so the fact that this clause matches here with a lower score and that one matches there with a higher one is lost. You miss quite a lot of detail by only taking the maximum. But still, for high-frequency term conjunctions we go from 10 queries per second to 25 queries per second, which is, again, quite nice. An exact phrase query is basically just a conjunction with a bunch of extra constraints on it in terms of positions. They are slower than normal term conjunctions, so we're going from 4.5 to 7.5 queries per second, because most of the time here is actually spent on things like decoding and intersecting positions lists and making sure things are in the correct position. But still, it's a nice speed-up. The most interesting thing for boolean queries, though, is disjunctions, so x OR y. And here we can use something called WAND, for weak AND.
Adrien had a nice picture of a wand for all this, but I couldn't work out how to copy it over in time, so make do with a wand in your mind as opposed to on the screen. Thank you. So, let's say we have a query for "the" OR "fox". "Fox" is a fairly low-frequency term; "the" is an incredibly high-frequency term. Basically, every single document in your corpus is probably going to have the word "the" in it. Generally speaking, for a normal disjunction, you have to visit every single document that contains any of the terms. So for something like "the" OR "fox", in a naive implementation you're going to end up visiting every single document, because every single document is going to have "the" in it. The ones which have "fox" in them as well are obviously going to float higher up the list, because the maximum score of the term "fox" is going to be much higher than the maximum score of "the", since it has a much lower total term frequency. So what we can say is: as we're going through, we get to a new block. We've collected some number of documents, and we have a minimum competitive score of one, let's say. You look at the next block, and for each of these sub-clauses you work out what the maximum possible score for that sub-clause is. "The" is only ever going to give us a maximum score of 0.2, so if we have a document that only has "the" in it, we know it's not going to be competitive. Which means that instead of having a disjunction, we can convert it into a conjunction and say: all right, we need to find documents that contain both of them. And conjunctions are much faster than disjunctions, because you can skip, you can advance, you can avoid collecting all these documents using the skip-list structures that we discussed earlier. So, again, we get not a bad speed-up on this one.
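That WAND-style pruning can be sketched as follows. This is an illustrative Python sketch of the general idea, not Lucene's implementation, and it generalizes the two-term example: any set of clauses whose summed block maxima can't reach the minimum competitive score is non-essential on its own, so a competitive document has to match a clause outside that set.

```python
def essential_clauses(clause_max_scores, min_competitive):
    """WAND-style pruning for a disjunction: a document matching only
    clauses whose summed maximum scores stay <= min_competitive cannot
    be competitive, so it must match at least one remaining clause."""
    # Greedily absorb the cheapest clauses while their total stays
    # at or below the minimum competitive score.
    ordered = sorted(clause_max_scores.items(), key=lambda kv: kv[1])
    total, non_essential = 0.0, []
    for term, max_score in ordered:
        if total + max_score <= min_competitive:
            total += max_score
            non_essential.append(term)
        else:
            break
    return [t for t in clause_max_scores if t not in non_essential]

essential_clauses({"the": 0.2, "fox": 2.4}, min_competitive=1.0)  # → ['fox']
```

In the "the OR fox" example, once the minimum competitive score exceeds 0.2, every competitive document must contain "fox", which is what lets the disjunction behave like a conjunction.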
So, for a high-frequency OR medium-frequency disjunction like "the" OR "fox", we used to get six queries per second, and now we're getting 25 queries per second. So, again, a 300% speed-up. That's nice. In addition to these, we now have something called a feature query. Again, Uwe covered this briefly. What you can do is store various boost factors for a document, if you want to add in extra things like page rank, for example, or distances, or any number of scores that you want to ensure contribute somehow to your final score. As long as you know this conforms to the rule that if the feature value is higher, then the total score is higher, we can add it on as something called a feature query. The feature query will store that boost value as a term frequency, and it can then use one of these functions, a saturation function, a sigmoid function, or a log function, to calculate the maximum possible score that it's going to contribute for a given block. Because it can, again, look at the term frequency, it knows what the maximum boost value in this particular block is. That gives you a maximum score contribution, and you can then combine that as you would any other boolean clause. The nice thing about these three functions is that it's not just that they have this simple relationship, where as the boost value gets higher the resulting score gets higher; they're also bounded. In particular, saturation and sigmoid are bounded by one, which makes it much easier to reason about how your scores are going to change given a particular set of boosting factors. So, we have some fantastic speed-ups, but there are obvious trade-offs here. You never get anything for free.
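The three feature functions and the per-block bound can be sketched like this. A Python sketch with illustrative parameter defaults (the pivot k and exponent a are assumptions, not values from the talk):

```python
import math

def saturation(s, k=1.0):
    return s / (s + k)             # monotonic in s, bounded above by 1

def sigmoid(s, k=1.0, a=1.0):
    return s**a / (s**a + k**a)    # also monotonic and bounded by 1

def log_feature(s, a=1.0):
    return math.log(a + s)         # monotonic, but unbounded

def feature_block_max(max_boost, fn=saturation):
    """Because fn is monotonic, the per-block upper bound is just fn
    applied to the largest boost value stored in the block."""
    return fn(max_boost)
```

The boundedness is the point made above: with saturation or sigmoid, a feature clause can never add more than 1 to a document's score, however extreme the stored boost.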
So, the biggest one is, as I said, that we don't get accurate hit counts any more, and the most extreme case of this is that we stop collecting accurate hits as soon as we've filled our priority queue. As soon as we've got our top n, we're going to say: fine, we'll just stop there and skip on from there. But that means you can't page, and lots of people want to do paging. So what we can do is use the normal, old-style collection mechanism for the first 1,000 hits, say, and then start skipping after that. That means you can return to your user and say: OK, we can page; if we're doing 10 per page, we can give you the top 100 pages. And beyond that, you can say: well, we found at least this number of documents. Generally speaking, if someone wants to page through 100 pages, they should be doing a different query. Or it means they want to extract everything, in which case they probably don't care about scores. The other thing here is that this doesn't work with facets or aggregations, because facets and aggregations, by their nature, need to visit every single document. However, what you can do is do things in two passes. If you're doing a normal scored query where you want your top-n documents and you also want facets and aggregations, you're still wasting quite a lot of time by scoring all the documents that aren't in your top 10. So you can do a very, very fast first pass using the block-max method, and that gets you your top n. And then, if you're doing it on a webpage, for example, you can asynchronously send off a second query, which does your faceting and your aggregations but doesn't do scoring. So you get a slightly faster aggregation query coming back, and then a much faster top-n query.
You can do it asynchronously, so your top-n results come up and the facets might appear slightly later. Or you could do it all at the same time, and some experiments that we've been running suggest that it will probably be faster anyway to split these queries into two. I think I have three minutes left, so does anyone have any questions? Yes. So the question was: by default, you don't get your total hits. Currently in Elasticsearch, and the same in Solr, you do get total hits. So when Elasticsearch moves to Lucene 8, which we will do in Elasticsearch 7, coming at some point, what happens? Are you going to not get these improvements? The answer is: in Elasticsearch, it is going to change. By default, and Philipp, you can probably correct me if I'm wrong here, the top hit count is still going to come back, but the structure is changing. The total hits is no longer a number; it's an object, which has a hit count and then a relation that says either this is accurate or it's a lower bound. You can then opt out of that: there'll be a parameter on the search query which says give me the old style, so you can upgrade without having to change everything, and then opt in at a later point. But by default we're going to give you the speed-ups, because we think it's worth it. It's a breaking client change, yes, this is true. So the hits value on TopDocs is no longer an integer, or a long, wasn't it? Now it's an object, again with this long value and a relation saying either this is exact or it's a lower bound. Hello. There might be some cases where this approach hurts: if I have a lot of documents that have the same term frequency for that particular term, and I send a query with that term, then you're going to check for each block whether the term frequency is competitive, but it's always competitive.
It's not always competitive, because we do a next-up. So the question was: if you have a whole bunch of documents which all have the same score, say everything's got a single term in it and they all have the same score, does that mean that they're all competitive? And the answer is no, because once we've filled up the priority queue, we call this setMinCompetitiveScore function. That takes the score value and does Math.nextUp on it, so it increases it by the smallest possible amount, up to the next floating-point value. What if the scores increase all the time? So, yes, there are going to be adversarial situations. If you happen to have an index in which the term frequency gets higher with every subsequent document, then yes, you will probably find that things slow down. And there are certain circumstances, certain combinations of boolean queries at the moment, where things have slowed down. Not by a lot, but this adds enough overhead that the speed-ups you get otherwise are counteracted by it. But we know what they are, and we're working on them. Thanks a lot. Thank you.
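The next-up trick from that last answer can be sketched in one line. A Python sketch using `math.nextafter` (Python 3.9+) as a stand-in for Java's `Math.nextUp`:

```python
import math

def set_min_competitive_score(current_min):
    """Once the queue is full, the next hit must strictly beat the worst
    queued score; nudging by one ulp (what Java's Math.nextUp does)
    enforces that even when many documents share the same score."""
    return math.nextafter(current_min, math.inf)
```

So a later document whose score exactly equals the queue minimum is rejected without a tie-break, which is why a run of equal-scored documents is not "always competitive".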