Are you OK and starting? Hi, good afternoon. Hope everyone is seated or about to get seated. I'm going to talk about creating more relevant search results with Learning to Rank. My name is Nick Veenhof. You can see there's an AND and an OR together with Mattias Michaux; it looks like a query. The reason for that is that my girlfriend could call me at any moment to go to the hospital for the baby. So if I run away, apologies. Mattias is over there, so if you see him running up, he can take over. We both practiced this session, because I didn't know whether I was going to be here today.

Maybe a little bit of history about myself. I've been involved with the Drupal project for over 12 years now. I currently work for Dropsolid as the CTO. I started with Apache Solr as an internship with Acquia, porting the Drupal 6 module to Drupal 7. Then I got involved with Search API to merge the two projects, so it has been a pet project slash my job for quite a while now.

Now, before we can do any of this, we need to understand a few basic concepts of machine learning. I won't go into too much scientific depth, but let's get started with a little example. This is Drupal.org search, and I searched for "install apache". I'm a new user; I want to figure out how to run Drupal. Today there's this command to run Drupal, but in most cases you still have a LAMP stack. Yet for some reason, this is the first page, and the second result is actually about this session. That's not very relevant, because most likely, when you search for "install apache", you're searching for something like this. So how do you increase that relevancy? How do you figure out, in Drupal itself or in Solr, the backend component you can use, how to optimize this and push that result toward the top? Why was this session created? I went to FOSDEM at the beginning of the year.
FOSDEM is a massive open source conference in Brussels, sponsored and attended by people from Mozilla, Google, really big companies and small companies alike. There was a presentation there about Learning to Rank within Solr and Lucene. Around the same time, there was the case of a hospital in Belgium that came to us with a question: our search is really, really bad. It's made in Drupal, but we don't know how to improve it. And with that FOSDEM presentation came a little tool that you could connect to the Solr index, where you could click on the results that were relevant or not relevant. That's already a good start. So this is a search result as-is; ideally, these three float to the top. That's the basics.

Now, we have the wonderful Umami demo. I'll explain that later and hopefully also do a live demo. We imported a bunch of recipes, because the default Umami install doesn't come with enough content to actually do the learning. Our use case is "chocolate cake". I don't want the hazelnut icebox cake; I just want the regular chocolate cake as the first result, because most likely that's what we're searching for. This is a hypothetical example; you can translate it in your head to another real-life example.

So here's the current situation. If we search for "chocolate cake" in the Umami demo, we see the hazelnut one, the flourless one, et cetera. There's also a Halloween cake; maybe that's very relevant for this week. In the module that we'll explain later, you can rank these results from the site builder's perspective. It could also be done from the user's perspective; that's something a permission could solve. Today, this is not ideal. Let me also give you a bit of insight into the statistics, and we'll dive into what they mean in a bit.
Of everything that we rated as relevant (that's the third number you see on the screen), 95% of those results appear in the top 10. But for the top five, only 75% of what we rated as relevant actually shows up there. So in this session, let's try to improve that. Think of that chocolate cake that needs to float to the top, but then across all the results.

Now back to some theory. What is machine learning? If I asked 10 people in this room, I'd probably get 10 different answers, and that's totally valid. If I asked the same question about artificial intelligence, I'd get even more diverse answers. Machine learning is basically a subset of artificial intelligence where, based on training data, you try to achieve a certain result. Whatever is generated in the middle to get to that result is a model, without if/else statements explicitly performing the task. Sounds simple. Facial recognition and the CAPTCHA stuff, for example: that's all machine learning.

Traditional machine learning is a bit how I got into this concept. I also worked for Acquia in the past, on Mollom. Mollom was an anti-spam service: if you typed in some text resembling what the training data had deemed spammy, a classifier would give it a certain score and say, nope, you're probably spam. That's very traditional prediction-style machine learning. What we're going to do here is very different. It's not that you get a set of keywords with a single answer based on a training data set. No: you have a keyword, there are results, and based on the training data we know that certain results have a higher value. So Learning to Rank is a class of techniques that applies supervised machine learning (supervised, because it uses training data) to solve these ranking problems. So far, so good.
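To make the numbers above concrete, recall at K asks: of everything rated relevant, what fraction appears in the top K results? Precision at K asks: of the top K results, what fraction was rated relevant? Here is a minimal sketch in Python; the document IDs are made up for illustration, not taken from the actual Umami demo.

```python
def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k results that were rated relevant."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / k

def recall_at_k(ranked_ids, relevant_ids, k):
    """Fraction of all relevant results that appear in the top k."""
    top_k = ranked_ids[:k]
    return sum(1 for doc in top_k if doc in relevant_ids) / len(relevant_ids)

# Hypothetical ranking for "chocolate cake", as document IDs returned by a search.
ranked = ["hazelnut-icebox", "flourless-cake", "halloween-cake", "banana-bread",
          "chocolate-cake", "tiramisu", "brownies", "carrot-cake", "scones", "waffles"]
relevant = {"chocolate-cake", "flourless-cake", "halloween-cake", "brownies"}

print(recall_at_k(ranked, relevant, 5))   # 3 of the 4 relevant docs in the top 5: 0.75
print(recall_at_k(ranked, relevant, 10))  # all 4 relevant docs in the top 10: 1.0
```

In the talk's terms, a recall at 5 of 0.75 with a recall at 10 of 0.95 means a quarter of the relevant results are stranded below position five, which is exactly what the re-ranking step later tries to fix.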
I won't go too much in depth, but you need to understand these basic concepts before we can go back to Drupal. Researchers at Microsoft did some research and came up with RankNet. At the bottom you see an example of how Spotify does matching between you and someone with a very similar music taste. How do you figure that out? They look at the combined difference between the music you listen to and what someone else listens to, because that's what makes you similar; once they figure out that your tastes are alike, they can propose music. The measure counts inversions in the ranking, as it's called. Then they did some more iterations, agile working, with LambdaRank and LambdaMART. And this is where it becomes important for our use case: here we're trying to figure out how far off a result is from where it should be when we try to replicate a specific search query. In the arrows you see, the bigger the red arrow, the higher the cost of moving that result to the top. You don't need to know anything else for this; these are the basics.

Back to Drupal and Solr. How many of you have worked with Drupal in combination with Search API, or in Drupal 7, the Apache Solr module? I should probably reverse the question: how many have not? Nobody. Everyone has worked with that module. Awesome. There's an issue in the core ideas queue to get it into core, if you want to help. There's also an Elasticsearch module; in practice and in theory it's the same, as it's all powered by Lucene.

There are a couple of concepts. There's a document. The document contains fields, and each field has a type. The type could be fulltext, a number, or a string. Fields are normalized by processors. A processor could say: I see a word, and I know that word can be split in two. For example, in Dutch there is the word "ereloonsupplement", for those who understand Dutch.
And that word can be split into "ereloon" and "supplement". This is quite tricky to do, because you get into natural language processing territory; that's why you use technology like Lucene to do it for you. Don't try to do this in Drupal. It's not made for that. All of this is stored in an index, and each document is standalone by default. It's very different from a regular database: in a database you have connections or relations, but here you have a single document per result. There is one caveat: there are cases in Solr where you can refer or relate to things, which can be used for personalized search, but that's a different use case. 99% of you will probably stay with single documents. The index is built so that tokens can be generated, and then you have an inverted index based on words: you find out where a word is located, which document that is, and you return that document.

Maybe some Solr 101 with a query. I don't know how many of you have looked into the Solr log or seen the raw query; I suppose there would be fewer hands if I asked that. So let's go through a query generated by Search API Solr 8.x-3.x. In this case, we're searching for "cookies". You can see there's a q parameter, with "cookies" a little further along. We limit the search purely to whatever we generate from a Search API processor called rendered item, but I also want to find out if the keyword is in the title. The rendered item has a weight of 1; the title has 5. So it's a 5-to-1 ratio in importance, and that's how heavily we boost each field. In Search API Solr 3.x, this was added as a separate parameter. Then we also tell Solr: return the ID, the score (which is how relevant it is), and then the title and the URL. So far, so good.
Because Search API Solr allows you to index multiple indexes, or the same item multiple times in different indexes, in the same Solr core (which is a little tricky), you can limit the query and say: I only want to see this index, from this site hash. Even if you, for example, put all the content for development, staging, and production in one Solr core, you could limit the query to this site and this specific environment. The ideal case, or best practice, is to have separate cores. And then at the end: I want 10 results back. All still following?

So, what is relevant? I want you to think a bit about this concept. If you start to create search pages in Drupal, how do you configure Drupal to show the most relevant results? Take restaurants as an example: when searching for a restaurant, what criteria make a certain restaurant appealing? Can someone shout an idea? There's this microphone; who wants to answer? Good food. Good food! And how do you see good food in your index? Maybe with a rating. So the rating of a restaurant could be a very interesting factor.

If we do this with articles and recipes and pages (this is an example I made), you could use the number of tags. Or freshness: not necessarily ingredient freshness, but how recently a recipe was added. You could use the number of comments added to a specific article or restaurant. All of these are concepts that today, in Drupal and in Solr and Search API, are not really thought of. You could use these factors as extra relevancy numbers to boost the relevancy. Now, luckily, in Solr 6.5 and 7, this was added as part of the whole Learning to Rank concept: instead of just one score field, you can add multiple score fields.
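Pulling the earlier query walkthrough together, here is a rough sketch of the parameters in such a generated Solr query. The field names (`tm_rendered_item`, `tm_title`), the index ID, and the hash are illustrative assumptions, not copied from a real site, and a real Search API Solr query carries more parameters than this.

```python
def build_query(keywords, index_id, site_hash, rows=10):
    """Sketch of the Solr parameters behind a Search API Solr 'cookies' search."""
    return {
        "defType": "edismax",      # the multi-word parse mode chosen in Views
        "q": keywords,
        # 5-to-1 boost ratio: a title match counts five times as much
        # as a match in the aggregated rendered item.
        "qf": "tm_rendered_item^1 tm_title^5",
        # Only this index, from this site/environment, since one core
        # may hold several indexes.
        "fq": [f"index_id:{index_id}", f"hash:{site_hash}"],
        # Return the ID, the relevancy score, and the display fields.
        "fl": "id,score,tm_title,sm_url",
        "rows": rows,              # return 10 results
    }

params = build_query("cookies", "umami_index", "abc123")
print(params["qf"])
```

Sending this dictionary as query-string parameters to a Solr select handler (for example with an HTTP client) would reproduce the kind of raw query shown in the Solr log.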
The regular score is based on matching strings within the full text. These extra fields are features, and that's literally what they're called: feature definitions. You say, OK, this one is from the description, this is the number of tags, this is the number of comments, this is the rating of a specific restaurant. You could even go a little crazy and do ratings per individual, and then have personalized search, because the relevancy will differ per person.

Back to Drupal. The search best practices can be found at the URL on the slide (it's not really pronounceable; just search for Drupal search best practices), and the code is on GitHub. It was a combined effort of the search ecosystem maintainers across a couple of sprints at previous events to put an example together. But the database-backend best practices are different from the Solr best practices, so let me give you a little sneak preview of what I think the Solr best practices are. It's OK to differ in opinion, obviously.

If you want to index your content, you should probably stay as close as possible to what Google does: index the whole content. You probably shouldn't care too much about specific fields. Don't try to add fields to the index that aren't visible, except maybe meta tags and those kinds of things. There's an easy processor to say: I want my rendered item, I want to see it as anonymous, and I want a specific view mode, because maybe you don't want the field labels to show, if you use view modes that way. Be careful with Layout Builder: it can have unintended side effects. If you use Layout Builder on an individual basis, with different variations per content item, you won't be able to index blocks added through Layout Builder with the rendered item. This is a discussion we still need to have: if Google can see your meta tags, why shouldn't your index see your meta tags?
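As a sketch of what such feature definitions look like: Solr's LTR contrib module accepts a JSON list of named features, which is uploaded to a feature-store endpoint. The feature classes below are part of Solr's LTR module; the feature names and Drupal-style field names (`ds_created`, `its_comment_count`) are illustrative assumptions, not taken from the talk's demo site.

```python
import json

features = [
    # Keep the regular full-text relevancy score as one feature.
    {"name": "original_score",
     "class": "org.apache.solr.ltr.feature.OriginalScoreFeature",
     "params": {}},
    # Freshness: a function query that decays with document age.
    {"name": "freshness",
     "class": "org.apache.solr.ltr.feature.SolrFeature",
     "params": {"q": "{!func}recip(ms(NOW,ds_created),3.16e-11,1,1)"}},
    # Engagement: read a numeric field straight off the document.
    {"name": "comment_count",
     "class": "org.apache.solr.ltr.feature.FieldValueFeature",
     "params": {"field": "its_comment_count"}},
]

payload = json.dumps(features)
# Uploading would be an HTTP PUT of this payload to the core's
# /schema/feature-store endpoint, e.g. with an HTTP client of your choice.
print(payload[:60])
```

The model trained later can only weigh features that were defined here, which is why the talk stresses that designing this feature set is the hard, site-specific part.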
If you add keywords, even though Google maybe doesn't look at them anymore, they can be a very important feature for you to say: on these keywords, I want to increase my relevancy. There's a patch for the Metatag module to expose meta tags to Search API as an index field, OK? All the other fields you add are probably not very useful, except if you want to show them as facets, as filters on the site, or use them in the feature definitions for the machine learning model.

Then, in Views itself, there are a bunch of different parse modes. I suggest using "multiple words" with edismax; then you get a query like the one I showed earlier. These are the fields you want to search, because you indexed those two: the title and the aggregated field. There are a couple of processors you can enable. The highlight filter shows a little snippet and where the word was found in that snippet. The HTML filter strips markup: you don't want HTML in the index. I'm not sure you ever want to search for the strong tag; I'm not entirely sure what kind of site you'd be building then, but probably you want the content itself. You could also add some type-specific boosting, to say that a blog is more important than a news article, or vice versa.

Then there's a little trickiness in getting those snippets into the view, so you can see that Solr did the query, found a document, and can show where in that document it found the keywords. In the Search API index, you enable two checkboxes: retrieve result data from Solr, and retrieve highlighted snippets. Then in the view, you check another box, "create excerpts", even if there are no keys available; otherwise you get empty snippets. Then you say from which fields you want them, and in the field itself you can add it to the search excerpt or to the fields.
You can even use fields in Views instead of showing the whole view mode: you say, I want the rendered item as a view mode, as a field, and then add the excerpt, because it doesn't come from Drupal; it comes from Solr. And there's another checkbox you need: use highlighted field data. I think this process should probably be simplified a bit, but it's certainly possible.

So back to Learning to Rank. We have the best practices for Drupal and Solr. This is the hospital case I was talking about, where we did the training: this is the keyword I entered, and these are the results the customer clicked on, relevant or not relevant. After a little while, we were able to produce statistics on how well our search index was performing. Before applying these best practices (and for context, this was Drupal 7, so a different setup), we were at 49% of the items we expected in the top five actually being in the top five. That's the recall metric. If we look at the top 10, it's actually quite disappointing, because it only added 6 percentage points of the relevant results. That means 45% of the expected results didn't show up on the first page. And if you look at Google's statistics, you probably just click away if you don't see it on the first page; there's kind of a support group for people who actually click through to page two in Google. It's the same on your Drupal site.

After the optimizations, without the Learning to Rank algorithm, we got the top five up to 76%, just by applying the best practices within Drupal. The top 10 went up from 55% to 81%. That's without the machine learning. If we then apply the machine learning model, we got up to 85% for the top five and 91% for the top 10.
That's quite a huge boost, resulting in a lot fewer support calls to that hospital. You can see there was a tool built for this, inspired by the FOSDEM talk I saw, where you see all these different parameters per keyword. In a sense, you go really deep into your search relevancy optimization using mathematics. You don't say: for this word, I want to see that result on top; for that word, I want this one on top. If you start to optimize your search that way, you'll probably get lost in the long run, and it won't be very performant either. As we saw, the recall should get as close to 100% as possible: of all the relevant results that we rated, how many did I see on that first page? If you want more detail, there are a lot of resources out there on this.

With the RankLib library from the theory at the beginning, we can actually generate a model, and it's natively supported by Solr. You generate a JSON file (I'll show the model in a bit), upload it to a REST API endpoint of Solr itself, and enable it. Let's take a look at such a model. I'm going to scroll quite quickly. This is a JSON model generated from the Umami site. The main thing you should notice is that there are no keywords in there. There's no reference to any node ID or entity ID, nothing like "chocolate cake" or any of that. The only things it contains are the features we defined and the optimized weights the training assigned to those features depending on the results. OK. How do you then apply this in Solr? It's very simple: you add one parameter to that same query, the rq parameter. You pass the query words, "chocolate cake" for example, and the model that was generated by the tool.
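As a sketch of that one extra parameter: Solr's LTR module re-ranks via an `rq` query, naming the uploaded model and how many of the top results to re-rank. The model name `umami_model` and the `efi.query` parameter name are assumptions for illustration; the exact efi (external feature information) names depend on how your features were defined.

```python
def add_rerank(params, model_name, keywords, rerank_docs=100):
    """Return a copy of the query params with an LTR re-rank query added."""
    params = dict(params)  # leave the original query untouched
    params["rq"] = ("{!ltr model=%s reRankDocs=%d efi.query='%s'}"
                    % (model_name, rerank_docs, keywords))
    return params

base = {"defType": "edismax", "q": "chocolate cake",
        "qf": "tm_rendered_item^1 tm_title^5"}
reranked = add_rerank(base, "umami_model", "chocolate cake")
print(reranked["rq"])
```

Note that the base query is unchanged: Solr still fetches its normal top 100, and only that slice is re-ordered by the model, which is why this cannot pull in documents the original query missed.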
Essentially, you are re-ranking the first 100 results using the model: you re-rank those 100 to get a new ordering. There's a Drupal module out there, Search API LTR, with a Drush command to generate the model. You can see the model in Search API itself and select it, train it from the view itself, and then re-rank your results in real time. So how can you do this? It's out there. The hard part is defining the feature sets: you need to figure out what your data model is and what is more relevant to you. There is no magic bullet.

If we then apply this to the results, and we'll do that in the demo as well, you can see that the three green ones we selected used to be scattered everywhere, and now they float to the top. The really nice thing about this method is that it doesn't use any keywords. That also means that for future keywords, it will still use the model and try to predict what is relevant. Even if you only train on half of the keywords you think are really important, most likely the rest of your search results will also become more relevant.

All right, let's see a little demo. Here we see the statistics; I switched the numbers around a little. The recall of the top five, on the left: of all the results I deemed important (I added a whole bunch of words in this tool to get the statistics), 75% float into the top five. So, let's see about our chocolate cake. This flourless chocolate cake, I think, is very relevant. A little Ajax-y thing happening. And because it's Halloween, I also think this creepy crawly cake is very relevant. If we then go to the backend of the Search API interface, you can see a little tab was added, and at the bottom is the result of what I just selected as relevant or not relevant.
So you remember the machine learning part where you have a training data set? This is your training data set. And you can see, added at the end (it's a little tricky to see here), chocolate cake has two results that were added.

Then with Drush you have a command, and you pass it the view where you did the training, because in Search API, in Drupal itself, you basically create search pages using views. So the page you want to train from, to create your model from, is a view plus a specific display ID. In this case, it's the search view with page_1. It will execute all the searches that I did, including chocolate cake. It will look at the training data to see what is relevant and what is not. It will also look at the scores of all the different features, send that to the RankLib Java library to generate the model, and upload it into Solr using the Drupal APIs.

So now you can see it was added, and if we take a look here, there's a second model. Sorry for the screen, but I'll apply it. Let's see if our chocolate cake gives a different result. All right, very cool: now I have the flourless chocolate cake on top and the creepy crawly cake second. You can see that even though I rated it as highly relevant, it doesn't jump all the way to the top, because the model needs to take all the other search query words into account; it needs to be generic enough to work across all keywords. If we then look at our statistics (this part isn't in Drupal itself yet; we intend to port it to Drupal too), you can see that now, 91% of what we marked as relevant is in the top five. The top 10 number doesn't really tell us much, because we don't have enough results, so it will always be close to 95%. But for the top five, this is very important.
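The training data that command assembles is, at its core, a set of judgment lines in the SVMrank-style text format RankLib consumes: a relevance label, a query ID, then numbered feature values, with the document noted in a trailing comment. A minimal sketch, with hypothetical feature values (original score, freshness, comment count) for the chocolate cake query:

```python
def ranklib_line(label, qid, feature_values, doc_id):
    """Format one (query, document) judgment as a RankLib training line."""
    feats = " ".join(f"{i}:{v}" for i, v in enumerate(feature_values, start=1))
    return f"{label} qid:{qid} {feats} # {doc_id}"

# label 1 = rated relevant, 0 = rated not relevant; qid groups rows per query.
lines = [
    ranklib_line(1, 1, [7.2, 0.9, 3], "flourless-chocolate-cake"),
    ranklib_line(1, 1, [6.8, 1.0, 0], "creepy-crawly-cake"),
    ranklib_line(0, 1, [7.5, 0.2, 5], "hazelnut-icebox-cake"),
]
print("\n".join(lines))
```

Because the rows carry only feature values and labels, never keywords or entity IDs, the model trained from them generalizes to queries that were never rated, which is the property the talk keeps coming back to.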
And now we've actually optimized the relevancy of our search pages based on feedback, based on training data. It created a model, and using that model we re-rank the results in real time to show more relevant results in the top five of our search pages. That's awesome, no? The tool also lets you quickly check the difference: here, chocolate cake was the original, and you can see the difference. If we have time, we can maybe try a live keyword from someone in the audience, if you don't believe me with this sample set. But that's basically the conclusion: we really optimized the search relevancy of this thing.

So we compiled a learning set, trained a model, and uploaded the result to Solr. It's natively supported from Solr 6.5 onward. If you want to use this in Elasticsearch, you have to recompile Elasticsearch and enable certain flags to get it up and running. I'm not saying it's impossible, but you'll have to swim against the stream a little. We use this model during our queries, in real time, to re-rank the results based on the trained model. So you always ask for the original search, and then you change the order. But it's insanely important to have a good data model first, which means getting all the search basics covered first. It doesn't make any sense to start doing this if you don't have that basic stuff covered.

At Dropsolid, the company I work for, we enable and allow this out of the box for machine learning; there are a couple of switches you need to flip in Solr itself to get this running end to end. And then, obviously, the company pitch: if this talk got you interested in working with us, we're hiring. Come see me if you're up for challenges like this. Also, help us move Drupal forward. You saw in the keynote that a lot of help is needed, and as Dropsolid we intend to do our part. There's a little survey you can fill in.
In exchange, for each response we will donate 15 minutes of a core contributor's time, by paying someone for those 15 minutes. I think the counter is above 1,000 minutes already. I'm also part of the organizing team of Drupal Developer Days, which will happen from the 6th to the 10th of April 2020. Maybe we can all sprint together on search-related items or other initiatives. It's in our hometown, so I hope to see you there. Also, tonight there are the Splash Awards. I'm part of that initiative too, so I'm hoping to see all of you there; maybe some of you even entered a case and could win. And I was reminded by Baris that there is a dinner for strangers tonight. I think you can still sign up; he's the person in green over there. Sign up before 5.

With that, we still have around 10 minutes left, so I'll open the floor for questions. We have these little throwable boxes that act as microphones. Does it work? OK, there's another box over there, so you can already throw it to the next person with a question. Yes, question there.

So this is just rearranging the positions; it's not filtering the data, these models? Correct. You will always have the same data; it just re-ranks it, as the name says. Maybe the other box, if there's another question. Just throw it to some hands. For the person here in front, there was a blue box over there as well.

The metrics you're using are precision at K and recall at K. Have you tried other metrics, like MAP at K (mean average precision) or NDCG? So, the RankLib library offers a couple of algorithms to try. I can show it here in the code. It's hard to see, and I don't know how to zoom in, but there is RankNet, RankBoost, AdaRank, Coordinate Ascent, LambdaMART, ListNet, and Random Forests. There's also linear regression. Those algorithms can be used in this library.
When testing them, I found the best results with LambdaMART and a specific metric parameter, NDCG at 10. But it's up to you to experiment with that. It doesn't really matter for Solr; it accepts any of these formats. I don't know if that answers your question? It might. Because you're training with NDCG, right? That's a metric to calculate the relevancy of the result. But in your presentation you're reporting precision and recall, so I'm a bit confused, because they don't match. I'm happy to hear you out on the mathematical side of that. This is also how I learned it, so maybe it's wrong, and hopefully I can correct it for next time. Yeah, I'll talk to you after this. Perfect, come up afterwards.

Hello. I was wondering how much content this would be useful for. You showed that Umami has only about 10 pages. Starting off, how many pages would this be useful for? That's also a good question. It also depends on how often your search pages are used. It can be very useful to look in Google Analytics at how long people stay on your search pages. If you don't get the conversion from those search pages that you expect, this could most likely help. And then it doesn't really matter how many items you have. Obviously, if you only show 10 results and you only have 10 items, it doesn't make sense, except maybe to float things to the top; but I think with some Drupal-specific optimizations, you could already solve that. Maybe also worth adding: if you use the database backend and it's good enough for you, then you cannot use this, so that's also kind of an answer.

I also had, if that's OK, another question. What happens if content is rewritten? Would that be re-indexed, and would that matter for the model? It doesn't matter, because the model is not based on keywords.
It's not based on content. Even if you add new content or delete content, the model stays active and valid. That's why it's more interesting than using the elevate.xml file in Solr, which pins specific document IDs to the top for certain keywords, and that's not very maintainable. Thank you.

There are more questions. I see another hand there. I don't know if there are more. Yeah, go ahead. OK. So I assume users can affect the rating globally, so it affects the results for everyone else as well? There are a couple of things you could do there, I guess. I don't know if your question was finished already. No, it wasn't. I've learned that if you let users do that kind of thing globally, they will input anything, including things that are not relevant, or crap. Do you have any ideas to work around that? What I showed is for the site builder, not necessarily for the user or visitor of the site. You could, for example, use Google Analytics to see which results were clicked for certain words and create your own training data set based on that. The module itself doesn't make any assumptions about how you generate this data: it's a JSON file with document IDs, and it generates a model from that. These little widgets are just a helper for site builders. OK, thanks. Thanks.

I was wondering: you showed a lot of examples to do with keywords, but what if users prefer newer content over older content? Would the system also be able to learn that? Yes, that's what I showed here in the feature schema. There's a feature called freshness, which adds the freshness of content as another relevancy factor, and because you train on the data, the model will learn how important freshness is to you. Usually, you can also add this as an extra boost in Solr itself, in the base query, so more recent content will always be more relevant.
And only after that do you re-rank the results based on the model. So, to reiterate: it doesn't stop you from doing crazy things with your query. It only re-ranks the results that come out of that original query. Thanks. Hopefully that was an answer. Yeah, thanks.

Maybe one or two last questions; a couple more minutes. Behind you. You said it should be trained from view pages. Can it be trained for a custom page built with the Search API? With the API, yes, you could, but then you're responsible yourself for how you get those rating buttons in there, or how you produce that JSON data. The view that comes with the module just adds another field to the view. As you can see: this is the title, this is a snippet, and this is the relevancy field in the view. If you can figure out how to do that for custom pages, if you somehow generate this data yourself because you know which content items are more important, by all means, feel free. That's why it's an open text box. I really want this to be as little of a black box as possible, even though it's quite complicated material. Hopefully that was an answer. Yeah, thanks.

Maybe one last question, behind you. Yeah, hi. In a project architecture where we're not using the fully coupled system, where I'm not using the Drupal Search API but just Elasticsearch directly, the Learning to Rank module would not be applicable. What would you suggest for the most relevant search in that case? And you're indexing using Drupal, or somehow differently? Using some libraries independently, not the Drupal search index. If you index by crawling, it's going to be tricky, because you need to be similarly smart as Google at extracting metadata to get structured information, and then do relevancy boosting based on that structured data, like the date it was created, and those kinds of things.
If you figure out how to get that structured data, you can probably do boosting in Elasticsearch on it. And if you want to go crazy, you can recompile Elasticsearch and use its re-ranking capabilities with a model you train, but you'll have to do all those steps yourself, so it might get tricky. Oh, thank you.

All right. Thank you so much for your attention. Hopefully this was useful, and see you around. No baby yet!