So, a little bit about myself. My name is Alessandro Benedetti. I'm a software engineer specialised in search; I work as a search consultant, R&D software engineer and director of my own company, so a lot of business-related things as well. I have a Master's in Computer Science. I'm passionate about Apache Lucene and Apache Solr, and also about integrations, such as machine learning with search, semantics and natural language processing. In my spare time, in summer I play beach volleyball and in winter I snowboard. Andrea, my colleague, is a software engineer, starting in 1999. He's a hermit, so he usually works remotely and on his own. He's passionate about Java, he's a committer of Apache Qpid, and he has been the father of all the projects we'll mention.

We work at Sease. We are open source enthusiasts; we are based in the UK, in London, but we are geographically distributed, and we mostly work on open source technologies, especially Apache Lucene and Solr. Our focus lately is on learning to rank, document similarity (specifically Apache Lucene More Like This and related integrations), search quality evaluation, which we will talk about today, and relevance tuning. And this is just a list of our clients.

OK, so let's introduce what search quality is, what search quality evaluation is, and why it's important. Search quality matters to various stakeholders. The business will be focused on the final users and possibly on the owners of the business: correctness, robustness of the system, reusability, maintainability and extensibility, from a cost perspective. The software engineer and the search specialist are going to be interested in search quality from a different point of view: are we actually providing the best results to the users? Are we using the technology in the best way? Is it testable? Is it reproducible?

And specifically, why is correctness important? It is important in the development cycle. When you have your search team, the developers working on different iterations of a project, they want to communicate the improvements, or the regressions if they happen, to the business layer, and they want to be able to show them. Did this costly operation, which cost you six months of an entire search team, actually produce some value? Sometimes just words are not enough, and you need to show, not the code in this case, because they wouldn't care about the code, but at least numbers and results: yes, we effectively improved search quality according to these metrics.

And here comes how you can evaluate search quality. There are various metrics you can use. Offline measures are independent of your runtime search system, so they can be calculated offline: they don't need to be related to your production environment, to your running server. And we have already been talking about these metrics, for example during the learning to rank talks.
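To make one of these metrics concrete (an illustrative example, not from the talk): precision at 10 is the fraction of the top ten results that are rated relevant, so if 7 of the first 10 documents returned for a query carry a positive rating, P@10 = 7/10 = 0.7. Recall instead divides the relevant results retrieved by the total number of relevant documents in the ground truth.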
Precision, recall, F-measure, NDCG: there are a lot of metrics you can use for the offline evaluation of your search quality, just as in learning to rank. And when you evaluate your search quality you need ratings, of course: you need the ground truth, you need to know what's good and what's not, and then you can calculate the metrics on top of that. Then you have online measures, where you actually evaluate your running system, your production environment: you can check, for example, how engaged the users are. Are they abandoning the search interface? Are they getting a lot of zero-result queries?

And here comes the Rated Ranking Evaluator (RRE), which is open source software that helps the software engineer both in the development phase and in justifying and showing the benefits of recent development to the business. We first showed RRE in London at the Apache Lucene/Solr Meetup in June. At the time it was just me and Andrea, and the project was a couple of months old, so it was a newborn baby. Then at Haystack, the relevance conference by OpenSource Connections, in London in October, we showed the improvements, and of course a lot had happened; first of all, the team had grown, based on contributors.

RRE is effectively a set of tools: it is a framework composed of different libraries and different possible integrations. This slide may be a little bit confusing, but the focus is that you have a core library, which is to RRE what Apache Lucene is to Apache Solr and Elasticsearch, and then you have a lot of different components, some of them on the roadmap, some of them already developed, that provide an interaction with that library. Just to go back a little: these components provide, for example, a Maven integration, so you can run a Maven build to evaluate the search quality of your system. And since we talked about offline metrics: RRE will allow you to calculate offline metrics on your search system, given a set of ratings. This is the set of available metrics; it's not so important to go through them right now. We implemented a set of them, from time to time we will contribute more, and of course other contributors potentially can too.

Let's introduce the way information is modelled in RRE. It allows the user to model the ratings in input, so that the system knows what's relevant and what's not, and it then produces results that are easy to navigate. You have an evaluation iteration that runs on a rating set, on a specific system characterised by a set of configurations. The system can have different corpora, so you can imagine having different collections of data, different domains. Within a corpus you may optionally have different topics; for each topic you may have different query groups, which group queries that are expected to produce the same results; and each query group is represented by the different queries themselves. The metrics are calculated on each query, and then you can also navigate the aggregations: per query group, per topic, per corpus. Not everything is required, of course: you just need the corpus, because you need to identify the collection you want to evaluate, and the queries. If topics fit your data you can model them, but if you don't have them, that's fine, they're not mandatory.

So you have an input layer where you model your ratings: effectively, you need to build the ground truth. You need to identify your topics (potentially), your query groups and queries, and associate documents to each query with a rating.
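To make the ratings file concrete, here is a minimal sketch of the kind of JSON described in the talk. The field names are reconstructed from memory of the RRE examples and may not match the actual schema exactly, and the template name and document id are invented for illustration; the RRE wiki documents the real format:

```json
{
  "index": "tmdb",
  "id_field": "id",
  "topics": [
    {
      "description": "Space Jam",
      "query_groups": [
        {
          "name": "The aliens-and-basketball movie",
          "queries": [
            { "template": "free_text.json", "placeholders": { "$query": "basketball with cartoon aliens" } }
          ],
          "relevant_documents": {
            "2300": { "gain": 3 }
          }
        }
      ]
    }
  ]
}
```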
Building these ratings, as we saw in learning to rank, can be done explicitly or implicitly; that's not the focus of this talk, but we assume that, for RRE to work, you have a set of ratings. Then you have the evaluation layer, which is effectively RRE with its internal libraries: it takes the rating set, runs the queries against your search engine and, based on the results coming back from the search engine, gives you back the evaluation: how is the system performing according to the ratings? Then, for each iteration, whenever you say "OK, I want to calculate the search quality of my system", RRE will output the results, navigable in a tree-based structure. So you can navigate them: I want to see the results at a higher level, I'm interested just in the topics, or just in the corpus results; or I want to go deeper, because I'm a developer and I'm interested in some specific queries that are failing.

This slide is quite dense, but it just shows the way RRE works. We were saying that you can evaluate your system: to do that, at the moment, you need to reproduce the configuration of the system. RRE is currently compatible with Apache Solr and Elasticsearch, and what it does is spin up an embedded instance of your server with the configuration you provide, and use that embedded instance to calculate the results. A corpus, as I already said, is going to be a collection: it can basically be an index, an Elasticsearch collection; generally you want to identify the domain, and then you can run the quality evaluation on that specific domain. The configuration sets are what defines your search server: they can be Solr configurations, for example, or Elasticsearch configurations, the index structure, the way you model your queries. And then you have the possibility of defining query templates: your queries can just be free-text searches, just query terms, or potentially they can use specific boosting or additional filtering. These examples vary, of course: they can follow the Elasticsearch syntax or the Apache Solr syntax.

The ratings you give in input are, from RRE's perspective, a JSON file that defines the queries, defines the documents, and then defines how relevant each document is for that specific query. You can specify the gain per document, or you can just say: for this rating, I have this set of relevant documents.

In the evaluation output you're going to have the same tree as in the rating set, and you can explore it, digging down the hierarchy and seeing the various metrics calculated at the lower levels: the leaf level is going to be per query, but you can navigate up to the query groups, the topics and then the corpora. There are various ways to see the output. You can have basically an Excel spreadsheet with all the results according to what you calculated, or you can have a dashboard, which we will see in the demo, that shows you, for the different topics, query groups and queries, the various metrics: how they performed in the last iteration and how they compare with previous iterations. Because of course it's really important to be able to compare with previous iterations: for example, we got an improvement on these metrics, but we got a regression in comparison with a much older iteration.
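Going back to the query templates for a moment: a query in the rating set references a template file, and RRE fills in the placeholders per query. The following is a minimal Elasticsearch-flavoured sketch; the `$query` placeholder convention is how I recall the RRE examples doing it, so verify against the wiki:

```json
{
  "query": {
    "multi_match": {
      "query": "$query",
      "fields": ["title", "overview"]
    }
  }
}
```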
So there are different aspects. It's going to be useful for the software engineer, who effectively needs to debug specific queries and also evaluate how the system is performing, and it's also good for the business layer, to quickly understand how things are going: they may not be interested in the list of documents per query, whether they are relevant or not, but they may be interested in the fact that, overall, the collection is doing well. Even in the last sprint, after three weeks of development, we got an improvement in precision by this percentage.

There are various questions, of course; it's an evolving system, still in development, improving slowly. Is it possible to persist the evaluation data and then compare it afterwards? "I don't use Java": well, it's written in Java, but it's integrated, effectively through Maven, with other platforms, so you don't really need Java; though if you want to contribute, it is written in Java. And there are various other questions that we'll skip for now.

So let's go on with the demo, which is going to be presented by Andrea and will show you a use case of RRE: starting from the book Relevant Search by Doug Turnbull, we will follow some examples from the book, seeing how you can check the search quality of your search system, understand where the pitfalls are, tackle them with modifications to the configuration, and then see how those effectively contribute to the improvement of the system.

So, good afternoon. I'm Andrea Gazzarini, the guy you've seen in the previous slide; first of all, the 2018 there is my fault, I'm the one responsible, and I will change it later. As Alessandro said, this is a brief walkthrough demonstrating how RRE can be used for doing some search evaluation. We provide a demo repository on GitHub, in our GitHub account, so you can have a look, you can download it, you can play with it. What is an RRE-enabled project? Basically, there is no Java code: RRE is written in Java, but you don't need to know Java, because we have only a couple of requirements (Java itself, of course). All you need to do is provide all those files, all the data that Alessandro described, which has nothing to do with Java, and run some commands. At the moment RRE, as Alessandro said, has a core, but the runtime container, the implementation, is basically a Maven plugin, so you can inject the quality evaluation process into your build process.

For this demo we used some examples from Relevant Search, with a dataset which is an extract of TMDB, and we are using, like in the book, Elasticsearch, a specific version of Elasticsearch, 6.3.2. So the index shape, the Elasticsearch schema, and the query shapes are coming from that book and those Elasticsearch queries. But if you want to start a new project, if you want to create an RRE-enabled project, there is the RRE wiki, which explains exactly how to do that. We can try now; the only thing, at the moment, is that our dependencies are not in Maven Central, so you need to configure the Sease repository in your Maven settings, as explained there. We basically provide, as part of RRE itself, a module which is a Maven archetype: it creates the project and takes, of course, some parameters, related to the search platform that you want to use, Solr or Elasticsearch, your group id, your artifact id. And this is it in the demo; yes: BUILD SUCCESS.
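As a sketch of what this looks like in practice: the skeleton is generated with the RRE Maven archetype, and the evaluation then runs as part of the build through the plugin. The coordinates, version and goal below are from memory of the RRE documentation, so treat them as assumptions and verify them on the wiki (for Solr you would use the Solr flavour of the plugin instead):

```xml
<plugin>
  <groupId>io.sease</groupId>
  <artifactId>rre-maven-elasticsearch-plugin</artifactId>
  <version>1.0</version>
  <configuration>
    <!-- here you tune the evaluation: e.g. which metrics to compute and
         which fields to include in the output; the wiki lists all the
         available parameters -->
  </configuration>
  <executions>
    <execution>
      <goals>
        <goal>evaluate</goal>
      </goals>
    </execution>
  </executions>
</plugin>
```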
So at this point I've created an RRE-enabled project that I can load into my IDE, and here, just to have a quick look, this is what was created. You can see that under the src/etc folder, which is the default but can be changed through the archetype, there are several subfolders, and each folder contains the data that Alessandro explained: the configuration sets, which are specific to Elasticsearch; the corpora (not TMDB here, just a simple one); and, following the same structure, the ratings example and some query templates. The archetype also provides a couple of sample iterations, 1.0 and 1.1, with some configuration. This is the basic configuration, and here, in the second iteration, you can decide which metrics you want to see; you can use the default settings; and you can also configure the fields that you want to see in the results, and some other things. The RRE wiki includes all the available parameters for tuning the evaluation process.

On another workspace I have cloned the demo repository, and here we can see a more complete example, with data from Relevant Search loaded, and we have 7 iterations. What is an iteration? It is basically an internal or external development iteration; at the very beginning we would have only one iteration, the first one, 1.0 or 0.1 or whatever you call it. A version is composed of an index shape, I mean the configuration of Elasticsearch, the mappings, and here we can see that we start with the default configuration, empty, the schemaless mode. The corpus is the TMDB dataset, and we have the main input file, which contains the ratings: for each topic, for each query group, for each query, we describe the relevant documents that we expect, in order to execute the evaluation process and get the evaluation measures.

So if I start RRE, it will scan all the available versions, which are 7 in this case. Let's assume that we are at the very beginning, so that we have only version 1.0: I can use a filter, which is one of the available Maven parameters, so we can run the evaluation process using only version 1.0. On the other side I have the RRE console, which is basically an Angular web app which listens for evaluation data and gets automatically refreshed when I run a build.

So, if I run the evaluation process, you can see that here we ran only one iteration, configuration version 1.0. In the configuration file I set only three metrics: precision at 1, precision at 10 and NDCG at 10. On the left side you can see the RRE domain model: the corpus, which is called tmdb.bulk, the two topics, Space Jam and Star Trek, the three query groups under those topics, and finally the queries, "basketball with cartoon aliens" and "Patrick Stewart". Here I can see the values of the computed metrics: the evaluation session has been executed and RRE correctly computed the requested metrics. Now, this information is not really useful until I create a new iteration, because it is in that way that I can see in which direction my system is going, whether there are some improvements. Here I can see, for example, that "basketball with cartoon aliens" is not performing so well, actually badly, because all three measures are zero. And there is another piece of information the console can show you: for this query, for each version, I can see the top ten results, where red means not relevant and blue means relevant. So here I can see why we have all those zeros.
We won't go into the details of the differences between the several iterations, but let's say that at a certain point in time the user created version 1.1. Here the difference is that, instead of relying on the plain schemaless, type-guessing mode of Elasticsearch, the user properly configured a couple of fields with the English analyzer. So if I enable the second version as well, RRE will execute the evaluation again, but this time it will use two versions, and we get some comparison data: for each metric we can see, for version 1.0 and version 1.1, the metric value at query level and how it is aggregated. So I can see that at corpora level I have a loss of 0.5, so something is not going as expected. "Basketball with cartoon aliens" is still not good, because I still have all zeros, and if I expand the query and the results for each version, I can see that I improved the number of results, because we moved from more than 1000 to 84 items, but they are still all red. So there is some difference, they are not the same documents, but we are still at a bad point, because, you know, you still have those zeros.

So the developer created another iteration; let's see version 1.2 and re-run the evaluation. In this case the difference between version 1.1 and 1.2 is in the boosts the user gave to the searchable fields in the query, not in the index shape. Once the build is completed, I can see that this time "basketball with cartoon aliens" performed well, because the precision at 1 is 1, which means the first result is good. The precision at 10 doesn't seem to have a good value, but if we have a look at the rating file, at the ratings the user gave for this specific query, he configured just one relevant document, Space Jam; so from the precision-at-10 perspective this is the maximum value we can have (one relevant document in ten positions, 1/10 = 0.1), because there is only one relevant document.

At the same time, and this is another thing that RRE allows you to do, you can see the effect of the iterations not only horizontally, on one query, but also the side effects, the regressions, on the other queries. And here you can see some red buttons somewhere: that means we probably fixed the first query, but something went wrong in some other query, specifically Star Trek, "Patrick Stewart". Between version 1.0 and 1.1 there has been a loss. This is the precision at 1, the first column: you can see that in the first iteration the top item was relevant, and then it is no longer there, and that's the reason why we lose; in the third iteration another relevant item came back to the first position. The same for the precision at 10, the second column: we moved from 7 relevant results in versions 1.0 and 1.1 to 4 relevant results, so it's not so good. But you can do this iteration as many times as you want, and the iterations don't necessarily need to be releases; they can also be experiments.

So you can do things like this one. On this monitor it is probably a little bit messy, because here I am enabling all the versions, so the result is, just a moment for the build, well, you should probably have a big monitor for dealing with such data. It's interesting because here I actually enabled three metrics, but usually you work metric by metric, especially if you have a lot of queries, especially if you have a lot of versions, and in this way you can trace a specific query and analyse why it is behaving in that way compared with different versions: why was it good in the first version, while after seven iterations things are not so good?
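To make these iteration changes concrete: as I understand the demo, the 1.1-style fix is an explicit mapping instead of schemaless guessing, something like the following Elasticsearch 6.x sketch (the field names follow the TMDB examples in Relevant Search; the demo's actual mapping may differ):

```json
{
  "mappings": {
    "movie": {
      "properties": {
        "title":    { "type": "text", "analyzer": "english" },
        "overview": { "type": "text", "analyzer": "english" }
      }
    }
  }
}
```

The 1.2-style fix would then live in the query template rather than in the index shape, for example boosting the title field with something like `"fields": ["title^10", "overview"]`.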
So you can do this kind of analysis. However, you can also use the filtering capability of the Maven plugin to reduce the number of iterations that you want to consider. Let's assume that we are at the end of our story: I have seven iterations, and if I run all of them I get a huge console, while I could be interested only in seeing the difference between the first and the last iteration. A couple of months ago we had 1.0; now, after a couple of months, we have seven iterations: OK, let's see what happened across all those changes. Here it is probably a little bit small, but you can see that we have a lot of green, green or yellow, and that means that our system is stable through the different iterations, or that there has been some improvement. For Star Trek, "Patrick Stewart", there is something that is not so good in terms of precision at ten, because we moved from seven to four, so there is something there that needs to be checked and probably fixed. Doing this, I can control my system and I can avoid, as much as possible, regressions, unexpected side effects of changes that I'd like to introduce, for example to enhance or improve some query. It allows me to have a look at the whole system.

This is of course a small example, because we have three queries, but I expect that in a production system the rating files would contain a lot of queries, and that's the reason why we modelled RRE with that kind of tree: it helps you to divide, to classify things into topics, query groups and so on. Sometimes, like in this example, this classification is a little bit too much, especially when you have simple systems or few queries, and that's the reason why a lot of those nodes, query groups, topics, are optional: you can omit them and have just a flat list of queries. I think that is more or less it. We are also working in general on this RRE console; for example, a nice addition would be some kind of popup that shows the Lucene explain, for understanding why an element is there in that position. We have a lot of ideas about how to improve the framework, and the framework is open source, so if you have some feedback, some ideas, whatever: any feedback is warmly welcome.
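As a concrete note on the version filtering used in the demo: with the plugin bound to the build, a run restricted to a subset of iterations would look roughly like the line below. The property name here is hypothetical, picked for illustration; the RRE wiki documents the real Maven parameter:

```sh
# evaluate only the first and the last iteration instead of all seven
mvn clean install -Dversions=v1.0,v1.6
```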
What should we talk about next? We can go to the future works. So, thank you, Andrea; we've now seen how you can use RRE effectively on a demo project. This example was on Elasticsearch, but of course it is valid for Apache Solr as well; it strictly depends on your system, so, configuring the queries, configuring your system: the config set directories we've seen in this example were for Elasticsearch, but you can imagine that if you have Apache Solr as a server you will do a similar thing.

From the perspective of future works, we are working on the integration with Solr of a rank eval API: basically giving the opportunity of integrating Apache Solr directly with RRE, to push the ratings to Solr and get back directly the evaluation of your system. Then we are working on a Jenkins plugin. This would be quite good to integrate into your continuous integration and continuous delivery system, because you would like to see, after a release for example, how the system is doing; so it would be quite good to explore the performance of the system with a quick dashboard directly from your CI tool, like Jenkins, for example, or Atlassian Bamboo, or whatever it may be.

And one important point we are working on is how you build the input, how you build your rating set. Building the rating set is actually the most tedious part, because if you do it explicitly you will have the same problem as in a learning to rank project: you will need people, you will need experts, you will need a crowd to do that, and they are not going to like just putting lines in a JSON file. That format is good for RRE to take as input when it comes from a machine, but it is not very good for a human: it's readable, but not very writable, let's say. So there are two possibilities we are working on at the moment. A judgement collector UI: basically you show to the user a couple, a pair of documents given a query, and the user will be able to say which one is good, or which one is better, actually. Or implicit collection: you actually take the user interactions, which can be clicks, add-to-carts, sales, whatever is relevant in your business, then you evaluate their distribution and you extract the ratings automatically. And we have the same considerations that we saw at the learning to rank talk: implicit feedback is cheaper and much bigger in numbers, but noisier, while explicit judgements are of course going to be much more precise, if you have a good set of experts, but much more costly. But this is a way you can improve the workflow: currently, as RRE works right now, you need to provide the JSON and effectively you need to write it by hand, so we are going to improve that side.

And once you collect the data, once you collect your ratings, this is actually going to be quite useful, because from the data you collect you could make an integration, an indirect integration, with learning to rank, and build the training set out of the data you collected. At that point you will be able to evaluate your search and also actually improve your search relevancy. This is just a brief slide about the problems you will have when you move from user interaction data to a rating set, and possibly a training set for learning to rank; we don't have the time to explore it, but the slides are available, so you can take a look on your own. And the same goes for how you estimate the relevance labels: for example, you may want to take into consideration the click-through rates, so impressions and clicks, or bookmarks, add-to-carts, sales, and based on those estimate a relevance label on a scale from 0 to 4, for example.
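A minimal sketch of what "estimate a label from the click-through rate" could mean, purely as an assumption of mine (RRE does not ship this; real pipelines would also correct for position bias and require a minimum number of impressions):

```java
// Illustrative only: derive a 0-4 relevance gain for a (query, document)
// pair from observed clicks and impressions by fixed-threshold binning.
public final class GainEstimator {

    /** Maps a click-through rate in [0, 1] to a graded relevance label 0..4. */
    public static int gainFromCtr(long clicks, long impressions) {
        if (impressions == 0) {
            return 0;                     // no evidence: treat as not relevant
        }
        double ctr = (double) clicks / impressions;
        if (ctr >= 0.40) return 4;
        if (ctr >= 0.20) return 3;
        if (ctr >= 0.10) return 2;
        if (ctr >= 0.05) return 1;
        return 0;
    }

    public static void main(String[] args) {
        // 42 clicks over 200 impressions -> CTR 0.21 -> gain 3
        System.out.println(gainFromCtr(42, 200));
    }
}
```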
And when you build your training set, if you want to integrate, for example, with Apache Solr Learning to Rank, you also need to extract the features you defined, to build the training set in a readable version for Solr; so you will actually need to automate that part as well, and this can be an improvement: from the user interactions you collect, plus a configuration file for the feature definitions in Solr, you produce the training set.

So this is the summary about RRE: the introduction, the description of the framework, a quick demo and some of the future works. And now it's time for questions; we have one over there.

So the question is: if I have an Apache Solr instance already configured, what will happen? Will RRE actually re-index everything in memory and use an embedded instance of Solr for the evaluation, or what else? The answer is: at the moment you need to provide the configuration set, and effectively RRE will spin up, in memory, in the memory dedicated to RRE, an embedded Solr instance with that configuration and fill the data into that Solr instance. So effectively it's an embedded one, and the index will be on disk, depending on the configuration. This is one way; we are also actually working on, instead of having to provide the configuration set and spinning up an instance, triggering an already existing instance, a QA instance for example. But yes, the quick answer is that at the moment you are going to have that instance embedded in memory, and the index will go on disk anyway.

One thing, and I don't remember if there is a slide about it, is that RRE already has a lot of extension points where you can plug in your own implementations. One of those extension points is the search platform: there is a kind of search platform API which provides an abstract interface of the lifecycle of a search platform as RRE expects it. At the moment we have just two bindings, which are the embedded Elasticsearch and the embedded Solr, but it's very easy, and we are working on that, to create a remote search platform for integrating RRE with a staging server, a staging Elasticsearch or Solr. Does this answer the question?

So, you mean if you switch the entire search system behind? I don't know right now whether it's going to work across different systems, whether they can be compared; I'm just thinking out loud, because it's actually an interesting question. I'm trying to imagine what the constraint would be from the code perspective; theoretically, maybe it requires some implementation, of course, but it's not impossible. The question was whether you can compare different versions where, for example, one iteration is an instance of Apache Solr and the following iteration completely changes the search server to Elasticsearch.
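As an aside on the search platform extension point just mentioned, its contract might look roughly like the interface below. This is an illustrative sketch only, not RRE's actual API: it just captures the lifecycle described above, and a "remote" binding for a staging server would implement the same contract without spinning up an embedded engine:

```java
import java.io.File;
import java.util.List;
import java.util.Map;

// Hypothetical shape of a search platform binding (names invented here).
public interface SearchPlatform {

    // Prepare and start the platform (e.g. spin up an embedded engine)
    // using the configuration set of the version under evaluation.
    void start(Map<String, Object> configuration);

    // Index the test corpus using the given configuration folder.
    void load(File corpus, File configurationSet);

    // Execute one rated query and return the ids of the top results.
    List<String> search(String index, String query, int maxRows);

    // Shut the platform down at the end of the evaluation.
    void stop();
}
```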
Yes, please? So the question is: what if you have a custom plugin, such as a query parser, in your search instance, and you want to provide that as part of the configuration? Effectively, that would be code, and it plays out a little bit differently between Solr and Elasticsearch. In Solr, the virtual machine that is up is just one, so basically, if you have this kind of need, instead of pom packaging for the project you could use jar packaging, so you can also have Java code with your query parser or your add-on, and in the configuration set you just reference those classes and they are picked up. In Elasticsearch we had to make some changes, because the embedded version of Elasticsearch requires, if I remember correctly, that you give it the list of plugins at start time; and the Maven plugin allows you to configure metrics, fields and plugins, so you can have your plugins in the source folder and configure them through Maven.

Thanks a lot for your questions; the rest we will have to take offline. Thanks for your talk. Thank you.