Hello, my name is Claudia Kapp. I'm from Germany and work as an information specialist at IQWiG, the Institute for Quality and Efficiency in Health Care. This talk is about the package we're currently creating for search strategy development using text analysis. I will give a brief introduction to the use case and workflow of the package, and afterwards I'll show you the current status of the Shiny app in the package.

To conduct a systematic search for evidence synthesis, a search strategy is required that returns reproducible results. The goal of the package I will present today is to help identify the relevant terms empirically, in order to create a concise but sensitive search strategy that finds all potentially relevant studies.

Our information retrieval team at IQWiG has been searching for a new tool for a while now. During last year's ESMARConf, my colleagues discussed a few key points that would be relevant for our team in order to be able to use a new tool in our search strategy development process. Our requirements included a graphical user interface, that our IT security team would be on board with the suggested tool, and, in order to be able to exchange ideas with other researchers, that the tool should be open source. A Shiny app would actually meet all these criteria. Just a hint if you want to find out more about the discussion on automation in information retrieval: go check out the panel discussion that my colleagues Elke Hausner and Siw Waffenschmidt joined at last year's ESMARConf 2022. It's called "Building an Evidence Ecosystem for Tool Design".

In terms of functionality, we actually did identify some R packages that could do what we needed for our workflow. The problem was that they would require quite a lot of R skills to handle our use case, and not all of our information specialists feel comfortable using R without a graphical user interface.
So in our Shiny app we basically created a customized workflow using a number of packages that already exist. The packages we currently use most heavily are dplyr and tidyr from the tidyverse, revtools from the metaverse, and of course quanteda, a suite of packages designed for quantitative text analysis. And finally, we use Shiny to create the user interface.

To understand what our Shiny app actually helps with, I will very briefly describe the process we use to develop a search strategy, and specifically I will introduce the concept of overrepresented terms. So let's imagine a very simplified example. Usually, search strategy development starts with a test set of already familiar publications of studies. These serve as a gold standard. I will not go into detail on creating a test set here; that could be a complete presentation all by itself.

Having identified a test set, next up is the frequency analysis of terms in this gold standard of references. This is to see which words occur in all or most of the references that we aim to find in the systematic search. For example, in this made-up example word cloud, "participants" and "randomized" appear equally frequently, and one might thus expect them to be equally good search terms. On the other side, "they" and "of" are obviously frequent but irrelevant terms for a search, while "control" and "placebo" are not as frequent. This means that the mere frequency of a term doesn't yet tell us whether it is also a good search term in a specific database. And here's why. Imagine that we want to search for studies in PubMed. In this simplified example, all our relevant references are indexed there. So let's check the terms from our example frequency analysis as search terms in PubMed. Some terms like "participants" would find all relevant references, but it turns out that they also occur very frequently in the database in general.
This means that these are frequent terms in our gold standard, but compared to all indexed references in the database PubMed, the terms are not more frequent in our gold standard, and thus not very specific or precise. Let's check another term in our gold standard. This looks much better: the term occurs in all of the relevant references, but it's not frequent in the target database PubMed. This makes it overrepresented in the relevant references. To sum up, the goal is to identify those terms that are overrepresented in the test set: they appear frequently in our gold standard, but they are not frequent in other references in PubMed. To find these terms more readily, we created our R package. It aims at identifying overrepresented terms, that is, terms that are sensitive for finding relevant references and that are also precise, so they do not retrieve many irrelevant references in the search result.

Next up, I'll walk you through the workflow in the Shiny app of our package. First of all, one imports a RIS file with a set of references, the gold standard. Next, the package automatically conducts a frequency analysis of the terms in title and abstract and compares them to a random sample of references from the Medline database. In the end, you can download a CSV to continue working with the results outside the app.

The Shiny app has multiple tabs. In the tab "Data import" you are prompted to upload a test set, which has to be in RIS format. If the upload was successful, the references' accession numbers, authors, titles and keywords are displayed in a table. To continue with the statistical analysis, click on the action button "Analyze all references". We're also planning to implement a function to automatically split the test set into a development set for term harvesting and a validation set for a bias check of the identified search terms, but that is not implemented yet.
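To make the idea of overrepresentation concrete, here is a rough conceptual sketch in Python. The package itself is written in R on top of quanteda; the mini-corpus, helper names and the use of a simple two-proportion z-score below are my own illustration, not the package's exact implementation:

```python
# Sketch: find terms that are overrepresented in a gold standard of
# references compared with a random sample from the target database.
# (Illustrative only; the real package does this in R with quanteda.)
import math
import re
from collections import Counter

def term_counts(docs):
    """Count in how many documents each lowercased word appears."""
    counts = Counter()
    for doc in docs:
        counts.update(set(re.findall(r"[a-z]+", doc.lower())))
    return counts

def z_score(k1, n1, k2, n2):
    """Two-proportion z-statistic: document frequency in the test set
    (k1 of n1) versus the database sample (k2 of n2)."""
    p1, p2 = k1 / n1, k2 / n2
    p = (k1 + k2) / (n1 + n2)
    se = math.sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    return (p1 - p2) / se if se > 0 else 0.0

gold = ["randomized placebo controlled trial of drug X",
        "a randomized placebo trial", "placebo controlled randomized study"]
sample = ["observational cohort study of disease Y",
          "case report of a rare condition", "a narrative review of therapy"]

gc, sc = term_counts(gold), term_counts(sample)
terms = sorted(gc, reverse=True,
               key=lambda t: z_score(gc[t], len(gold), sc.get(t, 0), len(sample)))
print(terms[:3])  # the most overrepresented candidate search terms come first
```

Sorting descending by this score pushes terms that are frequent in the gold standard but rare in the database sample, like "randomized" and "placebo" here, to the top of the candidate list, while frequent-everywhere stop words sink to the bottom.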
After clicking the action button "Analyze all references", switch to the next tab, "Free text", to see the results. All words are listed; punctuation is removed, but stop words like "the" and "of" are kept. Next, it is possible to download a CSV file to work with the results more flexibly. Be aware that this is currently a German CSV, with semicolons as delimiters and commas for decimals. The frequency of each term in the test set is compared to the frequency of the term in the random sample of PubMed references; the random sample is part of the Shiny app. The displayed table is sorted according to the z-score, which is the test statistic of the comparison against the PubMed sample. This means that the overrepresented terms in the test set are at the top of the list.

I'll skip the options for analyzing MeSH data. If you're interested in this option, please contact me in the online Q&A or on GitHub. Let's stay with the free-text terms and pick a candidate for the search strategy. In this example, the most overrepresented term appears to be the acronym IQWiG. Maybe you're not aware what this stands for, so let's see the keyword in its original context. This is possible in the tab "Keywords in Context". Just type in the term you are interested in. The resulting table lists all occurrences of the term in their original contexts. You have the option to customize the window of words around the term of interest. It is also possible to truncate terms, so in this example, the term "search" with an asterisk (search*) will find all occurrences of "search" and its plural "searches".

The last tab, "Phrases", offers the functionality to explore n-grams, i.e. search terms that consist of more than one word. Here you can find out how many words are in between two terms of interest. In the first example, you can see that the phrases "search strategy" and "search strategies" always appear as a 2-gram in the test set, which means that they never occur with a word in between. The second example shows all the n-grams for the term "non-inferior".
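The keyword-in-context and phrase lookups can be sketched in the same spirit. Again this is plain Python over a made-up mini-corpus purely for illustration; in the app itself, quanteda's kwic() and related tokenizer functions do the real work in R:

```python
# Sketch: keyword-in-context with truncation, and n-gram "gap" sizes
# between two terms of interest. (Illustrative only.)
import re

def kwic(docs, pattern, window=3):
    """Keyword in context: list each match of `pattern` (a trailing *
    acts as truncation) with `window` words on either side."""
    rx = re.compile(pattern.replace("*", r"\w*") + r"$")
    hits = []
    for doc in docs:
        words = doc.lower().split()
        for i, w in enumerate(words):
            if rx.match(w):
                hits.append((" ".join(words[max(0, i - window):i]), w,
                             " ".join(words[i + 1:i + 1 + window])))
    return hits

def ngram_sizes(docs, a, b):
    """For every co-occurrence of term a before term b in a document,
    report the n-gram size: 2 = directly adjacent, 4 = two words between."""
    sizes = []
    for doc in docs:
        words = doc.lower().split()
        pos_a = [i for i, w in enumerate(words) if w == a]
        pos_b = [j for j, w in enumerate(words) if w == b]
        sizes += [j - i + 1 for i in pos_a for j in pos_b if j > i]
    return sizes

docs = ["the search strategy was peer reviewed",
        "we developed search strategies for medline",
        "a non-inferior if not conceptual margin was chosen",
        "non-inferior to the conceptual comparator"]

print(kwic(docs, "search*", window=2))                  # matches "search" twice
print(ngram_sizes(docs, "search", "strategy"))          # [2]: always adjacent
print(ngram_sizes(docs, "non-inferior", "conceptual"))  # [4, 4]: two words between
```

In this toy corpus, "search" and "strategy" only ever occur directly next to each other, while "non-inferior" and "conceptual" only co-occur with two words in between, which is exactly the kind of information needed to choose adjacency operators.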
So at the top of the list is the frequency of the two terms "non-inferior" and "conceptual" as an n-gram, and the table tells us that they only appear as a 4-gram. So in our test set, the two words "non-inferior" and "conceptual" only appear with two other words in between, and never directly next to each other. This is actually very helpful in order to determine adjacency operators for a search strategy.

So now you know what our package does. You can find the source code in our GitHub repository. Well, there's still a lot of work to do, and thus it's not yet possible to install the package from GitHub, but we're definitely planning to enable that very soon. So what are our next steps? We're currently working on finishing version 0.1, and the plan is to release it on GitHub in the first half of 2023. We are also planning for version 0.2, where we want to rethink our own workflow, and we're very interested in feedback from others who work with evidence synthesis. Our long-term goal is finally to release the package on CRAN.

So in case you would like to get in touch about our work, feel free to contact me either via Mastodon or GitHub. And of course there are also options for posing questions during ESMARConf as well. So thanks a lot for watching this talk, and have fun!