Hello. I am Corrado Lanera, a researcher in biostatistics, epidemiology, and public health at the University of Padua. I will discuss the costumer package and project, which explores machine learning techniques to improve comprehensive searches in systematic reviews. You can reach me through my institutional email, Telegram, or Twitter, and you can find the project, developed by me and our unit, on our GitHub space.

One of the most critical aspects of systematic reviews is being as comprehensive as possible in the initial search, aiming to retrieve all available evidence on the investigated topic. When it comes to clinical trials, most searches are conducted on standard, well-known indexed databases, which support complex queries. However, clinical trial registries are often underutilized, mainly because of their limited search utilities: the absence of hierarchical branching structures, few searchable fields, and few options for complex queries, including combinations of terms. This issue is far from trivial. In 2017, Baudard et al. reanalyzed 14 systematic reviews and meta-analyses originally conducted on standard databases only, extending them to include clinical trial registries. They found increases in sample size ranging from 10% to 50%, and changes of up to nearly 30% in the resulting summary statistics. The reanalysis took them more than a year and a half of work. So, how can we make our search strategy suitable for multiple databases? We can use machine learning tools and techniques for that purpose.
For example, when extending searches to other indexed and searchable databases, we can use tools like Polyglot to translate queries according to each database's specifications. The idea behind costumer, on the other hand, is to train models on the results of initial searches conducted on standard databases, and then to use those models when extending the search to non-indexed databases, such as registries, labeling registry records as relevant or not.

To achieve this, there are two main challenges. The first is converting the vast amount of textual data into a form suitable for training and applying models. The second is that datasets for systematic reviews are highly imbalanced, causing classifiers to be biased toward the majority class. In our investigation, we considered all 14 systematic reviews analyzed by Baudard et al., including nearly 300 positive documents, and sampled over 7,000 negative ones from PubMed. We then applied the trained models to label an entire ClinicalTrials.gov snapshot consisting of more than 200,000 records. We considered support vector machines, k-nearest neighbors, random forests, and elastic-net regularized generalized linear models (glmnet). We also looked into random under- and over-sampling techniques to achieve either an equal split or a one-third/two-thirds proportion between the positive and negative classes, alongside benchmark models that did not address the imbalance. We followed a strict flowchart for training, using cross-validation to tune the models and obtain ranges of results, performing pre-processing strictly within the cross-validation loops and keeping the test sets completely hidden from the training process. So far, we have published two papers investigating the models' results and the impact of the sampling techniques adopted, finding promising prospects for adopting these strategies to include registries in systematic reviews.
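As a minimal illustration of the first challenge, here is a sketch in base R (with made-up example titles; the actual project uses a more elaborate pre-processing pipeline, typically adding stemming, stop-word removal, and term weighting) of turning free text into a document-term matrix:

```r
# Illustrative sketch only: three made-up record titles.
texts <- c("randomized trial of aspirin",
           "observational study of aspirin use",
           "randomized controlled trial")

# Lowercase and split on anything that is not a letter.
tokens <- lapply(tolower(texts), function(x) strsplit(x, "[^a-z]+")[[1]])
vocab  <- sort(unique(unlist(tokens)))

# Count each vocabulary term in each document:
# one row per document, one column per term.
dtm <- t(vapply(
  tokens,
  function(tk) as.numeric(table(factor(tk, levels = vocab))),
  numeric(length(vocab))
))
colnames(dtm) <- vocab
dim(dtm)  # 3 documents x 8 distinct terms
```

A matrix like this (or a weighted variant of it) is what the classifiers are actually trained on.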
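For the second challenge, a minimal sketch of random undersampling of the majority class toward a chosen positive share; the function and argument names here are illustrative, not the package's API:

```r
# Keep all positives and randomly drop negatives until positives make up
# `target_pos_share` of the retained records (illustrative helper).
undersample <- function(labels, target_pos_share = 1/3, seed = 1234) {
  set.seed(seed)
  pos <- which(labels == 1)
  neg <- which(labels == 0)
  # Number of negatives needed for the requested positive share:
  n_neg <- round(length(pos) * (1 - target_pos_share) / target_pos_share)
  keep_neg <- sample(neg, min(n_neg, length(neg)))
  sort(c(pos, keep_neg))
}

# Roughly the class proportions mentioned in the talk:
labels <- c(rep(1, 300), rep(0, 7000))
idx <- undersample(labels, target_pos_share = 1/3)
mean(labels[idx] == 1)  # about one third positives
```

Oversampling works the other way around, replicating (or resampling with replacement from) the minority class instead of discarding negatives.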
The number of retrieved results to be manually screened was lower than the amount retrieved and explored by Baudard et al. in half of the reviews, with a mean of 472 records instead of 572, and a maximum of just over 2,000 compared with a maximum of slightly more than 2,600. Moreover, we missed only a single positive record, while training models with areas under the curve above 90%.

Having explored the project, we will now take a look at the R package and how to reproduce the reported analyses. It is important to note that the primary goal of the costumer package is to provide a means to reproduce the analyses conducted by our unit for the project, rather than to offer a tool for adopting our strategy and suggestions in one's own project. However, our scripts and functions can certainly be used as a template for further investigations, as well as a basis for suggestions on how to improve them. The main requirements for running the scripts, aside from having R and the package dependencies installed, are Java and a server with at least 74 GB of RAM available. You can then clone the GitHub repositories to get all the code locally. Due to the size of the data files, we provide a separate repository for the source, intermediate, and resulting files. You will also need to download these files and place them in the corresponding locations within the local project. All the supporting custom functions are in the R folder of the project, with their corresponding tests implemented in the tests/testthat folder. In addition, you can find the latest version of the scripts used for the analyses in the analysis files. You can choose to rerun the analyses from scratch or download intermediate results from the data repository to skip lengthy computations. It should be mentioned that the entire project was developed some years ago, and both machine learning techniques and best practices have since evolved.
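A hypothetical setup sketch of the steps just described; the repository URL and folder names are assumptions, so check the project's README for the actual ones:

```r
# Hypothetical setup sketch -- URL and paths are assumptions, not verified.
# Clone the code repository (requires git on the PATH):
system("git clone https://github.com/UBESP-DCTV/costumer.git")
setwd("costumer")

# Requirements mentioned in the talk: Java, and ~74 GB of RAM for a full run.
stopifnot(nzchar(Sys.which("java")))  # fail early if Java is missing

# After placing the separately downloaded data files in their expected
# folders, source the supporting custom functions from the R/ folder:
invisible(lapply(
  list.files("R", pattern = "[.][Rr]$", full.names = TRUE),
  source
))
```

From there, one can either run the analysis scripts from scratch or start from the downloaded intermediate results.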
With that in mind, we are now refactoring our code base, adopting our current views on best practices and strategies for reproducibility and for ease of use and development. The main improvements include adopting renv to encapsulate the project's package dependencies, and the targets pipeline structure to allow caching of results, parallel computation, and greater reproducibility. While the original branch of the GitHub project will continue to showcase the original code used for the analyses, the targets-refactoring branch offers a more modern implementation that is easier to run and follow. This branch still reflects the same strategies and reproduces the same results as the original one.

With that, I have concluded my presentation. I thank you all for your attention, and I am grateful to the conference organizers for giving me the opportunity to present our work here, and especially to Nial Hardway for his extreme patience with me. I look forward to your questions and suggestions. Thank you.
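To give an idea of what the targets-based structure looks like, here is a minimal `_targets.R` sketch in the spirit of the refactoring branch; every function and target name below is illustrative, not the project's actual API:

```r
# _targets.R -- illustrative sketch of a {targets} pipeline.
# All helper functions (read_records, build_dtm, ...) are hypothetical.
library(targets)
tar_option_set(packages = c("glmnet", "ranger"))  # versions pinned via renv.lock

list(
  tar_target(raw_records, read_records("data-raw/records.csv")),
  tar_target(features,    build_dtm(raw_records)),          # text -> DTM
  tar_target(balanced,    undersample_majority(features)),  # handle imbalance
  tar_target(models,      train_models(balanced)),          # cached across runs
  tar_target(registry_labels,
             predict_registry(models, "data/ctgov_snapshot.rds"))
)
# tar_make() then reruns only the targets whose upstream inputs changed,
# which is what makes the lengthy computations cacheable.
```

Combined with renv, a collaborator can restore the exact package library and rebuild only the out-of-date pieces of the pipeline.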