Okay, we're now live streaming on YouTube. Okay, thank you. Welcome to ESMARConf 2023 and this workshop on testing semi-automated deduplication methods in evidence synthesis. My name is Edith Kandarani and I'm moderating today's session. This workshop is being live streamed to YouTube and has a group of participants taking part live. A very warm welcome to you all. If you have any questions for our presenters, you can ask them via the @eshackathon Twitter account by commenting on the tweets about this workshop. If you registered for the workshop, you can ask your questions here in the Q&A facility, and you can also comment and chat with other participants on the dedicated Slack channel that was sent along with your registration form. Today's workshop presenter is Kirsten Dismar from the Vrije Universiteit Amsterdam. Kirsten, it's a pleasure to have you here; over to you. Thank you. Thank you so much for having me and thank you for signing up for this deduplication workshop. First of all, I want to say a big thank you to the ESMARConf organizers for setting up this amazing conference series and making it available to watch and re-watch whenever and wherever you like. As a mom of three, this allows me to learn and stay up to date with the latest developments in the evidence synthesis community, on the couch when my kids are asleep. So thank you so much for making that possible. As already mentioned, my name is Kirsten and I work for the Vrije Universiteit Amsterdam and the Amsterdam University Medical Center as an information specialist, supporting literature research in health and life sciences. Before I joined the evidence synthesis community, I used R to analyze ancient DNA as a biomolecular archaeologist at Leiden University, the University of Oklahoma and the University of Copenhagen. I wouldn't be presenting this workshop on such an interesting topic without the other Team DDup members: Krista, Elise, Erika, Floor and Sabrina. Thank you so much, it's an amazing team to be part of. A big shout out to Floor, our latest member, who will also be supporting in the chat. She adjusted the ESMARConf logo to fit our whole team in it and matched it to the ESMARConf color palette. Thank you, Floor. A little bit about our team: we were all part of the local organizing committee of the EAHIL 2022 conference in Rotterdam. During one of the conference days we started chatting about different deduplication methods. We all used roughly the same method, but we had slight differences of opinion about which references to keep and which to toss. And although we were all using some variety of a manual EndNote method inspired by Wichor Bramer's method, we ended up with different numbers of references to screen. That's why we started Team DDup: to figure out which method we could all use that is reproducible, gives the same results, is transparent, and preferably doesn't require too much manual work. The purpose of this workshop is a run-through of deduplication methods using R. Before we do that, we have to agree on what a duplicate is: what is the definition of a duplicate? Then I'll show you some of our preliminary results from our literature research and the results from Dutch DDup Day. We have identified several semi-automated deduplication methods and performed some technical analysis on them.
In this workshop, we will perform deduplication on an available small dataset consisting of 275 references from three databases and on a medium-sized benchmark set consisting of 6,602 references from five databases, using the R package ASySD. If I'm going too fast or too slow, please let me know. Also, just ask questions at any time; post them in the chat and they will be responded to. So why should we care about deduplication, about removing duplicates at all? Well, if you want to conduct a comprehensive literature review, you need to search multiple databases with overlapping content, often through various platforms and providers, for example Medline via PubMed, Medline via Ovid, or Embase via Embase.com or Ovid. As a result, you come across duplicate references. With proper deduplication we can reduce reviewer workload, avoid the unintended removal of eligible studies, and get a more precise assessment of the scope and depth of the literature, in each case limiting potential bias. But duplicate removal, or deduplication, is time-consuming, labor-intensive and resource-intensive. Okay, we now know that it's important to remove duplicates, but now I want to know from you what you understand a duplicate to be. What is your definition of a duplicate? You can go to menti.com and enter the code. I have the screen right here, and I'll pull it bigger so you can actually see it. I would love to see some answers about your definition of a duplicate. I hope everyone can see the code; go to menti.com. "Same DOI", that's the first response that we see, and I hope to get some more. Hopefully there are some more responses than just the same DOI. "The same journal article", "identical bibliographic information". Oh, great, I have five and they're coming in right now: the same DOI and titles, same content, same titles. I also asked this of a chatbot that you might all be familiar with. I asked ChatGPT, what is the definition of a duplicate? It replied: a literature reference duplicate refers to a situation where two or more references in a bibliography or reference list refer to the same source. Well, I didn't just trust ChatGPT, so I also went into the literature. Kaitlyn Hair, the amazing developer of ASySD, defined a duplicate as the presence of two or more citations representing the same publication within an aggregated systematic review search result, even where those citations differ subtly in recorded details. And, a little older, Rathbone and coworkers in 2015 defined a duplicate as a reference that has the same bibliographic record irrespective of how citation details were reported. So as you see, all three definitions say something similar: the references are the same, referring to one and the same publication, while having some subtle differences. To illustrate this, I have this kind of slide throughout the workshop, with examples of potential duplicates. By using the raise-hand button in the Zoom screen, you can let me know what you think: if you think it is unique, you raise your hand; if you think it is a duplicate, you do nothing. Here's the first example. So please, if you think this is a unique paper, raise your electronic hand to indicate that, and I should be able to see that. Yay, I see some participants raising their hands.
And that's great, because indeed this is a unique paper. It has different authors, slightly different titles and different page numbers, so you can quickly see that this is a unique reference. However, this is manual work, and think of doing this by hand thousands of times: it is very time-consuming. Especially nowadays, since literature reviews in all their forms and shapes are becoming increasingly important, the increase in literature and the higher complexity of search topics result in thousands or tens of thousands of hits, meaning that more time, more labor and more resources are required for evidence synthesis. Therefore we need transparent, reproducible and efficient automated solutions to work out whether something is a duplicate or a unique reference. Here it's illustrated again with a bar chart of the exponential increase in references added to the largest biomedical database, PubMed, over the past 30 years, reaching a high of 1,768,793 references in 2022 alone. Try deduplicating those kinds of numbers by hand. Now that we know we need automated tools, it's also important to become aware of what the PRISMA-S statement says about how and what we should report regarding duplicate removal. PRISMA-S states that knowing which method is used enables readers to evaluate the process and understand to what extent these techniques may have removed false positive duplicates. I'll come back later to what a false positive specifically is and why it's important not to remove these. Authors should describe and cite any software or technique used, when applicable, and if duplicates were removed manually, authors should also include a description. Remember, while all Team DDup members describe the methods in the searches we support, we noticed that we had differences of opinion about certain aspects. Therefore, there is a need for transparency and consistency in deduplication methods. Here we see the PRISMA flow diagram, which is also where you have to report how many duplicates were removed; more specifically, it's right here in the text box that says "records removed before screening". Note there how many duplicate records were removed before screening. So to summarize this first part: we know that databases contain significant overlap, and duplicate removal is time-consuming, labor-intensive and resource-intensive, which is all further exacerbated by the increase in scientific literature each day. So we are aware of the need for duplicate removal and the growing need for automated deduplication methods. Let's continue with our search results. Are there any questions thus far? Great, let's continue. So we developed a search strategy for four different biomedical bibliographic databases, here illustrated by our Embase search strategy, to search for deduplication methods published in the literature. The search strategy, represented here in PubMed syntax, is available via GitHub. What you see is that we look for free-text words in the title, abstract and keywords: "dedup" or any synonym near/3 "review", "literature", "reference", "record" or "citation". Based on this search strategy, we got 12,603 references, excluding the 974 conference abstracts which were available in Embase.com. After deduplication using the Bramer method, 5,021 references remained. And then I have another one of these: is this a unique one or is it a duplicate? Please raise your electronic hand if you think this is a unique reference.
Look closely at the PubMed record on the left and the Web of Science record on the right. This one is actually up for debate. You can view it as a unique paper, as it is the supplementary material to the paper, but some might also remove it, as it belongs to the research history of the paper. I will tell you more about this research history of a paper later on, because it became one of the most important topics from our panel discussion during Dutch DDup Day. I didn't see many hands though, so most of you consider this a duplicate. Good to know. So we had the results from our literature search and we needed to screen them. We applied two types of screening procedures. One was Rayyan, a collaboration platform that is easy to use; it's a thumbs up or thumbs down system for inclusion or exclusion of references, and it allows for blind screening. Since we are a team of six, we also applied settings such as that every reference should be seen by two screeners, and a ranking to indicate algorithmic relevance. We learned so much by going through this ourselves. Our other procedure was ASReview. We applied ranking of potentially relevant titles and abstracts using the default settings: the feature extraction technique was TF-IDF, the classifier was naive Bayes, and we used the maximum query strategy with a dynamic resampling balance strategy. We stopped screening after 100 consecutive non-relevant abstracts were reached. With this slide I want to give a special shout out to Krista, our Team DDup member who was the only one who finished screening all titles and abstracts. Thank you, Krista. And then, after screening, I have another one of these unique-or-duplicate papers. Raise your electronic hand if you think this is a unique paper, or do nothing if you think this is a duplicate. If you look closely, this is indeed a duplicate paper; I don't see any raised hands. It's just different formatting: you see that in the page range 112-130 the "130" is just abbreviated here as "13", and there's a different formatting. Kirsten, there's one question that came through Slack: how well do deduplication tools work for preprint versus journal article versions of a paper? Sometimes the title, et cetera, might be different, but the content is the same. I'll come to that in just a little bit, so please hold on a little longer. Perfect, thank you. No worries. On top of our literature search in bibliographic databases, we did some additional searching: we Googled for unpublished methods, we looked for preprints, we did a direct GitHub search resulting in 421 results, and we looked for CRAN packages, which resulted in four different CRAN packages. Overall, based on our literature search, our GitHub search and the CRAN packages, we identified 22 literature reference deduplication methods. These methods varied from manual to fully automatic, and four of them are R based. For this workshop I will focus on one: ASySD, developed by the amazing Kaitlyn Hair. Thank you so much for developing it, Kaitlyn; it really helped us out. And to get a complete overview, I'm also wondering whether we might have missed some methods. So if possible, please go back to Menti and fill in the deduplication methods that you've used. Here's an overview of which methods we used before we started; our method of choice, for all of us, was the Bramer method.
I also used the Amsterdam Efficient Deduplication method, where we could subtract the PubMed accession numbers from the different databases. But all in all, you can see that none of these deduplication methods were automatic; they are all manual and very time-consuming, especially when working with the increasing numbers of results per search. I hope that you have some answers in the Menti already; we will collect them all. I will make the screen a bit larger and go to the next slide of the Menti, and hopefully I see some methods appear. We're just looking to see whether our overview of 22 methods is complete. Bramer, Deduklick, great. We'll save them all and analyze them later on, and hopefully we find some new methods that we can test for our research. Sorry about switching screens all the time; I hope that's not a big issue. There we go. All right, so based on our search we have 22 deduplication methods, and we're going to continue here with the R package ASySD. Before we start, I want to do another unique-or-duplicate. I'm trying to make it harder for you to see the distinction between the unique ones and the duplicate ones. Please raise your electronic hand if you think this is a unique reference, or do nothing if you think this is a duplicate reference. I don't see any raised hands, which is great, because this is indeed just a duplicate. The only difference is that the title from PubMed is in uppercase. For us, this is really easy to see, but how can we tell the computer that this uppercase title is the same as the lowercase title from Web of Science? All of the deduplication methods have standardized steps, for example the preprocessing, as in the example we just saw: normalizing the upper and lower casing of all the titles, the page numbers, the author names and the journal names, and removing punctuation or special characters, so that everything is in the same format. Once the formatting of the data is done, most of the methods use a combination of metadata fields, for example a combination of title, page numbers and author names, and check whether they are similar or not. Some, but not all, of the methods that we found use similarity scores to assess how similar certain combinations of fields are. For simplicity I will not explain these similarity scores in depth, but if you're interested, most use Jaro-Winkler or Levenshtein similarity scores. Another one: is this a unique one? Please raise your electronic hand. Or is this a duplicate one? Please do nothing. What you see here is a duplicate, because we just see the special characters which are available in the Web of Science reference but not in the PubMed reference. And now we come to limitations, because most deduplication methods have limitations. Some have size limitations: a tool might only deduplicate 50,000 references, but what if you have a search that resulted in 100,000 references? What do you do with those? Others require specific skills, limiting accessibility, and some of the identified methods are licensed or otherwise not publicly available. Some of the methods do not preserve all metadata, and most of the methods require an in-between reference manager, for example EndNote, Mendeley or Zotero. These are all limitations that we have to take into account when evaluating the different deduplication methods.
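To make the normalization and similarity-scoring idea concrete, here is a small sketch in R. This is a generic illustration, not ASySD's internal code: the example titles, the field being compared and the 0.95 threshold are all made up for the sketch, and it uses the stringdist package for Jaro-Winkler and Levenshtein similarity.

```r
# Illustration only: a generic normalize-and-compare step, not ASySD's internal code.
# The example titles and the 0.95 threshold are arbitrary values for this sketch.
library(stringdist)

normalize <- function(x) {
  x <- tolower(x)                      # fold case so "TITLE" matches "title"
  x <- gsub("[[:punct:]]", " ", x)     # strip punctuation and special characters
  x <- gsub("\\s+", " ", trimws(x))    # collapse repeated whitespace
  x
}

title_pubmed <- "EFFECTS OF EXERCISE ON DEPRESSION: A SYSTEMATIC REVIEW."
title_wos    <- "Effects of exercise on depression - a systematic review"

# Jaro-Winkler and Levenshtein similarity on the normalized titles (1 = identical)
stringsim(normalize(title_pubmed), normalize(title_wos), method = "jw")
stringsim(normalize(title_pubmed), normalize(title_wos), method = "lv")

# A simple rule: flag the pair as a potential duplicate above some threshold
stringsim(normalize(title_pubmed), normalize(title_wos), method = "jw") > 0.95
```

In practice the tools compare several fields at once (title, authors, pages, year), but the basic step is the same: normalize first, then score similarity.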
There are also technical performance measures that we took into account when evaluating these methods. We used true and false negatives and positives. To illustrate this, here is a nice overview. If a reference was correctly flagged as unique, it is a true negative. If a reference was correctly flagged as a duplicate, it is a true positive. If a reference was incorrectly flagged as unique, it is a false negative. Incorrectly flagged as a duplicate is a false positive. And it is this false positive that is especially detrimental for proper deduplication: it may remove eligible studies and therefore result in missing those studies, changing the result of your evidence synthesis. So proper deduplication removes the true duplicates and keeps the true uniques. There are other ways of evaluating technical performance: the sensitivity, the proportion of correctly identified duplicates; the specificity, the proportion of correctly identified uniques; and the accuracy, the proportion of correctly identified references in relation to the benchmark. These are all possible measures for evaluating the technical performance of a method. So in short: a false negative is actually a duplicate that we want removed, a false positive is actually a unique reference that we want to keep, and we measure sensitivity, the proportion of correct duplicates, and specificity, the proportion of correct uniques. A short sketch of these calculations follows below. Are there any questions about this technical performance evaluation? I think it will also become clearer once you start working with the actual demo. All good, Kirsten; Floor is also responding to great questions in the chat. Great, thank you so much. So we identified the duplicate removal methods and the way we'd like to analyze the data, and that's when we organized Dutch DDup Day, a national workshop to generate data on technical and user performance, with a special focus on consistency and the learning curve. The 30 participants of Dutch DDup Day were medical information specialists who regularly perform deduplication or advise researchers on deduplication. The participants were divided into five groups, and each group was assigned one of the five selected deduplication tools, among them dedupe EndNote. We had set up a small and a medium benchmark set, which are the same ones we will be using today. After the deduplication process, all participants completed a questionnaire that included the 10 questions of the validated System Usability Scale. The SUS questionnaire measures the usability of a tool and is a reliable and widely used questionnaire. It also included an additional question about the participants' feelings about the reliability of the tool and whether they had any other comments about it. And we had a great international panel discussion, with very interesting panel members calling in from all different time zones. We had Kaitlyn Hair, the developer of ASySD, who is a postdoctoral researcher in the CAMARADES team at the University of Edinburgh. We had Justin Clark, a developer of the SRA Deduplicator and a senior research information specialist at the Centre for Research in Evidence-Based Practice at Bond University, Australia. We had Beatriz, who is also joining us here today, and Puria Amini from Deduklick.
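As a small worked example of the measures just described, here is a sketch in R. The counts are made up for illustration and are not Dutch DDup Day results; the formulas follow the definitions given above.

```r
# Sketch of the performance measures described above, using made-up counts.
# TP = correctly flagged duplicates, TN = correctly kept uniques,
# FP = uniques wrongly flagged as duplicates, FN = duplicates wrongly kept.
tp <- 100; tn <- 160; fp <- 5; fn <- 10   # hypothetical numbers, not workshop results

sensitivity <- tp / (tp + fn)                    # proportion of true duplicates found
specificity <- tn / (tn + fp)                    # proportion of true uniques kept
accuracy    <- (tp + tn) / (tp + tn + fp + fn)   # proportion of all references handled correctly

round(c(sensitivity = sensitivity, specificity = specificity, accuracy = accuracy), 3)
```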
Beatriz is a librarian at the University of Bern, and Puria is the CEO of Risklick, the company behind Deduklick. We had Wichor Bramer on location that day, so it was a fully hybrid panel discussion, which was really nice. He is the developer of the Bramer method and a biomedical information specialist at the Erasmus MC in Rotterdam. And we had Gerhard Lobestal, the developer of the AMC dedupe EndNote. Gerhard started in the medical library at the Amsterdam UMC and is currently working as a software developer and project manager. Today I would like to share two of the most important insights from this panel discussion. They relate to study types and to who should perform the deduplication. When conducting a literature study, the aim is on the one hand to find the references with the highest relevance for the research question, while on the other hand the aim is to reduce the number of references to screen, so the screeners can spend their time on the actual evidence synthesis. During our first talk at EAHIL with the Team DDup members, and also during our panel discussion, we observed that manual duplicate removal resulted in different numbers of remaining references, due to different points of view regarding the contribution of the type of reference to the end result. We had discussions around whether to keep or toss preprints, errata and conference proceedings. Wichor, for example, indicated that the full-text article replaces the conference proceeding; you can always ask the authors for more details. Beatriz, on the other hand, indicated that this is something you should incorporate in your search strategy and not let it be part of your deduplication process. Although we have these differences of opinion about the different publication types, most of the false positives typically occur with conference abstracts being merged into one, and Deduklick indicated that perhaps it's better to use a conservative approach and keep all original references. Taken all together, you should always ask: how important is this publication for my review? From a researcher's point of view, it's important to keep only the most recent version and, when a reference is selected for inclusion, check what was published on this research before: check the research history of the paper. A conference paper, for example, is an earlier version and part of the research history of the publication, and this research history is only relevant when you are including that specific article; it's not relevant when you are excluding it, it just costs you more time. We also talked about the ideal situation, where you would attach all related references to the corresponding literature reference, creating a literature research history per paper. So that was one important insight from our panel discussion: building a research history of a reference to decrease the amount of time spent on screening. And this is one of the examples that we were debating both at EAHIL and during the panel discussion. On the left, in PubMed, is a preprint. One might argue that it's a duplicate, part of the research history of the actual paper in Web of Science on the right. But you could also argue, and please raise your electronic hand if you think so, that it's a unique paper: two references that are not the same, referring to two different publications.
And I do see, by the raised hands, that some of you indicate that these are indeed unique references, not duplicates building a research history. I have two of these unique-or-duplicate questions right behind each other, because this is another one that stirred the debate. It's a Cochrane review update. Would it be important to see these as unique references? Please raise your electronic hand. Or should we keep one over the other, for example keep the most recent one, so keeping publication three over publication two? I also see some differences of opinion in the raised hands. I see 12 people, 11, 10; some people haven't made up their minds yet, and that's okay. So we have 10 who say this is a unique reference, but out of the 30 participants in the chat today that means others indicate that they might be seen as duplicates. And that brings me to the other important insight from our panel discussion: who should perform the deduplication? It all depends on the point of view from which you look at deduplication. Deduklick and the AMC dedupe EndNote work like a librarian: are the literature references different? Kaitlyn and Justin work like researchers: the conference abstract and the full article mean twice the work when reviewing, so they are duplicates. There are different points of view about this, and for our research evaluating deduplication methods, we would definitely like to know who should perform the deduplication. What is your opinion: should an expert information specialist or librarian perform the deduplication, or should the researchers themselves perform it? Justin mentioned that reference management is seemingly extremely difficult, so it should be part of the specialist's role. But with the development of tools, especially well-designed tools, there may be no need for the skills of the librarian, and it could become part of the researcher's role. So I'm definitely interested in seeing what your opinion is. Oh, I see a lot of people responded with the different methods, which is great. Sorry, I hope you don't see all the different chats; there we go. And we see from today's participants that we have different points of view about who should perform the deduplication: the librarian (three votes), the researcher (two), tool-dependent (two), or someone else. Please post in the chat who you think the "other" should be, if it shouldn't be the researcher or the librarian. So to summarize this part: we had Dutch DDup Day, a national workshop with 30 participants testing five methods to assess the technical performance and the user experience, with a special focus on consistency and the learning curve. We had an amazing panel discussion, thanks again to all the panel members who were available, and we took away two important insights regarding study types and who should perform the deduplication. And now I've talked way too much; it's time to start deduplicating. You can either watch or work alongside me. For the watch part, I'm going to deduplicate the small dataset consisting of 275 references, and afterwards you can work along and deduplicate the medium dataset. I would like to ask you to please keep everything, because this could help our research in evaluating deduplication methods, and let me know if you have any problems along the way.
I'll first tell you something about our benchmark sets, because we needed to create a benchmark set, and that was kind of difficult. We were looking for published searches, first with a small number and then with a somewhat larger number of references. We wanted them to span multiple databases, consisting of health and life science databases, and multiple publication types, because of the discussion we just had, and to span a longer period of time. We came to these two searches: the small one consisting of 275 references from four databases and the medium one consisting of 6,602 references from five databases. With this slide I would like to give a special shout out to Elise and Sabrina, who manually deduplicated these benchmark sets, first independently of each other and later cross-checking them. They used a particular order of import for the small and the medium set, going from Embase and Medline to Web of Science and to CINAHL, and then they deduplicated manually in EndNote version 20, selecting author, year and title, then title, then author and year. Sabrina is also online today, so if you have any questions about the benchmark set, please let her know and she can help answer them. So this is what we have, and now we have to deduplicate these sets. There is a possibility to do this using the ASySD Shiny app, with no need to code; if you're not familiar with R, this is the right solution for you. Kaitlyn has set up this amazing tutorial. It's available as of today, 4 a.m. Amsterdam time, so all the night owls could watch it, but it will also stay online for quite some time. The app allows you to upload an XML of your references, automatically detect and remove duplicates, specify which citations to retain, and download the deduplicated citations. So please have a look at the tutorial if you're interested not in coding but in deduplicating your reference sets: have a look at the ASySD R Shiny app. For that, you do need your EndNote XML, and we're going to need the EndNote XML to work in RStudio as well. The files are available via the SURFdrive, and they should be available through the link in the chat as well. If you follow this link, you will see that there are two folders in here: the import files and the export files. The export folder is where you can put your export results, so you can help us with our research. In the import folder you will see a small dataset folder and a medium dataset folder, and we also have the script that we've been using, the R Markdown file, which can help guide you along the way. For the demo I'm going to use the small dataset, which is this dataset. You can do it two ways: you could either download the entire folder and have all files, or you could just use the XML file that's also available. For the work part, we're going to download the files in the medium dataset folder, and for the demo part I'm going to use the small dataset. Either way, it's all available in the SURFdrive, and I think all the link issues are being sorted right now in the chat. Thank you so much for the support, guys. So we have to open up EndNote. Here is my EndNote, in just a sec. And here is my EndNote. I've set it up so that each of my databases is in a separate group, and the only one I still need to import is my Cochrane file. So I'm going to go to import; I downloaded my small dataset, and here is the small Cochrane file. And what you see immediately, yeah, I have a Mac, I'm sorry for that.
Under import options there are many options, and also other filters available. Do it as you usually do, however you like, but I just want to note: be aware that there are many import filters, and I'll tell you later why that turned out to be a bit of an issue for us. Import them, select them all, and then export all of the references to the specific folder as an XML file. There you go, great. So now that we have our XML file, we can go to RStudio. In RStudio I've opened my R Markdown sheet with all the steps that we want to do. All right, we're going to start by setting our working directory. I want to set my working directory to my import file folder. Then, to install the ASySD package, we first have to install devtools so we can pull the package from GitHub. Then I pull it from GitHub and, in order to use it, always load the library. You see the screen; I can't tell if it's properly installed, so I'll pull the screen a bit larger so you can see all the way to the bottom. So what we just did was: we set our working directory to the folder, installed devtools with install.packages, pulled the ASySD package from GitHub, and loaded ASySD into RStudio. Our next step is loading our XML file. Here's the code for it, and I'll show it later in RStudio: you can change the method argument of the function that loads the search, so you can load different kinds of file formats, for example CSV files. Right, let's see. And there we have our citation data. Great, so we have our authors, our year, our journal, our DOI, and as you can see it's 275 observations: we have 275 references with 16 variables. That's great; we loaded our search. Now that we have our search, we also want to deduplicate it, and we're going to do that using dedup_citations. After deduplication, it returns two data frames: the unique citations after the automatic removal by ASySD, and the citations to be manually deduplicated. Those are available right here in the ASySD output. I see that the chat is well manned right now; thank you so much for helping me, guys. And we go back to the R Markdown sheet. So we have the citation data here. Our next step is deduplicating the citations based on the citation data, and it's now flagging potential pairs for manual deduplication. What you see in the output is that 275 citations were loaded, 105 citations were removed and 170 unique citations remain. Well, that's great; let's see what it looks like. So it's a large list of two elements: just like I told you before, we have the unique citations and we have the list of potential duplicates. To get the unique citations, we take the dedup_citations result and specifically look at the unique element. Here we see 170, which corresponds with those 170 unique citations. We see that each got a duplicate ID and a record ID, and still the same information from the authors. Now that we have our citations, there are two things we could do: we could either review them in the reference manager, which is my preference, or we could use the manual deduplication data frame and review them in RStudio. For simplicity, I'm just going to use the reference manager and show you how I like to do it.
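For reference, here is a minimal sketch of the steps just demoed. The repository location, the file name and the exact argument and element names are assumptions based on the workshop description, and they may differ between ASySD versions, so check the package documentation before running it.

```r
# Minimal sketch of the demo steps above; argument names and the names of the list
# elements returned by dedup_citations may differ between ASySD versions.
install.packages("devtools")                      # needed once, to install from GitHub
devtools::install_github("camaradesuk/ASySD")     # assumed repository location
library(ASySD)

setwd("path/to/import_files")                     # folder with the EndNote XML export

# Load the EndNote XML export; the method argument can be changed for other
# formats (e.g. "csv"), as mentioned in the demo.
citation_data <- load_search("small_dataset.xml", method = "endnote")

# Deduplicate: returns the automatically retained unique citations plus the
# pairs flagged for a human to check.
dedup_result <- dedup_citations(citation_data)

unique_citations <- dedup_result$unique           # 170 citations in the small-set demo
manual_pairs     <- dedup_result$manual_dedup     # pairs flagged for manual review
```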
But in the script, the R Markdown file, you can also see how to review your citations in RStudio if you'd like. I'm going to write the citations to a file. You can change the type of citation file, so you can write a text file, an RIS file or a different kind of format; a lot is possible. And we're going to write the unique citations to our file. Then I'm going to open EndNote again. Here is my EndNote. I've already prepped my ASySD folder and I'm going to import, going to the import files; this is the file that I just wrote, an RIS file, import all. And here too we have so many different options for importing and exporting, which became a problem during our Dutch DDup Day. Then you can review it manually in the EndNote reference manager, or if you prefer, you can also review it in RStudio; both options are available. If you're reviewing in EndNote, you can import the file, review the results and, for example, look at title; title and author; or title, author and year through the find duplicates option in EndNote. This was a very brief demonstration of how the ASySD package works in RStudio. For the work part of this workshop, I'd like to ask you to deduplicate the medium dataset. The benchmark set is available via the SURFdrive. I'd like to give you 20 minutes to deduplicate and hopefully also help us with our research by exporting everything to the export files, preferably named with today's date, an underscore, which method you used, and your initials; if you don't want to use your initials, just use ABC. If there are any questions, I will be available via the chat, but I will stop my video. Now is also the time to grab a cup of tea or coffee and take a break; that's also fine, but I will be available for answering any questions in between. Thank you, Kirsten. So far, yes, Floor is doing a great job sending everything on Slack and in the chat at the same time; thank you, Floor. There was just one comment at the end on Slack: since some of you here are probably on the Slack chat, about revtools, I was wondering if ASySD has some pros and cons over using revtools, for example. That was just the latest comment we had; if you have anything to add on this, thank you. Thank you. Yeah, we're still finishing up our analysis to see what the pros and cons are of each method. My last slide will actually indicate why we pick certain methods over other methods and why that's so dependent on different variables, for example how your search looks, what your research question is, how big your team is. So we'll get to that in a little bit, but thank you for posting that question. Sure, thank you. And actually, thank you so much, because the very first question I asked you, you went through it as well afterwards, the one on how well deduplication tools work for preprints and journal article versions and so on. So yes, mostly everything is tackled. I can't see anything on Twitter. I see a question in the chat: could you share your code again? The code is available in the SURFdrive as well. Yeah, I can see Erica is also taking over there. Thank you so much; Team DDup is the best team there is. That's very helpful. This is really it so far, thank you. Oh, he's asking again where the code is. Sorry, Kirsten. Okay, I think, yeah: SURFdrive.
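For the step of writing the retained citations out for review, a short sketch follows. The function and argument names follow the workshop's description of ASySD's write_citations call; the file path is a placeholder, so confirm the exact signature in the package help before relying on it.

```r
# Sketch of writing the retained citations out for review in a reference manager.
# Function and argument names follow the workshop description of write_citations();
# confirm the exact signature with ?write_citations.
write_citations(unique_citations,
                type     = "txt",                                        # "ris" is also possible, per the demo
                filename = "path/to/export_files/unique_citations.txt")  # placeholder path

# The resulting file can then be imported into EndNote (or another reference manager)
# and checked with its find-duplicates options, e.g. on title, title + author, or
# title + author + year, as shown in the demo.
```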
Yeah, and in the import folder. Yeah, this is where, yeah. Thank you. There's a question that just came in: why are there several files in the medium folder? There are also RIS files available: if you want to use your personalized import filters for EndNote, you're able to do that, but you could also just go ahead with the XML file and use that directly as the starting point for the RStudio ASySD package; you don't have to make the XML yourself. Great, thank you. Thank you for your question, Valentina: if you have several sources, how do you combine them before deduplicating? That's actually the issue with the in-between reference manager; we always need the in-between reference manager, as I demonstrated when I made different groups and then selected all 275 references and exported them. That's how I combined Scopus and Web of Science in one file. I hope that helps. The XML file is an XML export from EndNote, not from my library of interest. A general comment from Joseph: in my experience, using revtools and the function find_duplicates works really well, but there is still a need to check that two references tagged as duplicates are indeed duplicates and not uniques; do you have any tips on how to automate that second step? Well, Joseph, we're kind of still working on that. I'll show in the data analysis of Dutch DDup Day that this is exactly what we ran into: there is a constant need to manually check whether two references are duplicates or not, and it's really difficult to tell using code whether or not something is a duplicate without, for example, merging two unique references together. That hopefully answers your question, and as always, I'll also be available afterwards to talk and chat about this a bit longer. Oh, I think there's another question from Valentina: if you use several sources, how do you combine them before deduplicating? For example, I often use Scopus and Web of Science and I find it difficult to merge the two sources. And then, yes, but I actually didn't get the answer: did you merge the searches in EndNote? So yes, there were a few questions from Valentina here as well. Oh, I see, yeah, they're collapsed. The answer to Valentina's question was that I used the in-between reference manager, EndNote. In EndNote I have different groups, so a group for Scopus and a group for Web of Science, and in the end I select all references, including Scopus and Web of Science, and export them as one file. I hope that answers your question. And Kirsten, I actually have a question: do you deduplicate them when you have them in EndNote, before you take them out? Do you deduplicate them, let's say, in GitHub? No, no, I don't deduplicate in EndNote; I use ASySD to deduplicate, or whichever method I'm testing at the moment. EndNote is just there to get one file including all references from the different databases. Okay, great. So you use RStudio for that. Yes. And do you find it more efficient to do that in RStudio than with what EndNote can detect? Because EndNote still cannot detect all duplicates. That's true. There are different options to work in EndNote, but EndNote is a licensed product, which is great when it's available to you, but many people are not able to access EndNote. So we were also looking into different methods for how to work with references.
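Since the question of combining sources directly in R came up (and is mentioned below as something the team is still exploring), here is a speculative sketch of one way it could look. This was not part of the demo: it assumes each database export can be read with ASySD's load_search and that the resulting data frames share the same columns; the file names and the "source" label are hypothetical.

```r
# Exploratory sketch (not demoed): combining several database exports in R instead of
# merging them in EndNote first. Assumes load_search() can read each export and that
# the resulting data frames have compatible columns; file names are placeholders.
library(ASySD)
library(dplyr)

scopus <- load_search("scopus_export.csv", method = "csv") %>%
  mutate(source = "Scopus")
wos    <- load_search("wos_export.csv", method = "csv") %>%
  mutate(source = "Web of Science")

combined     <- bind_rows(scopus, wos)   # one citation table covering both databases
dedup_result <- dedup_citations(combined)
```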
Also, we were looking at larger datasets and we wanted a method that could deduplicate really large numbers of references, and we're taking note of all the pros and cons of all the different methods that we're looking at. For this conference, the R package ASySD is most suited, because this is an evidence synthesis and meta-analysis in R conference. We are working on a manuscript that looks at more than just the R-based packages: we're trying to find as many deduplication methods as possible and get a comprehensive overview of all the pros and cons of each method. Great, thank you. Benjamin, thank you for your question. You say that you cannot seem to find the write_citations function. This is a function that is available through the ASySD package, so if you were able to deduplicate the citations and load the search, you can also call the write_citations function; it is in the ASySD package. Oh, sorry, there was also a follow-up question from Valentina: are you aware of an easy way to do it in R, because, as you said, EndNote is licensed software? We're currently exploring that option, to see if it's possible to import from different databases into R or other open access options. I'll keep you posted, and keep an eye out for our manuscript, which we will hopefully be submitting in the summer. Thank you, Kirsten. I think Erica is also referring to the same problem as Benjamin, as she's having the same problem, or maybe it's the XML file; I'm a bit confused because there were a few things. All right, let me open up RStudio, let's go back and see what the problem is. Thank you. Okay. And if you want to know more while you're in RStudio and you're not quite sure which arguments to use, you can always look at the package documentation, look up ASySD and check write_citations. This is the one that we want: it takes your unique_citations, the type of citations and the file name to write to. So those are the arguments that you give it: write_citations, parentheses open, unique_citations, type (I'm going for TXT, but you could also use RIS), and I give it the file path to where I would like it, and now it has written my citations. I'm just going to go back to the PowerPoint and check in the chat how you are doing and whether that helped. Oh, it was the default. Okay, sorry about that. Oh, great, Kaitlyn is now also looking at Slack, so if you have any ASySD-specific questions, she can help you out as well. Thank you for joining us, Kaitlyn. Benjamin asked: how is the threshold set in ASySD? ASySD goes through multiple blocking rounds using a similarity score. If you'd like to know the exact similarity score per round and which metadata fields are incorporated in each blocking round, please have a look at Kaitlyn's paper, Hair et al.; it's published on bioRxiv and I'll have it in the references at the end of the presentation as well. Thank you, Kirsten. Just a question as well; I think you responded to a similar question earlier, but I probably missed it. When we talk about the XML file, it doesn't have to be an EndNote XML, correct? Well, we are still working on that, because during Dutch DDup Day we tested using different XML files, from RefWorks for example.
I think we tested Mendeley as an option too, but we noticed that the different deduplication methods we tested during Dutch DDup Day were not compatible with reference managers other than EndNote. So we're still exploring that. Okay, thank you; so for now it is an EndNote XML file. Yeah. Oh, there's another question, from Wolfgang, whether the slides will become publicly available. I will also put a PDF of the PowerPoint on the SURFdrive afterwards, yes. Thank you. So for the last five minutes, I'm just going to grab a cup of tea and turn off my video for a bit, and then you can post your questions either in Slack or via the chat; the other Team DDup members who are online will definitely help you, and otherwise I'll help you when I get back. Great, thank you, Kirsten. I think that most of you are done with the deduplication. If not, you can always finish it afterwards; please help us with our research by uploading your export files to the SURFdrive folder. There is also a question in the chat about the results from ASySD, and that is what I want to continue with: the preliminary ASySD results from Dutch DDup Day. We deduplicated the small and medium benchmark sets during Dutch DDup Day, and what you see is that we had different numbers of results. After manual review, most of the sets had 107 duplicates remaining; some had 106, which could be due to differences of opinion about something during the manual review. Also for the deduplication of the medium benchmark set we had some inconsistencies: the duplicates were not consistent across the participants. But our first impression during Dutch DDup Day was very positive, and ASySD received a SUS score of 7.3 out of 10, which was great. And this is what our results look like. This is an EndNote screenshot where, in the Custom 1 field, we put all the different Dutch DDup Day participants and, in the Caption field, we inserted which method was used to deduplicate each of the references. We see our data, and we also see the interesting part: the enormous impact that personalized import and export filters have on the results. As I showed you during the demonstration, there are so many options available, so many personalization options. Even though we used the same deduplication method on the same benchmark sets, we see slight differences in formatting between the different participants. We are relatively new to the R research field, but we quickly figured out that developing an R script to analyze the Dutch DDup Day results was essential. Would it really be possible to go through all the results manually? It would be very time-consuming, and especially when you look forward to larger datasets, we have to move over to an R script that can help us work through this data. This is a work in progress and we definitely need help. Thank you, Erika, here's your shout out; thank you for getting the R script to work. But here's the aim of what we wanted to do: we wanted to have all of these different Dutch DDup Day data in one file, and we replaced the Caption field with the method used. So for ASySD, we replaced the Caption field with "ASySD", and we had the benchmark set in the Caption field as well. And what we wanted to do is work with the title, the author and the Caption field.
Those were our most important fields. As you can see, we wanted to group similar title and author combinations together while retaining this Caption field, so we would be able to count how many times a certain title-author combination was seen by a certain method; a small sketch of that grouping idea follows below. Using this, we could evaluate the technical performance and calculate the true negatives, the false negatives, the false positives and the true positives. We have set up a script, but we ran into these formatting differences, so that the title and author combinations weren't a 100% match. A potential solution, based on both the ASySD package and the AMC dedupe EndNote method, is to label clusters of duplicates based on a similarity score. But what is the similarity score sweet spot, what is that threshold? We were just discussing this in the chat as well: what is the threshold to say, well, this is a unique one and this is a duplicate one? So we're still exploring this script, we're still working on it, and we definitely hope that one of you can help us with the coding. If you're looking for a challenge, please contact us; our contact information is on the last slide. We really want to use an R script to evaluate the technical performance and get these numbers out of the data that we have. Then, for future deduplication strategies: what is our method of choice? Well, for deciding on a method of choice, we took the following key points into account. Your deduplication method can depend on the research question that you have. It may depend on a required publication type: a search that delivers many conference abstracts gives higher false positive numbers, and if a particular method is known for having a higher false positive rate on conference abstracts, then that's not the method you want to pick if you're specifically looking for those conference abstracts. But also look at availability; we just had this conversation in the chat about EndNote being licensed, and if you're not able to use a licensed product, what open access products are available? We should also consider the downstream analysis and interoperability, for example when you're screening the data or extracting data; some of the methods require skills that you don't always have. All of these key points should be taken into account when you're deciding which deduplication strategy to apply to your search results. Also a big thing is: are you going to go manual or fully automated? It can also depend on the size of your team, how you can divide your references, but also the size of the dataset to be deduplicated: a small one might be quick to just check manually. For now, we each have our method of choice from before and after we started the Dutch DDup Day project, and some of us are not ready to make a decision yet and will watch the developments that occur in this amazing field. For all of us, we want to start the discussion about which deduplication methods to pick, so that we have transparent and reproducible deduplication methods. And it's important to understand that, for the deduplications where we had missing data, we want to find out whether, if it was an eligible study, we can still find the full text for the missing publications. These are points on this slide that we're going to explore further: look at the consistency.
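To illustrate the grouping idea mentioned above, here is a small sketch in R. It is not the team's actual analysis script: the toy data frame, its column names (title, author, caption) and the normalization step are made up for the example, following the description of one row per retained reference per participant with the Caption field holding the method label.

```r
# Illustrative sketch of the grouping idea described above, not the actual analysis
# script. The toy data and column names are placeholders for this example.
library(dplyr)
library(stringr)

results <- tibble::tribble(
  ~title,                      ~author,     ~caption,
  "Exercise and depression.",  "Smith J",   "ASySD",
  "Exercise and Depression",   "Smith, J.", "ASySD",
  "Exercise and depression.",  "Smith J",   "Bramer"
)

normalize <- function(x) str_squish(str_to_lower(str_replace_all(x, "[[:punct:]]", " ")))

counts <- results %>%
  mutate(title = normalize(title), author = normalize(author)) %>%  # reduce formatting differences
  group_by(title, author, caption) %>%       # same title-author combination, per method
  summarise(times_seen = n(), .groups = "drop")

counts
# Comparing these counts against the manually verified benchmark status of each
# title-author combination then yields the true/false positive and negative counts.
```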
We found some inconsistencies during our Dutch DDup Day workshop, possibly related to all the personalized import and export filters. We're going to explore the research history of papers: would it be possible to merge all the relevant information for one particular reference together? We have to take into account that the number of results keeps increasing: what kind of options do we have? And we're looking at manual versus semi-automated. So it all comes down to the upcoming research: we're going to look on Google Scholar for more methods, we're going to use your input from the Menti if you use other methods, we're going to update our literature research, we're going to analyze the user experience and finish the analysis of the workshop results, hopefully also finish the analysis of the false positives, and hopefully contribute to the development of the ASySD R package. If you want to know more, there is more research going on in deduplication: there is the paper from Kaitlyn Hair about ASySD, available via bioRxiv; McKeown is also looking into different deduplication methods; and there is the Rathbone paper that I cited in the beginning as well. So this is all additional deduplication reading. There are still some questions in the chat, but please, if you want to help us out with our technical performance script, let us know; we are so happy to talk and are looking forward to collaborations. We have a manuscript that we're working on, hopefully submitting in the summer, in which we're trying to get a comprehensive overview of all the pros and cons of the different deduplication methods that are out there. Please save your workshop output to our SURFdrive. And a big thank you to all of you who participated in this workshop. Thank you so much. Thank you so much, Kirsten. There are a few conversations happening; Floor and Erica are taking over in the chat. There was just a comment on Slack asking whether the authors have compared ASySD with Zotero, but I do not think you will tackle this during this session. The question was: did the authors compare their ASySD with Zotero yet? ASySD was compared to other methods; it's in the reference right here, the Automated Systematic Search Deduplicator paper compares it to different methods, and we're trying to compare it to even more methods. Great, thank you. Yeah, I thought it wasn't demoed, but there's a reference for it if you'd like more information on this. Yes, I can still see a few things in the chat, and Floor and Erica are helping with those. Nothing for me on the Slack channel or Twitter at the moment either, because Kaitlyn and everyone on Slack are very active and responding to all questions and comments. Thank you all so much. Great, so again, if you have any further questions, please feel free to join the Slack channel whose link Floor shared just earlier; thank you so much, Floor, for sharing it. If you have more questions after this session today, please feel free to join that Slack channel, where you can ask questions of Kirsten, Floor, Erika and Kaitlyn. And thank you so much, really, Kirsten, for your presentation; your thorough presentation was really fantastic. We hope everyone enjoyed it as much as we did.
And thank you so much, Erica and Kaitlyn as well, and Floor, for your tremendous help today with moderating the questions in the chat and over the Slack channel. That was really, really helpful. Thank you.