Great. Well, I am Maryam Zaringhalam, and I'm really excited to talk to you all about a reproducibility workshop that we hosted at the National Library of Medicine, where Lisa Federer, who is a tiny box in the corner of your screen right now, and I currently work. The idea for this workshop came from a personal experience in my graduate career. Before I came to NLM, I was getting my PhD in molecular biology and bioinformatics, and during the course of that I had the unusual privilege of being scooped five times, which I think is some sort of record. But it did give me a really nifty opportunity to do a sort of meta-analysis across the papers that had scooped me, to see how well their results actually agreed with one another. By digging into their data and their code, I got to learn a lot about bioinformatics, data analysis, and data sharing and reuse. So when I came to the National Library of Medicine, I thought, wouldn't it be great if we could have intramural researchers across NIH engage in a similar sort of experiment, trying to reproduce papers from the bioinformatics literature?

And it seems like I have lost Lisa and the slides. All right, let's wait a few seconds. The slides aren't strictly necessary for this part of the talk; they're more decorative, so I can keep going until Lisa is able to reconnect.

So the idea was to host a reproducibility workshop at NIH to understand what researchers can learn by actually engaging in the process of reproducing a published piece of literature. The goal of the workshop was threefold. First, we wanted to help researchers develop an understanding of the tools that might facilitate reproducible research, along with NLM data resources for bioinformatics. Second, we wanted to help them understand how to incorporate these tools into their own research practices. And finally, we wanted to give them a path toward some sort of deliverable, in the form of an executable notebook or a publication documenting their reproduction attempt during the course of the workshop.

We also had a slightly sneaky goal for ourselves: as we put the workshop together, we wanted to think through what a curriculum around reproducibility might actually look like. How can we develop a hands-on curriculum to teach reproducible research practices? How are researchers actually approaching reproducibility? Over the course of this three-day workshop, we got to see how researchers tried to undertake a reproduction. And we also wanted to identify some low-hanging fruit that might help us promote reproducible research practices, both while researchers are in the process of doing their work and once they've published it, so that their work is maximally reproducible.

And Lisa still seems to not be back yet; her computer might have frozen. Okay, so I'll tell you now about the structure of our workshop. It took place over three days, and we ran two iterations, one in the spring of 2019 and one later in the fall. For each three-day workshop we selected 25 intramural NIH researchers, that is, researchers who work on NIH's campus.
And we put them into five teams of five to reproduce a bioinformatics paper, making sure that the underlying data were available in an NLM-hosted repository, so GenBank, SRA, and a few others. On the first day of the workshop, we had an opening keynote lecture giving a primer on open science and reproducibility, to set up the motivation for the workshop, and then three 30-minute tutorials: executable notebooks, specifically Jupyter notebooks; version control with Git and GitHub; and containerization with Docker. On the second and third days, the teams worked in their groups, codathon style, on reproducing the results of those bioinformatics papers. And at the end of the workshop, we had them share out whatever they were actually able to reproduce.

Oh, perfect! This is perfect timing. I was about to pass it over to you. I'm going to try to share my screen so we can have the slides; hopefully this won't be disastrous. We're about to be on slide six.

So at the end of the workshop, we had them share what they were actually able to reproduce and what roadblocks they experienced, to help us learn how to chart a path forward toward more reproducible research practices. And now I'm going to turn it over to Lisa to share some of the takeaways that came from that share-out.

All right. So hi, I'm Lisa Federer. I'm the NLM Data Science and Open Science Librarian, and as you can tell, I have a not very good internet connection at home. I think one of the most striking takeaways was that, out of all of the papers we looked at over the two workshops, not a single team was able to successfully reproduce a paper. And we're talking about NIH scientists, very accomplished people who, if these papers could be reproduced, you would expect could reproduce them. Teams had different pieces that they could reproduce, maybe a particular figure, or a finding but with different data, but in terms of actually going through the papers and reproducing them, none of the teams were able to do so. Looking at why they were not able to was informative about some of the challenges to reproducibility, and I think we found very much that reproducibility is not trivial. Quite a lot has to go into a paper, and into its preparation, to ensure that it is truly reproducible.

One of the issues we ran into quite a lot was missing underlying data. Maybe only part of the data was there, or the data was not where the paper said it was; for whatever reason, despite the fact that these papers were supposed to have their data available, the workshop participants were not able to get to it. Another issue was missing software and tools. Some papers did a good job of documenting what software they used for their analysis. Others would say things like "we used an in-house script" and not share that code, so it would be difficult to reproduce the analysis unless you were able to reverse engineer the code and write it yourself. Other times, even when they did note the software and tools they used, the description was not adequate to allow someone to reproduce the analysis. For example, they might say they used Python and a specific package, but not give the version. All right.
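To make that last point concrete, here is a minimal sketch, not part of the workshop materials, of one way an author could record the interpreter and package versions behind an analysis so readers are not left guessing. The package names listed are placeholders, not something taken from the papers discussed here.

```python
# Minimal sketch: write out the exact interpreter and package versions used in
# an analysis. The package names below are placeholders; substitute whatever
# the analysis actually imports.
import sys
from importlib import metadata

PACKAGES = ["numpy", "pandas"]  # hypothetical dependencies of the analysis

with open("environment-versions.txt", "w") as fh:
    fh.write(f"# python {sys.version.split()[0]}\n")
    for name in PACKAGES:
        try:
            fh.write(f"{name}=={metadata.version(name)}\n")
        except metadata.PackageNotFoundError:
            fh.write(f"# {name} not installed in this environment\n")
```

Tools like `pip freeze` or a container recipe capture the same information more completely; the point is simply that the exact versions end up somewhere in the published record.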
We now know that the internet is horrible. I guess I can take over from there, and if you're looking at your slides, I'll continue.

So another issue folks had was that some of the workflows were inadequately described or difficult to follow. They didn't necessarily make note of, oh, is Lisa back? No, I'm still sharing. Okay. They didn't necessarily make note of which data were supposed to go in where in the analysis pipeline. So people were trying to map software and tools onto what they thought a given data set might be used for, left trying to reverse engineer the workflow without really having a map. So if you could go to slide nine. Back one. And back another. There we go.

Given some of the issues researchers had working on these reproductions, a bit of low-hanging fruit is simply better minimum standards for peer review. Peer reviewers could check that underlying raw data are actually made readily available, and that when a paper says data are in the supplement, the supplement is actually there. They can also make sure that all software and tools are listed with the appropriate version, so that researchers trying to recreate the computational environment aren't taking shots in the dark. And they can make sure the underlying analysis code is made readily available, so people aren't doing a whole bunch of reverse engineering work and keeping their fingers crossed that they did what the original researchers actually did. Next slide, please. Lisa, do you want to take this?

Yeah, I think I'm back. I halfway talked through this slide before realizing that I wasn't connected anymore. So one of the things we found is that there's not a single definition of reproducibility, and there are a lot of different ways to interpret it. One of the things some groups ran into was raw versus processed data. There was one group that got a whole day and a half into their reproduction before they realized they were using the raw data when they thought they were using the processed data. And of course, now the dog is going to start barking; this is delightful. So there's understanding what you're using, and deciding what's more valuable to share, the raw or the processed data. There's also the idea of reusing scripts versus reengineering them: it's obviously easier to reuse a script somebody has already written, but in some cases that was not an option for these teams because those scripts weren't shared, and so they had to reengineer them. And there's the question of recreating the computing environment, and what counts as a close enough environment. You're probably never going to have the exact same computing setup as the group that originally ran the analysis, so what is close enough, and what potential solutions, like containers, can get you a little bit closer to the original computing environment and increase the possibility of reproducing the paper?
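On that "close enough environment" question, here is a rough illustration, again an assumption rather than anything the teams actually used: comparing the locally installed package versions against a recorded manifest, such as the hypothetical environment-versions.txt sketched earlier, gives a quick read on how far a local setup has drifted from the one behind the paper.

```python
# Rough sketch: compare installed package versions against a recorded manifest
# (e.g. the hypothetical environment-versions.txt above) to see how far a local
# environment has drifted from the one used for the original analysis.
from importlib import metadata

def check_environment(manifest_path="environment-versions.txt"):
    mismatches = []
    with open(manifest_path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "==" not in line:
                continue  # skip comments, blank lines, and unparseable lines
            name, wanted = line.split("==", 1)
            try:
                installed = metadata.version(name)
            except metadata.PackageNotFoundError:
                installed = "not installed"
            if installed != wanted:
                mismatches.append((name, wanted, installed))
    return mismatches

for name, wanted, installed in check_environment():
    print(f"{name}: paper recorded {wanted}, this environment has {installed}")
```

Containers go further by also freezing the operating system and system libraries, which is part of why Docker was one of the tutorial topics on day one.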
We also ran into the question of what the bar for reproducibility really is. As I said, one of the teams was able to regenerate the figures at least, even though they couldn't reproduce the whole paper, and other teams were able to regenerate the general conclusions, even if it wasn't through the same route. So what are we really talking about when we say a paper has been reproduced? Next slide.

So I think what we found is that clarity and community consensus around expectations for reproducibility could go a long way, as Maryam probably said while I was kicked off. We found that there was a lot that could be done in the peer review process that would solve some of these problems, or at least help authors understand what they need to include, and potentially flag some of those things early on in the process. Next.

Another thing I think we found really interesting was that there was actually some pretty open communication between the original authors and the teams trying to reproduce the papers. A couple of teams did reach out to the corresponding authors for data or with questions about the methods, and in several cases the authors responded within hours; there was someone who I think was in London who responded even though it was nighttime there. So I think what that tells us is that it's not bad faith, it's not that people don't want to make their work reproducible; in many cases, people just don't quite know how. So having some of these guidelines, and having some of this oversight in peer review, could help with that. Next.

So that concludes our talk. Thank you for bearing with my internet challenges, and we'll end there. Thanks.

Thank you so much to the both of you. I think we might have time for a question, so I'm going to go with the first one: what kinds of guidance did you provide to teams for choosing a paper to reproduce?

So, on that question, I actually was the one who found the papers to reproduce. I did that by making sure that data were made available in one of NLM's repositories, which I checked just by searching each paper for a GenBank accession number or an SRA accession number, and there were a few other repositories I also looked out for. Then I made sure they were actually bioinformatics papers, so the teams wouldn't have to go and do wet lab work, since that certainly isn't feasible in a three-day period. And I scanned through the papers to make sure each looked like something where a team could at least make some headway in reproducing in the span of three days. And even with those checks, it turned out that they were not so trivial to reproduce after all.

Cool. Well, thank you so much to both of you.
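As an aside on the screening step described in that last answer, here is one way such a data-availability check could be scripted. This is purely illustrative, not the process actually used for the workshop, and the accession shown is just an example; it queries NCBI's public E-utilities, where db="nuccore" covers GenBank nucleotide records and db="sra" covers the Sequence Read Archive.

```python
# Illustrative only: check whether an accession number cited in a paper resolves
# to at least one record in an NLM/NCBI repository, via the public E-utilities.
import json
from urllib.parse import urlencode
from urllib.request import urlopen

ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def accession_exists(accession: str, db: str) -> bool:
    """Return True if the accession matches at least one record in the database."""
    params = urlencode({"db": db, "term": accession, "retmode": "json"})
    with urlopen(f"{ESEARCH}?{params}") as resp:
        result = json.load(resp)
    return int(result["esearchresult"]["count"]) > 0

# Example with a placeholder accession; "nuccore" is the GenBank nucleotide
# database, "sra" is the Sequence Read Archive.
print(accession_exists("NC_045512.2", db="nuccore"))
```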