Welcome back to the Riffomonas Reproducible Research Tutorial Series. If you missed the introductory tutorial, please go back and check it out to learn the background behind the series and to familiarize yourself with our goals and where we're headed. As we go through these tutorials, there will be several discussion items that you should talk over at lunch, at your lab meeting, or individually with your research group and your PI. For many of these questions, there are no "right" answers. The concepts we will cover in these tutorials are difficult; that's part of the reason there is a problem with reproducibility in science. Hopefully, you can have a good discussion of these questions within your research group. There will also be a number of exercises for you to work on where there is a correct answer. For these, remember that you can always hit the P key on your keyboard to pull up the presenter notes, which is where you'll find the answers. Also, when these come up, you should pause the video to give yourself a few moments to think about the answer and to explore the materials further. In future tutorials, there will be hands-on coding exercises that we'll work on together. Of course, you will want to do each of these activities to build your own background and expertise in handling the issues that surround reproducibility. At the end of the tutorial series, there will be an opportunity for you to document your progress and receive a virtual badge and certification that you can use to document your participation.

As I mentioned at the end of the last tutorial, today's session is appropriate both for those doing the analysis and for those whose job is primarily to supervise scientists in their analyses. So let's go into today's material, which you can find in the Issues in Reproducible Research tutorial at riffomonas.org. Wonderful. Let's go ahead and pull up the slides for this session's tutorial. Remember, we can get to the slides by going to riffomonas.org and clicking on Training Modules, followed by Reproducible Research. This pulls up the reproducible research tutorial series page, where along the left side of the screen is the listing of the various tutorials. Today we're going to focus on the session Issues in Reproducible Research, so go ahead and click on that.

Very good. For today's tutorial, we are going to focus on several learning goals. The first is to discuss the origins of what's being called the "reproducibility crisis." We want to differentiate between reproducibility and replicability. We're going to identify the pressure points in making our work reproducible. And then we're going to appraise ongoing and published research products for hallmarks of reproducible research. But before we go any further, I want to confirm that you've already taken a look at the Collins and Tabak editorial in Nature, the Casadevall et al. editorial in mBio, and the editorial by Ravel and Wommack that was published in Microbiome. They're short, but they're important takes on where we are in science in general and in microbiology in particular.

So perhaps you're familiar with the satirical Journal of Irreproducible Results. When I was working in the lab more, we frequently joked about people whose theses would wind up in this journal. We would do great things in the lab and get really excited about them, but before we told our PI, we'd say, well, let's just do it one more time.
And sure enough, it wouldn't work out: we had forgotten some control or some reagent, and the work just didn't reproduce. The goal of this series of tutorials is really to keep anyone from thinking that their work belongs in the Journal of Irreproducible Results. We want all of our work to be reproducible, because we want others to be able to follow up on what we're doing so that we can move science forward.

Two definitions that we're going to build from in this series for reproducibility and replicability come from Jeff Leek and Roger Peng's editorial in PNAS, published in 2015. They define reproducibility as "the ability to recompute data analytic results given an observed data set and knowledge of the data analysis pipeline." So if I know how you analyzed the data, and I have your data, can I get the same result? Replicability, in contrast, is "the chance that an independent experiment targeting the same scientific question will produce a consistent result." So if somebody else does the experiment using our same general approach, do they get the same results? But beware: depending on who's talking, these definitions might be flipped, so always be careful about how people are defining things. I find this is similar to how many people talk about diversity in the microbiome literature; I always wonder, did they actually measure diversity, or do they mean richness, or what exactly do they mean? The same is true when thinking about reproducibility and replicability.

Another framework that I like comes from a talk by Kirstie Whitaker that she posted on figshare, which builds off of what Leek and Peng described. Using the same or different code, and the same or different data, you can think of research as being reproducible, replicable, robust, or generalizable. I like this framework a lot because it really helps me to think about and contextualize different attempts to validate other people's research, and I've built upon it a little bit. Instead of thinking about the same or different data, I think about the same or different populations or systems. Is the work done in mice versus humans? Is it done in different strains of mice? In different cohorts of humans? Using the same or different cell lines? Along the rows, we can think about the same or different methods. Did they use 16S rRNA gene sequencing? Metatranscriptomics? Metabolomics? Culturing? From that, if you use the same population or system and the same methods, then we would hope that the work would be reproducible. In contrast, if you use a different population or system (say, instead of using people from Michigan in your study, you use people from Korea), is the work replicable? Similarly, if we use multiple methods to triangulate on a result to increase the robustness of that result, we can then do that in different populations or different systems to test whether our result is generalizable.

So, as a pop quiz, what I'd like you to do is decide whether each of the scenarios I'm going to flash up here is an example of reproducibility or replicability. If you want, you can hit the pause button to think about each one for a minute before getting the answer from me. First: a new person joins your lab and tries to repeat a previous lab member's experiments.
That is an example of reproducibility. You download data from another lab, follow their methods, and try to regenerate their results. That also is an example of reproducibility. Next: you rerun the mouse model that you ran in a previous paper, but using a different strain. This, in contrast, is an example of replicability. Someone performs your study using a cohort of subjects from Korea. That is an example of replicability, or perhaps generalizability. Finally, a colleague asks for your raw data and any scripts you may have so they can repeat your earlier analysis. This is similar to the second question and is an example of reproducibility.

So something we might think about is: what are the threats to reproducibility? What are the things that stand in the way of someone else trying to reproduce our work? What I'd like you to do is pause the video and jot down a list of bullet points, different things that you think can limit the ability of another person to reproduce your work or limit your ability to reproduce another person's work. So go ahead and hit pause and take as much time as you need to generate that list.

Here is the list of things that I came up with. Again, this is not the perfect list; it's not exhaustive, and it's perhaps particular to what I was thinking about when I came up with it. We have many problems with data not being available and with methods sections being incomplete. Different operating systems behave differently, and sometimes a program is available on one operating system but not another. We know that our software and databases evolve. The software package I developed is on version 40; it has evolved a lot over those 40 versions, and so the results you might get from version 1 and version 40 might be a little bit different. There's also the problem of methods rabbit holes, where you read a paper and they say, we used this method. So you go to the paper describing that method, and they say, well, this method is a mash-up of two other methods. And so you just keep going down and down the citation ladder, and as you go further and further down, you start to question what exactly those original researchers did. There's the issue of random number generators. A lot of times in bioinformatics we're doing things like bootstrapping or Monte Carlo simulations to calculate p-values, and those use random number generators, so we might get slightly different results each time we run the analysis (there's a short R sketch of this issue at the end of this section). Similar to the problem of short methods sections and rabbit holes, there are frequently details missing from a protocol. There's the availability of software: we know that many times people will say that the code is "available upon request." There's the problem of custom fill-in-the-blank scripts, where you fill in the blank with a programming language. If I did my analysis using custom Pascal scripts, but nobody knows Pascal, how reproducible is that going to be? There's also the problem of link rot, where a URL or email address is no longer accessible. You go to contact the original researchers for the data or the code, and you get an error message back saying, sorry, that email address is no longer valid.

Next, let's think about some threats to replicability: again, the ability to get the same result using a different population or system.
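Before you make your own list, here is a minimal R sketch of the random number generator issue raised above. The data and the seed value are made up for illustration; the point is simply that an unseeded bootstrap gives slightly different results on every run, while recording and setting a seed makes the computation exactly repeatable.

```r
# Bootstrap confidence interval for a mean, using a small made-up data set
observed <- c(5.1, 4.8, 6.2, 5.5, 4.9, 6.0, 5.3, 5.8)

bootstrap_means <- function(x, n_iterations = 10000) {
  replicate(n_iterations, mean(sample(x, length(x), replace = TRUE)))
}

# Two unseeded runs: the intervals will differ slightly from each other
quantile(bootstrap_means(observed), c(0.025, 0.975))
quantile(bootstrap_means(observed), c(0.025, 0.975))

# Seeded runs: anyone who knows the seed can recompute the identical interval
set.seed(19760620)
quantile(bootstrap_means(observed), c(0.025, 0.975))

set.seed(19760620)
quantile(bootstrap_means(observed), c(0.025, 0.975))
```

If the seed is reported along with the code, a reader can recompute the same p-value or confidence interval down to the last digit.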
So again, take a few minutes and jot down several ideas for things that might limit our ability to replicate another group's research.

Again, here is my own list: not the perfect list, not exhaustive, but a list of things that I think a lot about in terms of replicability. First, it might not be a real effect. The first study that found the result may have been mistaken; it may have been reproducible, but it might not have been a real effect. There might be problems with the mathematical or statistical models we're using that overfit the data in the original study and then just can't be validated in a replication study (there's a short R sketch of this at the end of this section). There's the problem of poor experimental design, where things aren't controlled for that needed to be. There are problems with contaminated or mistaken reagents: we think we have a particular cell line or a specific strain of bacteria, only to find out later that we were sent the wrong strains. There are problems with confounding variables, variables that we're just not controlling for because, for some reason, we don't think about them. There are differences in sex: we might do an experiment using male mice, someone else might do it using female mice, and we might get different results because of that. The same thing might happen with differences in age or differences in mouse genotype. Those factors of sex, age, and genotype are actually pretty interesting; they raise other biological questions that we might want to follow up on. There might also be differences in reagents, populations, environments, or storage conditions. There's also the problem of sloppiness, where the original researchers may not have been thinking thoroughly about what they were doing; we might also think here about poor laboratory skills, contaminated reagents again, things like that. There's also the problem of selection bias: we see a result in the literature, and we want to test it with our own population, our own system. What we don't know is that 20 other people have tried the same thing and have failed to replicate it as well. That one study stands out because it was a huge result, but the fact is that it stands out because nobody published all the negative studies. Then there's the problem of experimental bias, where we find a result and now go looking for further proof of that result rather than trying to find evidence against it. And of course, there's always the concern about fraud and scientific misconduct.

So I hope you might see yourself in, or have experience with, some of these bullet points of threats to reproducibility and replicability. The point I want to make is that this stuff is really hard. Science is hard. It is difficult to do reproducible research that others can replicate. There's no getting around it: it is hard.

This brings us to the editorial by Casadevall and his colleagues questioning the quality of research in the biological sciences, and specifically the problems of reproducibility in microbiology. It grew out of an American Academy of Microbiology report from a colloquium they held discussing problems in the sciences and thinking about reproducible and replicable research. They focused on three main causes for the lack of reproducibility: sloppy science, selection and experimental bias, and misconduct. We had those, or at least I had those, in my list.
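As an aside, here is a minimal R sketch of the overfitting threat mentioned in that list. The data are simulated for illustration: both the original and the replication data sets come from the same simple linear relationship, but a model with far too many parameters fits the original data beautifully and then falls apart on the replication data.

```r
set.seed(20240101)

# Simulate an "original study": a simple linear relationship plus noise
original <- data.frame(x = 1:10)
original$y <- 2 * original$x + rnorm(10, sd = 3)

simple  <- lm(y ~ x, data = original)           # one slope parameter
complex <- lm(y ~ poly(x, 8), data = original)  # eight polynomial terms

# The complex model looks far better on the original data...
summary(simple)$r.squared
summary(complex)$r.squared

# ...but on a "replication study" drawn from the same process, its
# prediction error is much worse than the simple model's
replication <- data.frame(x = 1:10)
replication$y <- 2 * replication$x + rnorm(10, sd = 3)

mean((replication$y - predict(simple, newdata = replication))^2)
mean((replication$y - predict(complex, newdata = replication))^2)
```

The original result was perfectly reproducible (rerunning the code on the original data gives the same fit), but it doesn't replicate, which is exactly the distinction we drew earlier.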
But as we saw, there are many other reasons, less glamorous, or maybe we should say more humble, reasons, why we might have a problem with reproducibility or replicability. I feel like blaming our problems on those three issues is really naive and short-sighted, and I'd like to think that if you or I had been in the room at that colloquium, we perhaps would have come up with different reasons than these three.

So here's a question to think about as we all focus on trying to improve the reproducibility and replicability of our research: is research that's perfectly reproducible and perfectly replicable necessarily correct? I would say no. But I'm optimistic that research that is done well, is described well, and is transparent is more likely to be done correctly. If nothing else, it provides an avenue for others to see how sensitive an analysis is to deviations from the pipeline we propose, and perhaps to go back and figure out where we were wrong in our assumptions or incorrect in our data analysis. Following up on that: is a result that fails to replicate someone's "fault"? Can we blame somebody if we can't replicate previous work? I would say not necessarily. Sure, there are problems of sloppiness and bias and fraud, but we might also be looking at biological phenomena that we haven't captured. It's also important to keep in mind that replication is a product of biological as well as statistical variation, and that a small p-value does not guarantee the correct result.

As we move forward, a theme that we're going to come back to time and again is documentation and transparency. I love this meme because it gets to the root of a lot of the issues we have with reproducibility and replicability. Say I'd love to draw this beautiful owl, and the instructions in the book on how to draw an owl are: draw two circles, then draw the rest of the owl. Sure, those are instructions. Those are methods. Somebody could perhaps use them to draw an owl. But it's really not reproducible, and it wouldn't tell me much about, say, how to draw a pigeon or a robin. I believe that there's a reproducibility crisis, if you want to call it that, because our descriptions of methods are pretty lacking. We need to learn how to use existing technology to virtually and electronically expand our methods sections.

I'd like you to think about a case study that is very common, and it never ceases to amaze me how unprepared I am when it happens. Say my lab publishes a paper that's really awesome and gets a lot of attention, and I start getting emails from people asking about the nitty-gritty of how we did things, not because they want to throw rocks, but because they want to do it too. Unfortunately, because of how long peer review takes, the trainee is now long gone. They're off to a new, exciting job working on fun and exciting problems, and they are very slow to answer my emails. I suspect if you talk about this with your PI or other researchers, they'll have had this problem as well. Perhaps you've also been the person emailing the PI to ask questions. So what happens when you request this information from the author? Have you ever been on the other side and been asked for information? And what happened?
So what I'd like you to do is think about these questions, discuss them with your PI and your lab, and perhaps work out a justification for why reproducibility and replicability are important. I'd also like you to think about what you can do to be proactive to avoid these cases. How do we not have that anxiety when we get these emails from people saying, what did you do here? I don't understand. A lot of information is embedded in laboratory notebooks; how useful are laboratory notebooks when we get these emails, especially as it relates to a computational analysis? And then, how long are you responsible for maintaining these records? Okay, sure, if you get an email the day the thing is published, you're on the hook. But three months later? Six months, a year, five years, 10 years, 20 years later? How long are you responsible for maintaining that? And how long can you reasonably expect someone to be helpful? I would hope that the day my paper is published, the trainee is going to be responsive and helpful. But five or 10 years later, do I expect them to be helpful? I don't think so, but I don't know exactly when it transitions from expecting someone to be helpful and to have the records, to not having the records or not being helpful. This is a real question and an interesting situation that we all run into: getting these emails, getting requests for further information. So again, discuss this with your PI and with your lab. What are some of the stories from your lab of when this has happened, and what did you do? What does your PI do? What are their strategies for dealing with this issue?

Philip Bourne invited others to run this experiment with him. He and his colleagues published a paper in 2010 looking at the Mycobacterium tuberculosis "drugome," and he encouraged people at a workshop in 2011 to try to reproduce the work in that paper. Bourne is someone I would consider a thought leader in the field of bioinformatics, very sensitive to issues of reproducibility. They did an audit of how long it would take people at varying stages of their careers, and with varying levels of bioinformatics skill, to reproduce the pipeline and the results from the paper. What they found was that somebody with pretty basic bioinformatics skills would need about 160 hours to get up to speed with the workflow and software. That's four 40-hour weeks just getting up to speed on the workflow. It would then take another 120 hours to actually implement the workflow. So while that is technically reproducible, I think it's beyond the level of effort that a lot of us are willing to spend to reproduce somebody else's work. It really sheds a lot of light on what we mean by "reproducible." How much friction, if you will, do we allow and still say that something is reproducible? If it takes 160 hours, is that reproducible? What knowledge does an individual have to have to be able to reproduce the work? Keep in mind that you, the person doing the work, know far more about the work than anybody else ever will, so assuming that somebody else has the same skill set as you is probably a bad assumption.

A lot of the discussion around the reproducibility crisis comes from two papers that I'd briefly like to mention. The first comes from scientists at Bayer.
They stated that there was an unspoken rule among early-stage venture capital firms that at least 50% of published studies, even those in top-tier academic journals, can't be repeated with the same conclusions by an industrial lab. So Bayer sees a really cool result from an academic researcher and wants to convert it into a product to sell commercially, but they find that at least 50% of those preclinical studies cannot be repeated. In this review, they discuss, I think, about 67 or so, maybe 70, studies that they had tried to reproduce or replicate at Bayer. What they found was that there were inconsistencies in about 65% of the studies. In only about 4% of the studies were some of the results reproducible, and in about 7% of the studies was the main data set reproducible. This seems like a problem, right? A large fraction of the studies they tested just could not be reproduced. Now, there are subtleties here, because they were not doing exact reproductions; by our rubric, we would perhaps consider these to be replications, since they might have been using different mouse models or a different setup to fit what they were doing in their own labs. But it still points to an issue.

The next study was published by scientists at Amgen, by Begley and Ellis. They found that, regardless of the impact factor of the journal the original result was published in, many of those studies also were not reproducible or replicable. Furthermore, they pointed out that a number of other studies would come along after the initial result, trying to follow up on specific points or building off of the earlier work to take it further. Their point is that there's a lot of wasted time and resources in trying to build upon flawed science. They state that cancer researchers must be more rigorous in their approach to preclinical studies, and that given the inherent difficulties of mimicking the human microenvironment in preclinical research, researchers and editors should demand greater thoroughness. This is a very negative, I would say "gotcha science," type of attitude. But there's something you'll notice if you've looked at these articles: neither study did anything to demonstrate that their own work in these reviews could be reproduced. There are very thin details as to what they did and how they tried to reproduce or replicate these experiments. So you may want to pause and think deeply about how much weight we want to put on these two reviews.

Okay. At the same time, there is a general sense that science is hard, that science has gotten very complicated in the last decades, and that we are not doing a good job of ensuring the reproducibility and replicability of our own research. To that end, Collins and Tabak published an editorial in Nature in 2014 describing NIH's response to the so-called reproducibility crisis. They point out a handful of problems that they wanted to follow up on with policy changes at NIH. They note that publications rarely report basic elements of experimental design, things like blinding, randomization, replication, sample size, and the effects of sex differences, and that the underlying data are rarely made available. There's a problem with restrictions on the length of methods sections. And, as we saw with the Bourne paper, they assume that the reader has an idealized level of expertise, which is just not practical.
So NIH has focused on a lot of these social factors. There's the idea that we fail to publish negative results; this is commonly called the file drawer problem, where we stick our negative results into a file drawer and never report them. Some scientists, they say, repeatedly use a "secret sauce" to make their experiments work; this is a problem of lack of transparency. There are also perverse incentives to publish and hype striking results, and this ties in with the idea of "impact factor mania," where people are very focused on getting their work published in a journal with a high impact factor, because the ability to secure a K award, or to get tenure or a promotion, is tied to the number of papers that you have in journals with very high impact factors. As I said, NIH has now gone out of its way to develop new guidelines to improve the reproducibility and replicability of our research: how do we justify the scientific premise, how do we address experimental design issues, what are the biologically relevant variables, and how do we authenticate our reagents to know that they are what we think they are? There's been a big push by NIH to improve the reproducibility and replicability of ongoing research, and much of that effort has focused on what we are calling replicability. This tutorial series was funded in part by an R25 grant that I received from NIH dealing with reproducibility in the area of microbiome research.

Thinking about the microbiome world, Jacques Ravel and Eric Wommack published an editorial in Microbiome called "All Hail Reproducibility in Microbiome Research," and this was one of the assigned readings for this tutorial. I want to ask what your initial thoughts about it were. What do you think? One of the comments that jumped out at me was when they said it is no mistake that the best-documented code turns out to be more frequently used by microbiome researchers. If you go out of your way to make your data available, make your code available, document it, and make it clear what you've done, it is going to get cited and it is going to be used. I have certainly experienced that with our mothur software package: by making it as open, transparent, and well documented as we can, we have gotten a large number of citations because we have engaged in those practices.

So what I'd like you to do is take the next four or five minutes to read back through this editorial and see if you can identify three or four different technologies or platforms that the authors point to for improving the reproducibility of microbiome research. Go ahead and hit pause, and once you have done that, come back.
Alright, so they focus on several tools. For data accessibility, these include depositing sequence data in the Sequence Read Archive (SRA) as well as in dbGaP; other data we might put into a website called figshare, which serves as a repository for data; and using a metadata standard called MIMARKS. For writing our code and our workflow together, they point to tools like Project Jupyter notebooks, which they call IPython notebooks in the editorial (the name has changed since it was published), as well as knitr documents in the R environment. Finally, they also talk about using version control tools, including software like Git, as well as a website built on top of Git called GitHub. As we go through this series of tutorials, we're going to describe how we would use these different tools to improve the reproducibility of our own research. We're also going to go considerably further than Ravel and Wommack went in this editorial.

If we think about microbiome research in particular, we might think about threats to reproducibility and replicability that build on the lists we came up with earlier. One issue is the lack of standard methods. This is a very thorny issue for a lot of people, myself included, because we pick our methods to answer the specific questions that are relevant to us today. For example, I pick the variable region of the 16S rRNA gene that I sequence because I've got a specific set of questions; if you're studying a different set of questions, or a different part of the body, you might want to use a different region. That's going to make it really difficult to compare results in a meaningful way. The accessibility of data is still a big problem: not everybody is depositing data in the SRA, and not everybody is providing the metadata needed to actually make those data useful. We all use different populations; part of the beauty of the microbiome field right now is that there are people asking the same question in different cohorts. We also have really complicated and lengthy data analysis pipelines: there are many steps, many parameters, many variables, different options, and different databases we might use. It's complicated and it's lengthy. One of the things that drives me nuts is reading a manuscript that says, "we analyzed our 16S rRNA gene sequence data using mothur," and that's it. They don't say anything about how they used mothur, and I know, as the mothur developer, that there are thousands of ways you could use mothur to analyze your data. We need greater clarity to help us through the complexity and length of our data analysis pipelines. There's variation in our mouse colonies: if you look at the gut microbiota of mice from Jackson versus Taconic versus those in the breeding facility here at the University of Michigan, we're going to see big differences, and those differences can cause differences in the phenotypes we're interested in. We also know that there are contaminants that show up from our reagents when we're sequencing low-biomass samples; this has raised a lot of questions about whether there's a microbiome associated with the placenta or with the lungs. And we know that there are sampling artifacts, where how we store our samples, what we do with them, and the size of our samples can all introduce artifacts that will impact our downstream results.
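To make the "we analyzed our data using mothur" problem concrete, here is a minimal sketch of the kind of knitr/R Markdown document that Ravel and Wommack point to. The file path and data set are hypothetical; the point is that the narrative, the exact code, and the software versions all travel together in one document that can be re-executed.

````
---
title: "Analysis of our 16S rRNA gene sequence data"
output: github_document
---

Rather than just saying "we used mothur," we spell out what we did with
the output and record the versions of everything we used.

```{r richness}
# Hypothetical example: read a shared file produced by mothur and
# calculate the observed richness (number of OTUs) in each sample
shared <- read.table("data/mothur/final.shared", header = TRUE)
otu_counts <- shared[, -(1:3)]  # drop the label, Group, and numOtus columns
richness <- apply(otu_counts, 1, function(x) sum(x > 0))
summary(richness)
```

```{r session-info}
# Record the R version and package versions used in this analysis
sessionInfo()
```
````

Rendering this document (for example, with rmarkdown::render) regenerates the numbers and embeds them in a report, so a reader never has to guess which commands or versions produced a result.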
What I'd like you to do next is go back to that Ravel and Wommack editorial and look at the paper by Meadow et al., as well as the GitHub repository that they cite in their paper. I'm going to pull that up for you here: this is the GitHub site that holds the repository for Meadow et al. What I'd like you to do is take some time to see how accessible the code is. Okay, the code is here, but how accessible is it? Can you make heads or tails of what's going on? What do you need to know to make sense of the repository? Is it structured and organized in a way that you can understand? Is it well documented? How long did it take you to find the code for the figures in the paper? Go ahead and work through this, and think about how this helps improve the reproducibility of the paper from Meadow et al.

Great. I really tip my hat to Meadow et al., because they put this up at a time when very few people were thinking about how to make their work more reproducible, and I credit them with doing a good job of making their code accessible. That being said, I think there are still a number of points they could improve upon to make the work more reproducible, transparent, and accessible. First of all, the sequence data they published were deposited into figshare and not into the Sequence Read Archive. There have been times in the past when the Sequence Read Archive has been really difficult to work with for submitting 16S data, and I suspect that's why they ultimately put the data into figshare; it's not a big project. The code isn't very well organized: it doesn't jump out at me where the data are, where the code is, or where the output is. A good thing is that the Rmd file provides the narrative to explain what's happening and why they made different decisions. The code for the figures is in the Lillis_Surfaces.Rmd file, and it's rendered in the Lillis_Surfaces.md file, so it's there. I could certainly find areas to improve upon, but again, as a first step, and as one of the first studies that really dug into making its work more reproducible, I think they did a great job.

If we come forward in time a bit, we might think about more recent microbiome papers and how far they've gone to make their code and analyses more reproducible. I'm taking three repositories from my own research group, for three papers over the last few years. I don't consider these to be perfect; I think if you go from Zackular to Baxter to Sze, you'll see differences and changes in how my group has approached ensuring the reproducibility of our data analysis pipelines. Something that we can use as a checklist to assess a recent microbiome paper might be: How difficult is it to regenerate a p-value or a figure? Could we go all the way back to the raw data to regenerate that p-value or figure? Where are the raw data? Is the code available? How long is the methods section? Do they do that thing where they just say "we used mothur," or do they tell you about the databases and the R packages they used? I'd really encourage you to go through these three examples, which I think are pretty good. I'm biased, of course, but at the same time, I know we've evolved in how we approach these types of repositories and in our approach to ensuring that things are reproducible. So go ahead and pause the video and take a few minutes to check out those repositories.

Great, I hope you enjoyed looking at those. The repository that we see for Sze and Schloss is more along the lines of what we're going to have as an output from our own work at the end of this series of tutorials.
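To give you a feel for what such a repository might look like, here is a sketch of one common way to lay out a reproducible project. This is a generic layout, not the exact structure of any of the repositories above, but it reflects the questions in the checklist: where are the raw data, where is the code, and how do I regenerate a figure?

```
project/
|-- README.md         # what the project is and how to regenerate the results
|-- data/
|   |-- raw/          # raw data, never edited by hand
|   `-- processed/    # intermediate files generated only by code
|-- code/             # scripts, ideally one per analysis step
|-- results/
|   |-- figures/      # every figure in the paper, regenerable from code
|   `-- tables/
`-- paper/            # the manuscript, built from the files in results/
```

A reader landing in a repository like this can answer the checklist questions in seconds rather than hours.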
So why are reproducibility and replicability important? Perhaps we think that's obvious, right? Obviously we want to make sure that our results are correct and/or generalizable, and I think this is where most people stop when they're thinking about reproducibility and replicability: they're thinking about correctness. But as we said earlier, just because something is reproducible doesn't necessarily mean it's correct. I'm also interested in making sure that others can build off of my work, and I want them to take my work and repurpose the materials and methods to do something different. We've done a lot in the area of finding signatures of colorectal cancer using 16S rRNA gene sequencing. If somebody publishes a new study doing the same type of thing and they don't use my data, that's a bummer. We invested a lot of time, effort, and money into generating and analyzing those data; it would be a shame if others didn't use them to build upon for their own work. Similarly, if people have looked at our code and found it useful for answering a question in a certain set of subjects or experiments, I would hope they'd find it useful for their own conditions, experiments, and populations.

Ultimately, I think of reproducible research as a form of preventative medicine, a phrase that was coined by Jeff Leek and Roger Peng in the editorial I mentioned at the beginning of this tutorial. There's way too much emphasis on gotcha science, way too much emphasis on the Amgen and Bayer papers. Research that's done in a manner that maximizes reproducibility is going to make life easier for you in the long run. When you get that email from somebody who's excited about your work, if you could just send them to a well-documented repository that has all your code, wouldn't that be amazing? It's going to instill more confidence in others, and it's going to be easier for others to build from. So really think about this as a form of preventative medicine, as a way of fostering collaboration with others going forward.

As we think about collaboration, your collaborators need to be able to reproduce your research. So who has to be able to reproduce your research? I told you this at the end of the last tutorial: you. You have to be able to reproduce your own research. Think about yourself six months from now, and imagine that current you no longer has access to email. How would you write for yourself six months from now, knowing that you couldn't get in a time machine or email old you to figure out what you were doing? So take care of yourself; be preemptive in how you think about reproducibility. Second, your PI is an important collaborator. They need to be able to reproduce what you're doing, because at some point you're hopefully going to graduate and move on to greener pastures, and your PI is going to be left behind to figure out what the heck you did and how they can communicate that to other researchers who want to build upon your work, and upon their own work.

So where are we going with this series of tutorials? We're going to spend a lot of time in the next session talking about documentation, in terms of text-based documentation, but also, in future sessions, about data organization as a form of documentation, code as a form of documentation, and automation as a form of documentation. If we can tell the computer how to run the analysis in an automated way, then all we need to do is convert from computer-speak, from code, into human text to understand what's going on.
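To make the idea of automation as documentation concrete, here is a minimal sketch of a driver script. Everything here is hypothetical (the file names and the steps are stand-ins for whatever your pipeline actually does), but it illustrates the point: because the whole pipeline runs from a single command, the script itself is an unambiguous record of what was done and in what order.

```r
# run_analysis.R -- hypothetical driver script that regenerates the entire
# analysis from the raw data. Anyone, including future you, can rerun it with:
#
#   Rscript run_analysis.R

source("code/download_raw_data.R")    # pull the raw sequence data from the SRA
source("code/process_sequences.R")    # quality filtering and OTU clustering
source("code/calculate_diversity.R")  # alpha and beta diversity metrics
source("code/build_figures.R")        # write every figure to results/figures/
```

A script like this doubles as a table of contents for the project: reading four lines tells a newcomer more about the pipeline than a paragraph-long methods section would.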
Along the way, we'll learn some best practices, like keeping our raw data raw as much as possible and not touching it with our hands or with our cursor, not repeating ourselves in our code, and using tools that enable collaboration and transparency.

In closing, I'd like you to spend some time thinking about a series of questions, and again, think about discussing these with your PI, your research group, and your friends. First, how difficult would it be for you to regenerate any of the figures or p-values from a paper that your lab published five years ago? If you open up a paper from five years ago and go to Figure 2, how difficult would it be for you or one of your current lab mates to figure out how to regenerate that plot? For your next paper, what's the most important thing that you can do to improve the reproducibility of the data analysis? The Amgen and Bayer papers said they thought it was about 50%, but what percentage of papers published in your field do you think are reproducible or replicable, and why did you come up with those numbers? And finally, building out from your two important collaborators, yourself and your PI, what other broad groups of people need to be able to reproduce or replicate your work? Think about them in terms of their own skill sets and their own knowledge; this gets back to the Bourne paper, where it took somebody with decent bioinformatics skills a long time to get up to speed with the analysis that was done in a previous paper.

Wonderful. Hopefully you found today's discussion helpful in getting you to think about your own reproducible research practices and where there might be gaps in them. At the same time, I hope you have a greater appreciation that everybody has gaps in their practices; this stuff is really hard. Please be sure to take the time to engage those in your research group, your lab mates, your PI, those around your lab, and discuss the material and questions that came up in today's tutorial. Next time, we'll start to develop more practical steps toward greater reproducibility, and we'll slowly be ratcheting up the technical material in the next tutorial. Until then, have fun engaging with these questions and having discussions with those around you, those you're doing research with. I think you'll find it to be really useful, really illuminating, and really helpful in pointing the way toward reproducible and replicable research.