Good afternoon, everyone. Welcome to this webinar on the Reproducibility Project: Cancer Biology. My name is Courtney Soderberg. I'm the statistical and methodological consultant at the Center for Open Science, and this webinar is going to be given by my two colleagues, Alex Dennis and Tim Errington. I'll let them introduce themselves. All right, thank you, Courtney. So I'm going to start off by introducing myself quickly, then I'll let Alex introduce herself. I'll give an overview and dive in a little bit, and then Alex can take it from there. So I'm Tim Errington. I'm the metascience manager here at the Center for Open Science and the lead on the Reproducibility Project: Cancer Biology that we'll be talking about. And my name is Alex Dennis. I'm a research coordinator on the Reproducibility Project: Cancer Biology. Great. So this is pretty timely. We thought this would be a great opportunity to talk about our project and, more importantly, to talk about how we're coordinating it: how we're using the Open Science Framework and, in our case, the statistical program R to manage our entire workflow. The Reproducibility Project: Cancer Biology is a collaboration between the Center for Open Science and Science Exchange, and we're partnering with the open-access journal eLife to conduct independent replications of previously published high-impact cancer biology papers from the years 2010 to 2012. We received funding from the Laura and John Arnold Foundation for this. The idea of the project is to get an initial estimate of the replicability rate of high-impact cancer biology papers, as well as to understand the challenges of conducting a project like this, or even just of conducting replications. Hopefully this will become an experiment that provides more information about how to improve reproducibility and replicability in cancer biology and in science more generally.
So not that long ago, almost a month ago, we published the first results from that project. These are the five papers that we published. eLife is publishing a format called Registered Reports; we've had a webinar on that before, and I'll talk about it briefly. That's what you see in the middle there. And what we published recently are these Replication Studies, the results of our experimental plans. We're not going to dive into any of these specific ones, but of course if anyone has questions about them, by all means feel free to ask us. We're happy to answer any questions or at least point you to more information. Instead, like I said, we're going to focus on the process we followed in conducting each of these replications, and how we're doing this for every single replication in this project. A good way to start is to lay out some definitions, or at least define these terms the way that we're using them. So what are reproducibility and replicability? There are a couple of different definitions worth mentioning. One is computational reproducibility: if we took the original data, or anybody's data, and their code, the idea is that we should be able to rerun it and get the exact same numbers and figures that were presented in any given paper or presentation. Some people think of this as the bare minimum: can you re-obtain the same numbers or graphs that were originally presented, using the exact same materials? Another one that is important, especially in the biological sciences, is empirical reproducibility: whether I actually do it or not, do I understand the information well enough to be able to rerun the experiment or the survey the same way it was originally conducted?
This gets much more at the methodology and the approach: those procedures, those materials and methods. Again, this is about understanding what one needs to do to conduct that same experiment. And when you put computational reproducibility and empirical reproducibility together, that allows you to really ask about replicability: an independent person using those exact same methods and that exact same analysis, but collecting new data to do the exact same thing. The question then is, do you get consistent results or not? That's essentially the heart of what this project is trying to do: to look through this entire process, both to understand the challenges within it, and also to get at that last point. Can we get the same results, considering everything else we currently practice, both in the way we conduct our research and the way we communicate it? Others have written about this really nicely. Roger Peng had a nice article in 2011 highlighting computational reproducibility. One thing I wanted to bring up here, in the top right of this slide, is thinking about reproducibility as a spectrum. If we just do what we normally do, which is publish the results and the figures, it's really hard to tell whether that entire process is reproducible from the computational standpoint. You don't really move toward full replication until you start to look at the code, and at the linking of the code and the data together; that then allows you to think about what was presented in the paper and to really reanalyze it. And this was also shown really nicely, in terms of the steps that occur, back in 2008.
There's a really nice graphic: after I conduct my experiments and I have my data, whatever that is, at some point I start to process that data so I can analyze it and compute on it. Then I present those results as figures, as tables, as numerical results that appear in the paper, generally as text or a figure. The important part is thinking about how that is viewed by the author, the person generating it. They start with the hypothesis, the experimentation, and the data, and move toward the paper. What you have to remember is that as readers, we see it from the complete opposite standpoint. We're looking at the text, at the paper, trying to figure out what was actually done throughout the entire process. Trying to reconcile these two views is what we're trying to get at with reproducibility, and then of course figuring out how that impacts replicability is a focus of the project. Okay, so what we use in our project is something there have been webinars about; I'm putting some links at the bottom, and you can find more information if you go to COS.io. There was also a previous training session last month, and it sounds like there'll be another one at the end of this month, on preregistration and Registered Reports. If you want more detail, I'd recommend signing up for that webinar as well as checking out the online information. With our project, we're using this Registered Reports format. If we think of the research lifecycle as a short linear process, we start with developing an idea and go all the way through designing, collecting, writing up your report, and publishing it. Traditionally, peer review occurs after you've written the report. But with Registered Reports, you have two stages of peer review.
One, after you've designed the study, and two, after you've collected and analyzed the data and written the report. That's what eLife is publishing: the Registered Report comes out of that first round of peer review, and the Replication Study out of the second. So what we thought we'd do is go through these stages with our project and highlight what we're doing and why we're doing it. All right, so developing an idea. Well, for us, if we take any one of these original papers, that's our idea: to replicate what was originally done, taking key conclusions from those original papers. We're only interested in doing a subset of this, because we're looking to sample from the literature. And of course we're trying to sample from high-impact papers and those quote-unquote key conclusions, just because those tend to drive the field a bit more. So this is where we start, which is how anybody else starts: with the paper and with the figures. This is the information we all consume, whether we want to replicate it or not. And as we all unfortunately know, the current norm is still not to present everything. We have nice bar graphs, but we don't have the raw data. We have methods, but there's little detail in those methods. And it's pretty common, as in this example, which you see very often, for the methods to be described as performed elsewhere. Then you go find that paper, and then the next one, and the next one, and you go down the rabbit hole until all of a sudden there's no longer a method, or the method is incredibly sparse and missing key information, whether that's the reagents or the procedure. And of course, we're also missing data when we're really trying to understand what was actually presented.
So when we conducted our project, we started from this point and, with assistance from a lot of volunteers, put together a document of what we perceived to be the necessary steps for conducting those experiments from the paper: from the data presented in the paper, the methods in the paper, and any references in the paper. And of course, we instantly had our own questions. Those could be open-ended questions, like what's the catalog number of something, or how exactly was an experiment performed, as well as requests for data or unique materials. This, in many ways, is the start of the project, and there's a lot here that we hope to dive into at the end of the study. That's why I thought I'd share just a couple. These are unrelated to any of the papers we're doing, but they are common concerns that come up when someone gets asked. I'm sure if you've done any research, you've had these exact same concerns trying to dig through the literature, or potentially through your own lab's records: we tend to lose a lot when we transfer what we've done into a paper. We have a lot of restrictions on text length and the number of figures we can present, so a lot of that nuance, that important information, gets lost. We also don't generally think about backing up, about keeping data in a form that's easy to recall at any given point. And of course, experiments are done by different people. A lot of research is collaborative at this point, and not everybody is even in the same lab, let alone the same institution. So going back after the fact and figuring things out is very problematic for the authors, as well as for anybody trying to actually follow up on that research. And it's very common for people not to have the raw data. Generally, once you've published, you've moved on, and this information does not remain readily accessible.
These are already some of the challenges that others have talked about, and what we're trying to see is how these challenges actually impact the ability to build on that research and obtain a similar result. All right. So after we'd filled in all these details, we worked with the authors to get as much as we could, and where we couldn't, we filled in those details ourselves. This is what was peer-reviewed and published at eLife over the last couple of years. What we essentially tried to fill in, at any given point, was what we knew we were going to use, or perceived we would have to use. That goes down to all the materials and the reagents, which a lot of the time were originally unspecified. When they weren't specified, we filled them in with what the replicating lab was planning to use. That included the procedures: again, there were a lot of details that were missing. Hopefully we got them filled in by the original authors, if they could find them; if not, we had the lab fill them in. This is really important, because it starts to get at the fact that if we don't present everything in the original paper, and there's no easy means of accessing it, the next person who comes along has to start making assumptions. Another important part of this is figuring out how many samples we're going to use. A lot of the experiments we're conducting are animal experiments, although we also have a lot of cell line and patient sample experiments. The question becomes: how many animals do we use in any experiment? The goal of this project is to leave no reason to fail to detect the original effect. So we use the original effect size as our point estimate and determine how many samples we would need to have sufficient power to detect that effect, with a minimum of 80% power.
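As a concrete illustration, here's a minimal sketch in base R of that kind of power calculation. The effect size and standard deviation below are made-up numbers, not values from any of the replicated papers; the idea is simply to treat the original study's reported effect as the point estimate and solve for the per-group sample size that gives at least 80% power.

```r
# Sketch of a power calculation like the one described, using base R's
# power.t.test(). All numbers are hypothetical.

orig_mean_diff <- 1.2   # hypothetical difference in group means from an original paper
orig_sd        <- 1.2   # hypothetical pooled standard deviation

res <- power.t.test(delta = orig_mean_diff,
                    sd = orig_sd,
                    sig.level = 0.05,
                    power = 0.80,
                    type = "two.sample")

# Round up, since you can't run a fraction of an animal.
n_per_group <- ceiling(res$n)
n_per_group   # 17 animals per group for this hypothetical effect size
```

With a standardized effect of 1.0, this gives the classic result of 17 animals per group; in the project, the delta and sd would come from the original paper's reported statistics.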
We put in all those details as well. So all of this is in the Registered Report that gets peer reviewed. But importantly, even at this stage, we're already making assumptions and doing analyses in our power calculations, and so we're also starting to use the Open Science Framework to store that information. As a complement to each of these papers, there are additional materials that we can't easily put into the paper itself, and that we want to make sure we can share with the reviewers as well as the general scientific community, so they can go back and figure out how we obtained the numbers we obtained, from our power calculations in particular. At that stage, we move into actually conducting the experiment, which I think is what everybody wanted us to do, and what we were finally able to finish last month for the first five. And I'm going to pass over here to Alex to let her dive into this. Okay. So, going off a little of what Tim was saying, at this point we had all of our protocols put together and reviewed, and now comes collecting the data and beginning to analyze it. A few things I wanted to point out are some of the challenges or roadblocks in data collection and analysis that we had discussed and wanted to avoid in this project. One of those problems is that collaboration among multiple people, labs, and groups that may not be at the same institution is often difficult, and in our RPCB that was necessary. As Tim said at the beginning, we're working with Science Exchange and multiple labs to parcel out different parts of these protocols to replicate. So multiple types of data are going to be needed from each of these groups, and we have to find a way to get all of this data into one place. We need quality control data.
We need to keep all of our analysis scripts and our figure generation in one place, and it needs to be understood by every member of the team and everybody from these different groups. And we want to do all of this while minimizing human error. We have multiple analysis scripts, multiple pieces of data, multiple people, and we need a clean process for getting from data collection to eventual publication of the manuscript with as few errors as possible along the way. Okay. A typical example of how this process might look, before we move on: Science Exchange and COS gather the reagents, as Tim was talking about earlier, and put together the protocols. Quality control tests need to occur before experimentation, and the results of those tests need to be shared with everybody on the team. The lab collects the data. COS helps with analyzing that data and putting together the figures. And then the manuscript eventually gets published. So by the time we have the finished manuscript, all of these different groups have touched all of the different phases. I want to go over a little of how we were able to manage that. All right. The first thing I spoke about was collaboration. We obviously have to reach out and communicate with a lot of people through the life of this project, so what we really tried to utilize was the OSF contributors tool. As you can see here, this is an example from one of our published Replication Studies, Study 19, with a few of the contributors who were involved. We have each of the contributors and the permissions, which we were able to adjust based on what was necessary for working with everybody. And this really helped in making sure we were all on the same page.
We all had access to everything we needed, and we could move forward without communication errors. In a typical process, all of those contributors on the previous page would have access to different protocols within the experiment, or in some cases multiple protocols within the same experiment if the lab was working with several. They could then upload all of the relevant data files to each relevant component just by dragging and dropping their data. And this is not just data files and Excel spreadsheets: this is all of their scanned lab notes, which, as Tim mentioned earlier, often don't get surfaced. It could be any preliminary analysis they may wish to do, any figures, any images, scans; everything can be uploaded by the individual labs at that point. There is also a recent-activity tracker on the OSF that we can use to see the progress made. In this example, you can see that this lab has added a xenograft control file. We set our notifications so we can see when that material comes in and stay on top of all these studies, especially since we have twenty-plus going on at the same time. Okay, so these are just a couple of examples of the data files that we can store and preview, and have been storing and previewing, on the OSF. We've got an HPLC report, STR profiling, and mycoplasma testing; a lot of the quality control data comes in early. We also have, in this example, handwritten lab notes that we can store and reference at a later point. All of this has been uploaded by the labs as the experimentation continues, along with any original raw data files. CSVs, Excel spreadsheets, however the data is collected, it can be stored in its raw form on the OSF. Additionally, we have all of our scripts and analyses, and any images that we generate.
And all of those can be stored compactly in these components on each of the project pages. Okay, so once we have the data saved to the OSF, our next step is to analyze it and generate all of the figures that will eventually be published in the manuscript. Making data as open as possible not only includes surfacing the data; for us it also includes making the data as interpretable as possible. We want everybody who needs, or wants, to go back and dig into our data to be able to understand it. So one of our goals is to make the analysis and code interpretable as well. As Tim was talking about earlier, when you read publications and view those images and figures, there's often a lot of missing information or missing context. You could look at a figure in a publication and leave with questions like: What did the raw data look like? How many technical replicates were there? How many biological repeats were there? How, if at all, was the data transformed? How was the figure generated? Et cetera. So here's an example of how we want to bridge some of those gaps for anybody who wants to view our project in more depth. This is from one of the five studies that were published: relative tumor volume over days, with different cohorts. Looking at this figure in any publication might lead someone to ask more questions. As you can see, this PDF file is stored on the OSF right here, along with all of the other PDF files and supplemental figures for this specific protocol. But in addition to that, just below it we have the analysis script for how this figure was generated. So anybody who wants to view the PDF also has access to the analysis script. You can see not only exactly how this PDF was generated, but also, as you can see here, that we're pulling those data files directly from the OSF.
So there are no translation errors between the raw data set, any manipulation of the data necessary to create a figure, and the figure that ends up in the manuscript. You can see here, additionally, there's a download button, so anybody who would like to download the script can do that. And this is what it looks like in my R console; you can see we've tried to make this as clear as possible. We list not only the required packages, but we try to label every point throughout these R scripts. Additionally, you can see here what I was referencing earlier, which is that we're using the GET function to pull in all of those original raw data CSV files, in order to try to minimize that error. Okay. This is just a closer-up example; I wanted to hit a few additional points. Specifically for the data analysis, we use R for all of our analyses, figures, and scripts, because R is easily available and open; it aligns with what this project is getting at. All scripts and data files are called directly from the OSF. Like I said, we're trying to minimize error by calling everything directly from the source. We try to list all of the necessary libraries and version numbers for each script, and to label each part of the relevant code. For right now, we're using the GET command to call all data and scripts from the OSF. There is also an R package for the OSF in the works, and I've put a link right here. There's a webinar about that as well, so if anybody's interested in learning a little more, you can follow the link at the bottom. Okay.
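As a rough sketch of that pattern, reading a raw data file straight from its OSF download URL might look like the following. The GUID `abcde` is a placeholder, not a real file, and the `osf_url` helper is ours for illustration, not part of any package.

```r
# Sketch of pulling a raw data file directly from the OSF, the way the analysis
# scripts described above do. "abcde" is a placeholder GUID.

osf_url <- function(guid) {
  # Each file stored on the OSF has a stable download URL of the form
  # https://osf.io/<guid>/download
  paste0("https://osf.io/", guid, "/download")
}

# read.csv() accepts a URL directly, so the raw data never has to be copied by
# hand into the analysis environment (commented out because the GUID is fake):
# tumor_data <- read.csv(osf_url("abcde"))

# Recording libraries and version numbers for the script, as mentioned above:
# sessionInfo()
```

Because the script constructs the URL and reads the file itself, anyone re-running the analysis pulls the identical raw data the authors used.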
So, in addition to trying to make the code not only available but also understandable and interpretable for anybody who would like to dig deeper: one of the issues we encountered, which Tim discussed earlier, in trying to build these protocols was often running into representative images without knowing the context behind them. So what we've tried to do is surface all of that data and all of those images that so often can't be surfaced in a typical manuscript because of space constraints; you can't post hundreds of scans or hundreds of images. Right here, what you can see is figure 4 from one of our Replication Studies. We have our bar graph and we have our representative images; I think these are both prostate and heart TUNEL stains. As you can see at the bottom, there's a link to the OSF, and anybody who would like to view these images in more detail, or see what they look like in the context of all the other images, can follow that link. In this example, I followed the link and wanted to look at panel G: the heart TUNEL stain in the treated mouse group. As you can see, not only can I now see that representative image, but I can also see all of the other images we collected during this protocol. And this is something we think is great, because in addition to seeing what's published, you can also get all of the relevant background info and context that you just can't get in a regular publication. So yes, all of our RPCB projects will point to their OSF components, which house all of the relevant data and images, even when we can't necessarily publish them. Okay, so hopefully that gave you a high-level overview of how we try to collect and analyze the data in this project.
So the next step, once we have all of this data collected, have done all of the predetermined analyses from the Registered Report, and have generated our figures, is to write the actual report. Okay, and sorry about that line, if you can see it. Currently, as Tim mentioned, the norm is not to surface all data and analyses, which can lead to unanswered questions. And as he talked about earlier, you can see I've put the reproducibility spectrum back down here. We talked about code, and we're moving along this spectrum: we talked about surfacing the code and the data, making sure the code is executable and understandable. But to push that a little further, we want to make this accessible for anybody reading through the manuscript. We want there to be a trail from the manuscript all the way back down to the raw data files, and equally for somebody starting from the raw data files to see how we built up to the manuscript. This is important because reproducibility takes a lot of time and effort on all sides when information is not shared. It's not only a hassle for the replication team, for Tim or for us at the RPCB, to go through and try to fill in the holes, the exact steps of the protocol we may be missing; it's also a lot of time and effort for the original authors, who need to go back and dig through their data files, look through their methodology, or try to reach out to grad students who may no longer be at the institution. So we see more openness and sharing in this type of project as beneficial to everybody involved. Once we've done that, the method we have taken to make our manuscript as reproducible and open as possible is to use R Markdown.
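As a rough sketch of that assembly step, with illustrative file names rather than the project's real ones: the manuscript .Rmd sources the analysis script, which itself reads the raw CSVs from the OSF, so rendering regenerates every figure and statistic from source rather than pasting numbers in by hand.

```r
# Sketch of an R Markdown manuscript skeleton; file names are placeholders.
# strrep() builds the chunk fences so this sketch is itself runnable.
fence <- strrep("`", 3)

manuscript <- c(
  "---",
  "title: 'Replication Study (sketch)'",
  "output: pdf_document",
  "---",
  paste0(fence, "{r setup, include=FALSE}"),
  "source('Study_Analysis.R')  # pulls raw CSVs from the OSF, builds the figures",
  fence,
  "Mean tumor volume was `r round(mean_volume, 2)` mm^3."  # computed inline
)
writeLines(manuscript, "Replication_Study_sketch.Rmd")

# Rendering to the PDF that goes out for review requires rmarkdown and pandoc:
# rmarkdown::render("Replication_Study_sketch.Rmd")
```

Because the inline value is computed at render time, any change to the raw data or analysis script automatically propagates into the manuscript text.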
And this allows us to call all of the scripts, which call all of the raw data files, and pull those into our final markdown, which gets rendered to a PDF document and eventually published with eLife. So, going forward, you can see the process here in two screenshots. We've got our Replication Study 15 .Rmd file, and we typically use R Markdown, as I said earlier, to pull in all of those R scripts. You can see here that we're downloading the scripts, and trying to label as clearly as possible what each of those objects in R means and how we're using it. Once we've completed the R Markdown, we render it into a PDF file, and we save both on the OSF. As you can see here on the files tab for Replication Study 15, we have not only the final PDF version but also that Replication Study .Rmd file. The goal here is for you to be able to take any number in the manuscript, any p-value you like, look at the code in the R Markdown file that corresponds to it, and follow that path back to the analysis it came from, and finally to the raw data file where the data was initially collected. So we're not only trying to build up to the manuscript; we're also trying to make it reproducible from the other direction. We want people to be able to read the manuscript and trace all of those numbers back down to the source, the raw data file. Additionally, we want to make sure the manuscript itself is reproducible, so that anybody can take this code, run it in their own R session, and get the same output. That's important to us also. Do you have anything else to add on that, Tim?
All right, so that's the process of writing up the report, and at that point it goes through peer review again with eLife before it's finally published. This is just an example of one that is actually now published and available on eLife. Not only can you find it on eLife, but you also have the PDF of the final manuscript, as well as the link in the manuscript to the project page on the OSF, which breaks down all of the components in a way similar to what we just talked about. So once we have the project wrapped up, we have all the data, we have all the analyses, and we have the manuscript written. Our final step using the OSF is to make this public, and that's as simple as pressing the Make Public button above. You don't have to do this; I'm sure there are a lot of projects on the OSF that use it as an organizational tool and don't make things public, but obviously it's important to our process and our project to make this public at the end, so that anybody can dig through the data or the analyses we talked about today. Okay, so unless you have anything else to add, we can start taking questions, and talk a little more about the process or really anything involved with the RPCB. There are also additional webinars about the OSF, including the one I spoke about previously on using R with the OSF, which can be found at the link below. So we had a question from Vishal: How much information was missing in the methods section? Also, how long did it take to get that missing information?
That's a great question. In terms of trying to quantify how much is missing, that's something we'll try to get into at the end of the project. As you can imagine, it's going to end up being a little bit subjective, only in the sense of how you classify how much is enough, but we will try to sort it into certain bins, in terms of data, analysis, materials (mostly focusing on key ones), and methodological details, and try to figure out some way to look at that across the project. One simple answer is that for every single paper we had at least one question; in fact we had multiple questions per experiment for every single paper, and that's just what we asked. There were a lot of times when an author who engaged with us would actually give us even more information, because we didn't even know to ask certain questions; that information was simply lacking. So it's clear, in that sense, that there's a lot of missing information. Your second question asked how long that took us. That was actually, in many ways, a very troubling aspect of getting the project to this point. A lot of people thought we'd be able to get these results a lot sooner, but we obviously wanted to make sure we were taking our time, engaging and working with authors, when they were willing to work with us, to gather the information. And it was actually, again, very striking, especially when we got great feedback from authors who were willing to give us the raw data, their analysis scripts, and all those detailed methods.
This could take months. We had a couple of papers where it took almost a year to get from the initial email, containing our first draft and all of our questions, to the final submitted version, because there was so much back and forth required to make sure that, one, we understood the study, and two, the authors had time to go back and sort through where the information was. Not surprisingly, by this point almost everybody who worked on these papers had left the institution, so even though the labs still had access to the lab notes, the corresponding authors often wanted to go back to the source and ask where things were actually located. I think that speaks less to anything being wrong and more to our research culture: we're just not set up for this as well as we could be. Another point wrapped into this is materials. A lot are commercially available, and that's fine; even if something is no longer sold, you can usually find a substitute. What was tricky for us was unique reagents, take a plasmid or a cell line as examples. There were times when it took almost a year for MTAs, material transfer agreements, to go through. Again, I think that speaks to how we can set ourselves up better by depositing materials in repositories: there are cell line repositories such as ATCC, among others, and Addgene is a great one for plasmids. Had the materials been deposited, the process would have gone much faster, because those repositories are already set up to do that type of sharing. It speaks to the way we're currently doing things and how we could speed it up. So yes, it's a really good question, because in many ways that's
the heart of it: how long does it take to fully understand what we're publishing right now? The truth is, really understanding a paper takes a lot of time reading it, but even more time gathering all these details. And I think that gets exactly to the efficiency point we were talking about earlier. It's not only time consuming for us; it's time consuming for the original authors. We think that taking these steps to make our data open and as available as possible will help mitigate some of that burden if anybody wants to dig into our work. Okay, so we have another really good question. A different Tim is asking: will it be common practice to try to reproduce a published study from the raw data without first contacting the original authors, or will it be common courtesy to allow the original authors to assist with the reproduction of their experiment? That's a great question. Our project is in many ways already going through both of those. We don't have 100% author engagement; some chose not to engage with us, and there's at least one example where the authors wanted to stay at arm's length, because they thought there was more value in letting us go at it without their input, working just from the paper and the raw data, such as it is. This gets back to something a lot of people have written about: what are the norms of replication? In many ways, I think we need to get as close as we can to putting as much as possible into the original paper, so that replication is possible without contacting the authors, because that's the best way we can move forward. Think of the scalability and efficiency: if you have a paper, or even just a dataset or a methodology, that's going to reach and impact a lot of people, you want people to be able to do that
without them having to come back to you, because otherwise you'd spend all your time answering those questions. That said, especially with emerging technologies, which we tried to avoid in this project, though it still came up, or with really exciting new techniques, I think it's still valuable to have the ability to communicate with the authors, be that by email, by phone, or at conferences. I don't think we'll ever get away from that, especially when somebody doesn't completely understand something, or when they try a method and run into issues early on. So that norm will probably always remain, but the closer we can get to enabling replication without having to contact authors, the better. There was a paper published a few years ago by Tim Vines and colleagues looking at access to information after papers were published. They weren't actually going to use the information; they were just curious how easy it was to obtain. They took a set of papers at varying times since publication and asked two things: one, how often are the corresponding author email addresses in papers still correct, and two, if the email does get through, what type of response comes back? Do the authors say, yes, I have that information and I'm willing to share it? What they found is that within about five years it plummets: the ability to find the right person, and to have them still be willing and able to engage, drops dramatically. I think that's all the more reason to keep pushing toward this, because the expectation that the authors will always have all that information on hand is not realistic either. Okay.
There's another question: is the OSF planning to rank publications based on the completeness of the information available for reproducibility? That's a really cool question. As far as I know, we're not looking to rank publications, but we do have something related that I think you'd find very interesting: the Transparency and Openness Promotion (TOP) Guidelines. This is largely a journal initiative, also aimed at funders, consisting of policy changes that journals can make to increase transparency and openness, which in essence leads to reproducibility, right? The more transparent and rigorous we are, the more reproducible we are, from both the computational and the empirical standpoint. We're working with journals to implement these guidelines. That doesn't so much let you rank publications, but it does give you a better idea of what the journal policies are, and of how papers look in light of those policies. You can imagine that if a journal says you must have all your code and analysis scripts publicly available at the time of publication, those papers will more than likely be more reproducible. The cool thing is that we can actually start to test that as these policies get implemented: do they actually lead to more reproducible research? I think that would be a really fun question. Another way of doing this, again not ranking but acknowledging, is badges. We have badges, Mozilla Science Lab has badges as well, and we're working with groups to figure out ways to reward these types of practices. These are not used by the journal as a requirement for acceptance; you don't have to make your data available to have your paper accepted, but at the time of publication, if
your data are public, you can be awarded this badge. It really is kind of like a gold star, but it's an easy way to recognize open practices, and it lets you easily see which papers are following them. That's a much easier way of assessing a paper than constantly having to dive in and ask: do they have the data available, and where can I find it? A badge is a quick signal: before you even read the paper, you know the authors have made their data available, or that they preregistered their studies. Then we have another question from Adam: great presentation; have you received any feedback from original authors, or from the journals that published the articles, in cases where the original study wasn't quite replicated? That's a really good question. A couple of things. At all times throughout the project we try to engage the authors; that's the practice we decided to take. Getting back to the earlier question of whether engagement should be a requirement, that's a good debate to have, but we decided to engage at every level, because we know we have the best chance to succeed if we do. And of course, once the results are known, it becomes more a matter of speculation: what might account for a discrepancy, if a discrepancy exists, which is what your question is getting at. So for every single author, whether we had engagement with them or not, we made sure, jointly with eLife, that they had access to these papers prior to publication. One thing I didn't mention earlier about the Registered Reports model is that not only did we try to engage
with the original authors to informally gather information, but eLife also had at least one original author engaged as a formal peer reviewer of both the Registered Report, the methodology, and the Replication Study, the results. On top of that, we shared the papers with the authors directly, to make sure we had covered every possibility of that communication not getting through, and eLife offered the original authors the chance to comment on the papers. For these first five, I believe three of the five author groups commented, and you can see those post-publication comments, and of course add your own, on eLife as well as in other public comment venues. Basically, the feedback we've received is, I think, very good feedback: the authors are looking at those results and trying to put them back into context. One thing we definitely want noted is that, just like the original study, a replication is a single data point, a single study. Because publishing direct replications is not the norm, we need to build more of an understanding that these are two data points, and if they disagree on some level, it doesn't necessarily mean one is right and one is wrong. That is a possibility, a false positive, a false negative, or simple error, but it may point to something else: if there's a discrepancy, and we had no reason to expect one, that potentially creates the opportunity to ask how we can remedy it, because it's quite possible that both results are correct and we're missing information. We've not heard from the original journals; we've not purposely reached out to them. When we shared information with the corresponding authors, we suggested that they should seek out whatever
stakeholders they saw fit to share it with, be that other authors of the original paper, journals, funders, whoever they wanted. But we purposely did not seek out the journals ourselves, because this may or may not reflect on the journals at all; it's just a sampling of the literature that we're pulling from. But that's a great question, thank you. Do you have anything you want to add? No, it's great. So another question asks: why is information usually missing? Could it be deliberate, or a lack of space, like a word limit for a certain publication format? That's a really good question, and there are a lot of answers. I can tell you one that we've seen; I know it from my own personal experience, and it came up in a couple of our replications as well, and it's captured by a quote from earlier: the nuance got lost during revisions, under space restrictions. Isn't it amazing that even now, when almost every journal has an online version and supplemental space, even ones like Nature and Science that still do print, we still have restrictions on the number of figures and on space for methods and other detail? I think sometimes this information goes missing not because somebody is being deliberate, but because so much time is spent on formatting and getting a manuscript into the necessary shape to be submitted, let alone accepted, that the detail gets lost pretty quickly. Our culture puts such an emphasis on outcomes, which are important, but we forget the why, which is what you're asking about, and that's just as valuable to understanding the outcome as the outcome itself. So that speaks back to how we can shift those incentives, either by having journals require the information,
or, probably easier, by creating more space. In some ways that's what we're doing: even though eLife is quite liberal with space, it's actually quite hard to capture all that information in a digestible format such as a text file. So, as Alex was pointing out, what we're trying to do, and what I'd recommend for anybody, is to start augmenting your publication with other repositories, other means of sharing that information, so that the why doesn't get lost, especially since the why is always established before you write the paper up for publication. Now, in terms of deliberate omission, it's really hard to ever tell. There are times, especially in biomedicine, where IP comes into play. That occurred with a couple of ours, but it was never a sticking point, because when IP meant the authors couldn't share a material or a process with us, they usually shared the downstream material instead. So we can't quite ask how reproducible that intellectual property is, but there's always going to be some limitation on every experiment we do. That's not so much deliberate in the sense of hiding as deliberate in the sense of protecting. In a more abstract sense, there's also a discussion happening right now, which will probably be field specific, about what level of detail is actually necessary to share in these papers. That's an ongoing discussion we're going to need to look into more, because you can imagine a variable that nobody understood was necessary to replicate a study until halfway through the replication. So that's a conversation that needs to happen, is happening, and will continue to happen. Right, and I think that's something everybody can engage with, because you're absolutely right: there's this question of how
much is enough. There are a couple of ways to think about that. One is that for certain types of methodologies there probably is a defined threshold, and people have started to define those. But what might be the healthier way to think about it is: how much can I capture? I don't know who is going to use this information; it could be my future self, or it could be somebody else, because we're doing these experiments not just to say "I found something" but "this is how I found it". The more we can share, and the easier and more automated we can make that sharing, the more these questions will shift over time. And think about this: we can now have online repositories, like the Open Science Framework, supplement the way we publish, which wasn't possible 20 years ago. We relied on paper publications, and the limitations existed because we were putting things in print; we couldn't publish all of the material, because it would have been impossible to disseminate it through physical journals. That already tells you that technology continues to play a role here, and we should be embracing it more and more. I'm sure in another 10 years we'll have different questions, but still around similar points. There's another good question: if you encountered an error in one of the original studies, would you try to remedy it, or would you proceed to replicate including the error? And of course, the follow-up is whether it would still be a replication of the findings, of sorts, if the error was not remedied. That's a really good question. There was a Slate article written about this project, but also about replication
in general, that was trying to hit at this same point. Daniel Engber wrote the article; it's a nice one to read. I remember talking to him at one point, and he asked something very similar. He was thinking of it in terms of cell lines: if the cell line is contaminated, do you follow through with the contaminated cell line, or do you stop and say you can't replicate the study because of the contamination, or do you remedy it and move forward, recognizing that maybe the reason you can't see the effect is the contamination? That's a relatively easy case, because there's a norm that says don't use contaminated cell lines. It gets a lot trickier as you go into all the different techniques and approaches. Our general approach in this project has been to do everything the same way, because we want to understand the result even given the limitations of the materials, limitations that we, and perhaps the original authors at the time of the study, didn't fully understand. Say there's an error in that an antibody they used detected the wrong antigen; we still want to repeat that mistake, because we want to ask whether, error and all, we can repeat the result. Generally, though, when we found something like a missing control, we would include it, or the peer reviewers would ask us to include it, because there's a fundamental problem in saying: well, maybe the original should have had this control; they didn't do it, but you need to, because otherwise we're completely missing it. We always try to include those additions, because, and this is always true for replication, you can add value not just about the original effect; you can also start to augment it. Of course, we have to balance
that against the scope of the project, because we can't do everything that would benefit the community, but we do try to pick it up on certain issues where this kind of piece is missing. We don't always have full access, but sometimes you'll see statistical errors, for instance. We only had a couple of cases where it was completely the wrong test, and when that occurred we obviously would not repeat the error, because it was simply incorrect. More often it was a disagreement about the ways one could run an analysis, so we'd follow the same approach: we follow the way the original authors did it, but since there's no norm in the field, or perhaps a shifting norm, we also conduct the analysis another way and make sure that we're powered for both of those analyses. So again, it's a balancing act: doing it exactly the same way while recognizing that we will potentially be adding new methodologies. Hopefully that answers your question. It does raise a really interesting point, which is: when that occurs, what are you comparing? Whenever we do these, we always make sure to compare as directly as we can between the original and the replication, because that's where the value is gained. And if you added an additional control, and it sheds new light but wasn't done in the original, then what you've done is hopefully a good thing: you've added more knowledge. And then the next question is, is that addition reproducible itself? I don't have anything else to add. No, I would just add that anything we do tweak, as Tim mentioned, we try to surface in the same way that we surface the confirmatory analyses we're trying to reproduce; we also surface the exploratory additional analyses and variables that we
sometimes add. And I think that's a really good point, because, especially in biology, we're great at doing conceptual replications and hitting things from multiple angles, which is vital for understanding. Generally, when we see somebody's methodology, we tend to quickly adjust what we see as flaws, or as quirks that make us wonder why they did it that way, thinking we should do it our own way. That's really helpful, because it gets at robustness and at understanding. But it's important to remember that we can sometimes be led the wrong way, and that's why doing things the same way is actually very valuable for getting a clear understanding of what the actual effect was and whether we can obtain it. Otherwise, if we keep shifting the method, which is important in itself, we keep assuming each time that the original effect is true, but we don't know that unless we actually test it. So it's a fine balancing act, and I think it's always important to consider both. So thank you all for coming this afternoon. As Tim mentioned, the webinar has been recorded, and we'll post it on the Center for Open Science YouTube channel, where you can also find all of our old webinars, as well as the webinar on using R with the OSF that Tim mentioned. If you have any questions after the webinar, feel free to shoot us an email at contact@cos.io; we'd be happy to answer. So thanks so much for the great questions, and for joining the webinar this afternoon.