Hi, thanks very much for coming to this event; I'm delighted to be speaking with you today. The common interest we share is in accelerating discovery: how can we get to the most knowledge, the strongest evidence for solutions, the best cures as quickly as possible, with the fewest resources, or whatever resources we have available? What I would like to do in my session is talk about the fundamental role of replication in accelerating discovery and advancing knowledge.

We know from a range of research over the last 10 years on the culture of science, and on how it is we make progress, that there are a number of dysfunctional elements that get in the way of accelerating knowledge and discovery. At the core are incentives for producing novel, positive, tidy results. Researchers are rewarded for the outcomes they produce rather than for the process and the effectiveness of the methodology they pursue to advance discovery, whether the outcome ends up being real or not.

Those dysfunctional incentives have a number of implications for what researchers end up doing, and then for the credibility of the research that ends up being produced. For example, because novel, positive, tidy results are the priority, we are likely to ignore or selectively report null results, the ones that don't find that the treatment was effective for the outcome. Likewise, we have strong incentives to employ questionable research practices. If I have multiple ways I could analyze a particular finding, and some of those ways are more publishable than others, then I may be more likely to select the findings that look better for publication, even at the cost of credibility, perhaps not even realizing that I'm doing so. Relatedly, there's no reason for me to share the data, the materials, or the process I went through to obtain the results I report, because that's not part of the reward system. And because the focus is on producing novel, exciting, positive, tidy results, why would I try to replicate somebody else's finding just to affirm or disconfirm what they've already shown? Why would I try to replicate my own? I already have the finding. It is already novel. The only thing a replication could do is reduce the confidence we have in the finding based on that initial discovery.

The consequence of those behaviors is that the research literature itself has lower credibility, because we're not representing the breadth of the things we studied and tried to discover. And we miss out on the opportunity to employ the most important part of the scientific process: self-correction. Lots of our initial ideas are going to be wrong. That's not a problem in science; that's ordinary. We're studying hard things we don't understand, so we're going to have lots of false starts. The way science makes progress is via self-correction: initial ideas look promising, and follow-up research finds which ones are false leads, qualifies others, and affirms those that do provide exciting directions. The consequence of all of this is waste. We slow our progress, and we don't get the return on research investment that we need.

So, in all of that, I want to focus primarily on the role of replication. That first discovery is an exciting moment for establishing the possibility of a phenomenon. It might be that I do some research and suggest that exercise improves memory.
And that is a very potent, potentially powerful piece of evidence, a potentially powerful claim for improving the human condition. But of course, that initial discovery is rooted in a particular set of units, treatments, outcomes, and settings. By that I mean there's a particular sample in which I did my investigation, and a particular way we operationalized exercise: it might have been running over the course of a year, rather than bicycling or some other form of activity. The outcomes were likewise very particular: how did we measure memory? Maybe I measured long-term memory in my initial study, rather than short-term memory or working memory. All of those particular features of the research design could be constraints on the general claim that exercise improves memory. But that initial discovery doesn't establish whether there are boundary conditions, or what the constraints on the finding are. So my general claim, "exercise improves memory," has not been evaluated in a systematic way. It's been supported by a single observation.

Understanding matures over time by continuing to investigate that question via a lot of different approaches to testing whether the relationship between exercise and memory holds. Some of those tests, shown in gray on the slide, don't show the phenomenon, and they constrain our confidence or the conditions under which that relationship holds. Others show evidence for the phenomenon and expand the dark blue area, the place where we think we understand how, why, and under what conditions exercise will be related to memory. That expanding understanding in dark blue starts to overtake the light blue area of "I wonder what might happen." The "I wonder what might happen" area represents opportunities, places we think the effect could extend to. It might occur in those settings, but I don't have a strong theoretical commitment that it must, based on my current understanding. So a lot of how research advances is by triangulating on where we see this phenomenon and where we don't.

Now, there's a key role here for replication. A replication is a study for which any outcome would be considered diagnostic evidence about a claim from prior research. So for the claim that exercise improves memory: if I have specified in the original research the conditions under which I think that relationship holds, then a replication is a study that meets those conditions. And as long as it meets those conditions, whatever happens with the replication will alter my view of the phenomenon. If I find consonant, confirming evidence of the finding, my confidence in the overall claim will increase. If my replication does not find evidence for that relationship, my confidence in the overall claim will decrease.

That is critically different from the "I wonder what would happen if" scenario. Maybe it'll happen here; maybe it won't. In those scenarios we are doing generalizability tests, not replications, because we're not actually confronting our current understanding. If a generalizability test of "I wonder if it happens over here" succeeds, great: the scope of my phenomenon has expanded. But if it doesn't occur over there, it doesn't revise my confidence in the core claim. I just say, okay, it doesn't happen over there, no big deal, but I still believe that exercise improves memory.
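To make that distinction concrete, here is a minimal sketch of the diagnosticity idea, using Bayes' rule with assumed illustrative numbers: a 50 percent prior that the claim is true, 80 percent replication power, and a 5 percent false-positive rate. None of these figures come from any specific study; the point is simply that either outcome of a true replication moves our confidence, up or down.

```python
# Illustrative sketch (not from any study in this talk): Bayes' rule shows
# why a true replication is diagnostic. Assumed numbers: prior belief 0.50
# that the claim is true, replication power 0.80, false-positive rate 0.05.

def updated_belief(prior, power, alpha, replicated):
    """Posterior probability that the claim is true after a replication attempt."""
    if replicated:
        likelihood_true, likelihood_false = power, alpha
    else:
        likelihood_true, likelihood_false = 1 - power, 1 - alpha
    numerator = likelihood_true * prior
    return numerator / (numerator + likelihood_false * (1 - prior))

prior = 0.50
print(updated_belief(prior, 0.80, 0.05, replicated=True))   # ~0.94: confidence rises
print(updated_belief(prior, 0.80, 0.05, replicated=False))  # ~0.17: confidence falls
```

A generalizability test is different: a failure there only shrinks the speculative light blue area and, by design, leaves the posterior on the core claim essentially untouched.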
This is very important, because understanding matures starting from that initial discovery, where we have a small dark blue ring of confidence: our theoretical position says the effect will occur when you repeat these conditions, out of the many, many different conditions we might investigate. And in light blue we have an expansive space of "oh, it might occur over here too; it'll be interesting to find out."

In the best-case scenario for the discovery, going up the left side of the slide, as we do replication and generalizability tests we increase understanding: the dark blue scope, where we have theoretical commitments about where the effect occurs, expands, and over time it overtakes the light blue "I wonder what if" area. By the end, when we have a very mature understanding, we have very clear expectations of how the effect occurs and under what conditions we think we would observe it. That gives us many more opportunities to conduct replication tests to affirm or disconfirm our present understanding, because we have a strong sense of when it ought to occur.

In the worst-case scenario for the discovery, off to the right side, every time we try a replication test it fails: we thought we understood when the effect would occur, we tried it again, and it didn't appear. The dark blue circle shrinks, because we expected the effect under those conditions, and our confidence in the finding itself declines. Likewise, as generalizability tests fail ("maybe it'll occur over there"), the light blue circle also starts to shrink. And if we persistently fail to reproduce the original finding under conditions where we were pretty sure it would occur, the dark blue can disappear entirely. We no longer have any theoretical commitment that the effect occurs under particular conditions; all we have left is the hope that it might: "I really, really want exercise to improve memory."

Most of our research is somewhere in between these two paths. Some things succeed, some things fail, understanding goes back and forth and takes time to emerge as we do more and more studies. But the productive nature of this path relies on replication, because replication is the occasion where we confront our current understanding to see whether we can reproduce the findings based on our theoretical expectations. Without it, we might be building a house of cards: lots of individual occasions where we see some phenomenon that seems to be related, but no consistent evidence that those observations are interconnected and actually represent the same phenomenon.

So let me give an example of how this path can proceed. You might recall that we conducted a project called the Reproducibility Project: Psychology. It was published in 2015 and involved replication studies of 100 findings from the psychological literature, and we failed to replicate a substantial portion of them: only about 40 percent or less were successfully replicated, by a variety of criteria. That provoked a lot of debate about why we failed to replicate 60 percent or more of these original findings. Is it a problem with the original finding, in that the understanding is not actually correct? It may have been a false positive, not even a real finding, or it may not be well understood. Or was it a problem with the replication design?
Maybe we failed to actually confront the finding by creating a design where replication would be expected. That debate is not resolvable by argumentation; it's only resolvable by pursuing additional replication. So the Many Labs 5 project was organized to confront some of those failures to replicate with additional replications, moving up along the path we described on the previous slide.

What we did for Many Labs 5 was take 10 of the papers from the Reproducibility Project: Psychology (RPP) where the original authors, in giving feedback, had not endorsed the replication design. They had identified problems that they thought would limit the likelihood of the effect replicating successfully, and indeed, nine of those 10 failed to replicate in the RPP originally. We took those designs and revised them through the Registered Reports model, where the protocol goes through peer review at the journal prior to actually conducting the studies, so that we could get expert review of the protocols and the design and maximize their quality, making each a diagnostic test of the original claim. This provided an opportunity to test our present understanding via expert review, and to see whether the explanation was that the original RPP replications were the flawed part.

In the original studies, those 10 publications, the average effect size across the 10 phenomena was a little over 0.35: robust effects in the original papers. In the RPP, the first study that tried to replicate them, the average effect size was about 0.11. This is where we saw lots of failures to replicate and much smaller effect sizes on average. Then we organized Many Labs 5 and, in a number of labs, ran both protocols: the RPP protocol that had failed, and the revised protocol that had now gone through this peer review process to really maximize the diagnosticity of the tests. These were expertly reviewed study designs that would give us confidence about whether the finding is replicable or not.

What we observed in those two conditions was about the same result. The effects did not rebound to look like the original studies in terms of effect size, even after those revisions. Rather, we continued to fail to reproduce those findings, both by standards of statistical significance and in terms of average effect size compared to the original. So these findings are on the rightmost path of the prior slide: we are continuing to fail to replicate them, and we have yet to identify the conditions under which the phenomena can be observed. Confidence in the findings is declining, as is the scope of conditions under which they might be observed. They still could be observable, but that now requires additional study to identify those conditions, and perhaps in a much more constrained way than the original papers suggested.
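A rough power sketch helps show why the drop from roughly 0.35 to 0.11 is so consequential. This is my illustration, not an analysis from the project: using the standard normal approximation for a two-group comparison, the per-group sample size needed at 80 percent power scales with 1/d², so a replication sized for the original effect is badly underpowered for the smaller one.

```python
# Rough power arithmetic (illustrative, not from the papers): per-group n
# needed to detect a standardized mean difference d with a two-sided test.
from scipy.stats import norm

def n_per_group(d, alpha=0.05, power=0.80):
    z_a = norm.ppf(1 - alpha / 2)   # critical value for a two-sided test
    z_b = norm.ppf(power)           # quantile for the desired power
    return 2 * (z_a + z_b) ** 2 / d ** 2

print(round(n_per_group(0.35)))  # ~128 per group for the original-sized effect
print(round(n_per_group(0.11)))  # ~1297 per group for the replication-sized effect
```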
One of the implications of this is that promoting and doing replication creates a virtuous cycle for advancing discovery and accelerating knowledge. Doing a replication can improve the theories we have for why findings occur. A failure to replicate provides a constraint: okay, it doesn't occur, perhaps, under these conditions; what, then, are the conditions under which we think it will occur? Successful replications expand the scope: oh, we thought the temperature of the room would be irrelevant, we thought the time of year would be irrelevant; the replication varied those things and it didn't matter. They were irrelevant, no worries there, and we can be confident in the fact that the finding did replicate, expanding our understanding. Likewise, as those theories improve, they provide stronger and stronger opportunities to confront the theories where their explanations are limited. Having better theories of when the effect should occur makes replications more impactful for assessing those theories, those expectations. If we can promote some degree of replication in the work that we do, we're more likely to get research domains into this virtuous cycle and accelerate understanding of the phenomena we investigate.

But there are a number of barriers, because of those dysfunctional incentives, to actually doing replications and achieving high replicability, and even to attempting replications in the first place. An example comes from the Reproducibility Project: Cancer Biology, which we've been conducting over the last five years and for which we are just now writing the summary papers. It illustrates the challenges the existing literature poses for even doing replication in the first place.

One metric we evaluated was whether we could obtain the original data in order to assess and understand the reproducibility of the findings reported in a paper, particularly so we could understand, for example, what the effect size was and what the variability and heterogeneity in the phenomenon were. Of the 197 experiments we initially started the project with, only three had data available in an independent repository. For almost all of the studies, no data was directly available. We could ask authors to share their data, but only a subset did, whether because of desire or ability.

Okay, fine: we can't get the data, but at least we know the methodology. We can repeat the methodology and see if we get the same results with new data, right? Well, no. We tried to design replication protocols for each of those 197 experiments, and we counted how many times we could design the entire protocol without having questions that needed clarification from the original authors. We succeeded zero times out of 197 in finding a complete protocol. Sometimes we had very small questions, just clarifications on particular points in the protocol; other times it was hard to understand any part of the experimental design.

That is a very solvable problem. It is a mundane issue, but it is a critical one: to have enough clarity and completeness in the reporting of what I did in my original experiments so that you can repeat them if you so desire, to challenge them, extend them, expand on them, whatever. If we can improve reporting standards and the completeness of methodology reporting, we will dramatically accelerate the ability to get into that virtuous cycle for advancing discovery.

It's not all doom and gloom.
There's also emerging evidence that pursuing some of these good practices around reporting and improving rigor can improve the replicability of the literature. The Decline Effect project is a collaboration of four laboratories, designed as a prospective replication study. Each of these laboratories did their research as they normally do, and when they had a new discovery that fit the criteria for the project, they would submit it for prospective replication. The lab would discover the phenomenon and then submit it to the project for a confirmatory test. That meant the original discovery wasn't the final result; the final result was a confirmatory test, in which the same lab tested the finding again, and whatever the outcome, it would be reported. That confirmatory test had to be pre-registered: we have to say what the methodology is, we have to say how we're going to analyze the data, and we have to report it no matter what. Once that outcome was complete, each of the other labs would replicate the finding by reviewing the methodology and conducting an independent test, and the original lab would do a self-replication during this process.

Ultimately the project produced 16 novel findings, four from each of the labs. All 16 had to go through these confirmatory tests, and then, no matter the outcome of the confirmatory test, each went through this round-robin replication across the laboratories. The project embraced the good practices that are presumed to improve the rigor and replicability of findings. All of the findings had to be pre-registered. There was no selective reporting; we didn't leave negative findings out. The methods were shared in the description, and there was a sense of accountability: each lab knew the other labs were going to conduct independent replications, so I really want to make sure I report my methodology clearly enough that they can do an effective replication.

What we found was that the original confirmatory tests had an average effect size of 0.269, and the replications had an average effect size of 0.262. Almost the same. That is much more robust evidence for replicability than every other test we've conducted to date on published findings that don't necessarily have all of these good practices. Overall, the replication rate using null hypothesis significance testing was 86 percent: that share of the confirmatory tests were reproduced in the replications. That's a remarkable number considering that the average power to detect those effects was 80 percent. If all effects were true and accurately estimated, we would expect a replication rate of 80 percent, and we actually exceeded that a little, almost surely by chance rather than by actually exceeding the limits of power. What it shows is that we essentially achieved maximum replicability in this context, for these particular findings. We can't unpack the particular reasons why we observed that replicability, but we do know that this collection of practices is associated with high replicability of these results. Ideally this will spur additional replication projects like this one, so that we can increase confidence about which factors improve replicability overall.
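As a back-of-the-envelope check on that "almost surely by chance" remark (my illustration; the assumed unit of 16 independent tests is hypothetical, and the project's own analysis may differ): if every effect were true and power were exactly 80 percent, observing a replication rate of about 86 percent would not be surprising at all.

```python
# Quick sanity check (illustrative numbers): with 16 independent tests,
# each with true power 0.80, how likely is a replication rate of ~86%
# (about 14 of 16 successes) or better, purely by chance?
from scipy.stats import binom

n, power = 16, 0.80
observed = round(0.86 * n)                    # ~14 successes at an 86% rate
print(1 - binom.cdf(observed - 1, n, power))  # P(X >= 14) ~ 0.35: quite likely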
All right, I want to close with what funders can do to help shift the cultural context and improve replicability for the purposes of accelerating discovery.

The first is an obvious one. The policies that funders set for their grantees make a big difference in what ultimately ends up being shared, and thus in the possibility of evaluating credibility and conducting replications. Policies that promote open data, open materials, and open code, and that encourage or require pre-registration of studies and analysis plans, will go a long way toward improving replicability and improving the assessment of replicability. The TOP Guidelines provide a framework for what those policies can be, as expectations for grantees, just as they are now applied very widely by journals.

A second solution is to issue RFPs for replications of high-impact findings in your priority areas as funders. These can be predetermined, identifying which findings are worth replicating: one might look at the last three years of work funded and published from one's portfolio and say, here are three really exciting things we published; before we invest additional resources, we want to put out a call for independent replications of those, to provide some confidence that our further investment will have a high potential for return. Or it could be an open call for anything relevant to our research priorities, where you as the proposer have to justify why it's important: why do we need to replicate this finding, as something that is changing the direction of research?

Finally, there is a program we have found to be particularly effective, not just in promoting researchers doing some replications or improving their rigor, but in helping to facilitate that culture-change movement of shifting norms within smaller communities. We call these reproducibility and transparency initiatives (RTIs). The best versions are ones where one or more funders and one or more journals partner on an initiative that promotes more rigorous research practices. Usually these take the form of Registered Reports, with a special section or special issue of a journal. Researchers who propose a project, replication or otherwise, get peer reviewed by the journal in advance of doing the project. If the proposal passes peer review, the journal commits to publishing it and the funder commits to funding it.

Everybody wins. The researcher goes through one peer review process, or a coordinated one, and gets both the funding and the commitment to publication. The journal gets commitments for high-quality, funded research to appear in its pages. The funders get a much stronger guarantee that their investment will actually result in reported research. Many funded projects never have any impact at all because they are never reported at all; this mechanism improves the likelihood that they end up getting reported. Even better, these initiatives have shown high potential for strong impact on the culture of the community observing and reading that literature, because they get lots of attention, and they spur that research community to consider these new practices: greater transparency, a focus on replication, commitments to pre-registration.

We have a just-funded NSF grant to conduct an evaluation of these RTIs, where we will survey communities every year to track their changes in attitudes and behaviors related to things like open data, replication, and pre-registration. The evaluation project will leverage the fact that there are multiple funders committing to do these RTIs in the communities they fund, and you can see seven of them
listed here that have made commitments to do an RTI in a community they fund during the next three to five years. Those will be part of this evaluation. There's plenty of room for other funders to get involved if they'd like to have this evaluative data: the data collection we'll do with NSF support, plus the comparative data on how each funder's research community relates to the communities the other funders are trying to influence. So there's lots of collaborative opportunity here for really maximizing the impact on replicability itself, and on culture change to address some of those dysfunctional incentives.

If we return to that opening slide, where we identified how incentives for positive, novel, tidy results have lots of negative consequences for slowing the pace and increasing research waste, we can identify how these interventions by funders help to address them. The RTIs and Registered Reports dramatically change the incentives around results, because nobody knows what the results are yet: researchers get their publication commitment based on the importance of their research question and the quality of their design; they don't even know the results before they get their reward. The RTIs and the practice of pre-registration, committing in advance to what the study is and how it will be analyzed and reported, eliminate selective reporting, because every finding gets reported regardless of outcome, and eliminate questionable research practices, because I've made my commitments in advance about what I'm going to do and how I'm going to analyze it. TOP policies help address the lack of transparency and sharing by encouraging or requiring that the data, materials, methods, and protocols be made available following the research, or even in advance of it. And finally, the RTIs and calls for replication studies lower the barrier for researchers to actually propose and do replication studies themselves. Given the lack of incentive for replication, it can be very hard for a researcher to invest the resources in doing one without knowing whether they'd be able to get a publication, the primary currency of reward, at the end. But with a Registered Report, I can propose a replication study just by describing the reason it needs to be done and the methodology, and get that acceptance or not at the outset, before I invest the resources in doing it. So it's a much lower bar for me to pursue those investigations.

I'll be happy to take any follow-up questions and discussion about these issues or others. There are a number of links here on the screen if you want to look at particular things about any of these initiatives. Thank you very much for your time and attention.