Great. Thank you, Brian. So today I'm going to talk about how research is cumulative. New investigations build on, challenge, or qualify claims made on prior evidence. Explanations for observed findings are often wrong, and the process advances through a constant dialogue between explanations and evidence. It is normal that explanations are wrong and are constantly being interrogated for how they can be refined and reconsidered. This is a self-corrective process, and it's a hallmark of research. A key part of a healthy self-corrective process is that the evidence used for generating and refining explanations is credible and trustworthy. But there's concern that there is little opportunity for self-correction to occur, because the foundation of evidence upon which progress is built is weak, allowing us to generate explanations for phenomena that don't exist, or inaccurate explanations for phenomena that do exist. During the last 10 years, a lot of progress has been made in evaluating the credibility of that foundation of evidence, and across disciplines, teams have evaluated the credibility of evidence by trying to repeat it. So today I'm going to talk about four different approaches to repeating evidence, and I'm going to give examples across disciplines for each one.

The first one is process reproducibility. This refers to the ability to access the underlying information that supports the claims being made. If the underlying data, code, and other research materials are not available, then it's difficult or nearly impossible to evaluate whether the findings are reproducible, accurately reported, or credible.

Here's a study that looked at the accessibility of data in articles published in BioMed Central journals. The most common data availability statement was to ask the authors for the data. However, when asked to make the data available, many did not respond or declined, and in the end just seven percent actually provided the requested data. This means the vast majority of published papers could not be assessed to see if the findings were accurately reported.

In the Reproducibility Project: Cancer Biology, which focused on preclinical research, we observed similar things. We looked at the accessibility of data, code, and research materials and the clarity of protocols, and we also assessed authors' helpfulness in the process. If accessibility and sharing were high, what we'd expect to see on the screen is very little or no yellow or red. But for every aspect, barriers were encountered to accessing the underlying information behind published findings.

In ecology, this study assessed published papers for data and code accessibility. The positive side here is that the policies and norms for data sharing have matured, so that 79 percent of published articles had available data. However, code sharing has lagged behind: only 27 percent of those papers had code. That means together only 21 percent of articles made both data and code available, so only a small proportion of the published literature had the outputs needed to let someone attempt a reproduction directly from the researchers' work. And reproduction attempts with data alone are more challenging, because without the code you have to reconstruct the analysis from what is written in the text, which is itself a very challenging or impossible task.
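To make the arithmetic behind those ecology numbers explicit, here is a minimal sketch; the shares are the ones just quoted, treated as inputs, and the point is simply that joint data-and-code availability is the binding constraint for direct reproduction attempts.

```python
# Minimal sketch using the shares quoted above as inputs.
share_with_data = 0.79        # articles with available data
share_code_given_data = 0.27  # of those, the fraction that also shared code

# Only articles with both data and code can be re-run directly.
share_with_both = share_with_data * share_code_given_data
print(f"{share_with_both:.0%}")  # ~21%
```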
And in the SCORE project, a project in which we assessed the credibility of the broader scientific literature over a 10-year span covering more than 60 journals, we looked at the availability of data and code in the articles from this corpus. We observed that more recent papers had more content available. That's partly because accessibility inevitably erodes over time, but also because preservation and sharing practices have improved during that 10-year period. But if we split the data by discipline instead of time, we see a different story: much of that improvement is concentrated in particular fields, such as economics. This implies that sharing practices can change through technological, normative, and policy interventions, but those changes only occur in communities that deliberately change their culture and implement those practices.

Okay, those examples illustrate shortcomings across fields in sharing sufficient information to make it possible to assess reproducibility. There's also a lot of research that looks at what happens when data or code are shared: whether the reported outcomes can be reproduced using the original analysis strategy. In principle, we'd expect this kind of reproducibility to be 100 percent; all we're trying to do is assess whether the paper accurately reported the outcomes of its own analysis. What I'm going to show you is that no study ever achieved 100 percent, and many fell very far short of that.

For example, here are three studies in economics that each identified a sample of articles, counted how many had available data and code so that a reproduction of the findings could be attempted, and then counted how many of those attempts successfully reproduced the original findings. So you can look at this two ways: the success rate for attempts, or the success rate for all articles, counting the lack of access as a failure as well. The reproduction success rates were low either way. So sharing data and code does not by itself guarantee the reproducibility of findings.

And in cognitive science, this study examined the reproducibility of findings published in the Journal of Memory and Language under a brand-new data sharing policy. While the policy increased data sharing, as you can see on the slide, reproducibility rates were lower than desired when code was shared, and even worse when code was not shared. This replicates what I just showed you in economics and adds that the situation is worse when data are shared alone.

Here's a study of findings that used electronic health records. In this case they purposely did not use the original code; they recreated the analytical strategy based on what was written in the paper. If reproducibility were high, we'd expect all the columns to cluster in the middle, around that value of one. But instead we see wide variation between the original findings and the reproductions, sometimes quite extreme.

And finally, in the SCORE project we looked at the reproducibility of articles where we had access to the data, or to the data and code. If code was not available, we reconstructed the analysis, just like I described before. In no discipline did we achieve 100 percent. Even when we relaxed the criterion so that a result only needed to fall within a 15 percent margin of the original, an approximate reproduction, we still did not achieve 100 percent. So collectively, the first takeaway from outcome reproducibility is that this shortfall is itself very replicable across disciplines.
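To make those two ways of counting concrete, here is a minimal sketch with hypothetical counts, not the numbers from any of the studies above; it also shows how a relaxed criterion like the 15 percent margin could be checked.

```python
# Minimal sketch with hypothetical counts -- not the actual study data.

n_articles = 100           # articles sampled from a journal
n_with_data_and_code = 40  # articles where materials could be obtained
n_reproduced = 25          # attempts that matched the reported results

# Success rate conditional on being able to attempt a reproduction.
rate_for_attempts = n_reproduced / n_with_data_and_code   # 0.625

# Success rate over all articles, counting missing materials as failures.
rate_for_articles = n_reproduced / n_articles              # 0.25

def within_margin(original, reproduced, margin=0.15):
    """Check whether a reproduced estimate falls within a relative margin
    of the original estimate (the relaxed criterion mentioned above)."""
    return abs(reproduced - original) <= margin * abs(original)

print(rate_for_attempts, rate_for_articles)
print(within_margin(original=0.42, reproduced=0.38))  # True: within 15%
```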
The second takeaway is that it can change, and some fields are doing it.

All right, robustness. This is a third way of assessing credibility. Robustness refers to reanalyzing the same data behind an original finding while considering reasonable alternatives to the analytical strategy. For example, there might be different rules for treating outliers, different decisions about inclusion criteria, different variables to include in the model, or different ways to combine variables into measures. We would not expect all of these different analytical decisions to give us the exact same estimates, the exact same evidence, but we do expect what we conclude from those original findings to be reliable and robust across them.

Here's the many-analysts study. It had 29 teams that used the same data set to address the same research question: whether soccer referees are more likely to give red cards to dark-skin-toned players than to light-skin-toned players. Each team developed an independent analytical strategy to test this exact same hypothesis. No two strategies were the same, and there was variation in what was actually observed. About two thirds of the estimates were significant and positive; those are the green dots you see there. About one third were not; those are the gray estimates. And there was substantial variation from the smallest to the largest observed effects.

Here's one that looked at COVID spread early in the pandemic. Nine teams used the exact same data to estimate the reproduction number of the virus. The estimates varied from one whose confidence interval overlapped with zero, on the left, to one indicating a high rate of spread, on the right. In a typical paper, the variability associated with these analytical decisions is invisible. Usually we just get one analysis in the paper, and it's heavily dependent upon which one the authors pick and report. So from a single data set like this, one might conclude either that COVID is spreading fast or that the pandemic is receding.

Here's a last example, from neuroscience. Seventy teams analyzed the same neuroimaging data set to test the nine distinct hypotheses you can see along the bottom. The y-axis shows what proportion of the teams observed a significant result for each of these hypotheses. Only for one, H5 on the right, did all the teams consistently have a positive significant finding, while for three on the left the findings were consistently non-significant across teams. But you can see five that sit in the middle, where between 20 and 35 percent of the teams reported a significant result. Collectively, these robustness studies tell us that analytical decisions are an unrecognized source of variability in findings, and the lack of visibility of that variability is a threat to the credibility of any one analytical strategy.
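As a concrete illustration of what this kind of robustness reanalysis can look like, here is a minimal sketch on simulated data; the outlier rules and covariate choices are hypothetical stand-ins for the kinds of analytical decisions described above, not the strategies any of these teams actually used.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data standing in for a shared data set that many teams analyze.
n = 500
covariate = rng.normal(size=n)
exposure = rng.normal(size=n)
outcome = 0.3 * exposure + 0.5 * covariate + rng.normal(size=n)

def ols_coefficient(y, predictors):
    """Fit ordinary least squares and return the coefficient on the first predictor."""
    X = np.column_stack([np.ones(len(y))] + predictors)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta[1]

# Each combination of choices below is one "reasonable" analysis of the same data.
estimates = {}
for z_cut in (None, 3.0, 2.0):          # rule for excluding outliers on the outcome
    for adjust in (False, True):        # whether to adjust for the covariate
        keep = np.ones(n, dtype=bool)
        if z_cut is not None:
            z = (outcome - outcome.mean()) / outcome.std()
            keep = np.abs(z) < z_cut
        predictors = [exposure[keep]] + ([covariate[keep]] if adjust else [])
        estimates[(z_cut, adjust)] = ols_coefficient(outcome[keep], predictors)

# The spread of these estimates is what a robustness check makes visible.
for spec, est in estimates.items():
    print(spec, round(est, 3))
```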
The last indicator I'm going to talk about is replicability. This refers to collecting data independent of the original study to see if we can obtain similar evidence for the original finding. There are many reasons why we wouldn't expect the results to be the same; a lot of this is about interrogating what is necessary and sufficient to obtain a finding. But a well-powered, good-faith replication that does not obtain similar evidence suggests that we don't yet understand the conditions necessary, and that should lower the confidence we have in the relationship between the evidence and our explanation for that evidence.

For example, this is from the Reproducibility Project: Psychology, which attempted to replicate 100 findings from three prominent journals in psychology. The x-axis shows the original effect size, and the y-axis shows the replication effect size. In the ideal case, we'd expect the points to cluster along that diagonal line. However, we observe a lot of variation between the originals and the replications, with many points sitting below the diagonal, suggesting that the replication findings are weaker than the originals; on average the replication effect sizes were about 50 percent smaller. Using another measure of replication, we can also look at whether the result was statistically significant in the same direction. There we see that only 36 percent of the replications were significant in the same direction, compared to 97 percent of the original findings.

We found similar results in the Reproducibility Project: Cancer Biology. Here we actually saw that the average replication effect size was 85 percent smaller than the original. To put that in perspective, this would be like an original finding that a potential therapy increased average tumor-free survival in mice by 20 days, which is pretty typical, but in the replication it was three days. That has a huge impact on the confidence someone will have as a finding moves down the drug discovery pipeline.

And finally, in the SCORE project we completed 153 replications across these disciplines. Here replication was determined the same way as in the psychology project I gave you as an example: statistically significant in the same direction. Across all these disciplines, and similar to the other efforts, about half of the original findings failed to replicate using this criterion. So just like the reproducibility studies and the robustness studies I shared with you, systematic replication efforts suggest there's a lot of room for improvement if we're really going to be able to repeat the results of the published literature.

And this is important, right? It's really important because of the implications for what we use these findings for: solving real problems. Here's an example from the ALS Therapy Development Institute. This graph compares the original published findings, the green bars, with what they found in their replications, the black bars. In all these cases the replications fell well short of those exciting published findings. All of these also had disappointing results in human trials. So this leads to wasting time, money, and in this case animals, building on evidence that was not as reliable or credible as originally thought. It also means pushing false hope into clinical trials and experimenting on people when we might not have done that if we had known these replicability rates beforehand.

So the evidence provided by all of these systematic efforts on reproducibility, robustness, and replication suggests there's a lot of room for improvement. And if we can improve the foundation of the evidence, then we might actually be able to reduce friction and increase the pace of discovery. Let me give you an example here. This is looking at the drug discovery pipeline, from target validation on the left to when drugs actually get approved for the market on the right. If we look at the success rate of trials across all different areas, there's roughly a 90 percent failure rate from phase one to drugs that are finally approved.
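To show how such an overall failure rate arises from the stages of the pipeline, here is a minimal sketch with hypothetical per-phase success rates, not figures from the slide; it also shows why folding in an unknown preclinical stage can only make the overall picture worse.

```python
# Minimal sketch with hypothetical per-phase success rates, not figures from the slide.

clinical_phase_success = {
    "phase 1": 0.60,              # hypothetical probability of advancing past phase 1
    "phase 2": 0.35,              # hypothetical probability of advancing past phase 2
    "phase 3 to approval": 0.55,  # hypothetical probability of approval after phase 3
}

# Overall probability that a drug entering phase 1 is eventually approved.
p_approved = 1.0
for p in clinical_phase_success.values():
    p_approved *= p
print(f"overall failure rate from phase 1: {1 - p_approved:.0%}")  # ~88% here

# The preclinical stage also filters out candidates, but its failure rate is
# largely unreported; any preclinical success rate below 1 makes things worse.
preclinical_success = 0.5  # hypothetical
p_from_preclinical = preclinical_success * p_approved
print(f"overall failure rate from preclinical: {1 - p_from_preclinical:.0%}")  # ~94% here
```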
And that 90 percent figure only takes into account the failure rates that are known, which are those in the clinical trial stages, phases 1, 2, and 3. We don't know much about failures in the preclinical stages, so the situation is actually much worse than 90 percent. So why could this be? Why is this the case? Why are we failing so much? Well, one reason is that it's hard. Research is hard. The animal models we use don't translate to human disease, don't translate to humans, as readily as we'd like, and that's why we have to keep doing investigations and keep investing in this work. But another reason is that we're pushing things into that pipeline fast when maybe we should be interrogating their credibility, especially at the divide between preclinical and clinical research. We don't have, or shouldn't have, enough confidence in these findings to push them forward that fast. And overall this points to an opportunity ahead of us, which is to optimize the efficiency of the self-corrective process of science. If we use methods like the ones I just presented, failing is necessary to finding what's right, but what we need to be open to is making it possible to observe failure. Thank you.