Most published research findings are false, according to a recent paper in the journal PLoS Medicine written by John Ioannidis, a professor of epidemiology at Tufts University School of Medicine. Well, I suppose we had a good run. I'll inform the other scientists we can start packing up our desks. Science as a human endeavor is dead. We'll just have to find something else to do. World domination has always held an attraction for me, and I think I might check the want ads for something in that field. Start out as a minion and work my way up to Evil Genius.

This kind of headline is (a) completely accurate and (b) completely misleading to non-scientists. In fact, if someone quotes this at you, there's a good chance they've never read it, and also a reasonable chance that they're on the road to science denialism. I want to discuss what it actually has to say.

First, let's pick a medically relevant hypothesis. Say we hypothesize that black licorice can be shown to significantly increase the risk of cancer in rats. We actually need to set up two competing hypotheses: the alternative hypothesis, which we just stated, and the null hypothesis, which is that black licorice cannot be shown to significantly increase the risk of cancer in rats. The word "significant" here has a specific meaning. It doesn't mean important or meaningful so much as unlikely to have occurred by chance. If we feed licorice to 1,000 rats, and no licorice to another 1,000 rats, the number that get cancer is going to be affected by chance, and also possibly by the treatment. If we find, for example, that 50 of the control rats and 55 of the licorice-fed rats get cancer, is that a significant finding? Or is it simply what we should have expected, given a normal random distribution of cancer risk? What about 50 and 60? Is that enough of a difference to say that black licorice is a scourge to mankind?

Ronald Fisher, the evolutionary biologist and statistician, provided a way of answering this question. He set out a procedure for deciding whether something was significantly above random chance. First, we define our hypotheses, as we've already done. Second, we clearly define the statistical relationships and assumptions. Third, we make the observations and apply the appropriate test statistic. Fourth, we make a decision based on the outcome of the test statistic, and we reject or fail to reject the null hypothesis. We always start from the position that the null cannot be rejected; in this case, that licorice is safe. If it were a new drug, the null hypothesis would be that it is not safe, or that it is not effective, and only successfully rejecting the null would be a vindication of that new drug. Our prior assumption for something we've never tested before is that it doesn't have the desired effect.

Our test statistic today, and this is a common one, is Student's t-test. It was developed by William Gosset, an Oxford graduate and a chemist working for the Guinness brewery, as a way of monitoring the quality of the stout being produced. It's not called Gosset's test because Claude Guinness didn't want his competitors to know he was using statistics to monitor quality; instead, Gosset published under the pen name "Student." Fisher actually created the form we use today, but it still bears the pen name of a secret brewery mathematician. There are certain assumptions behind Student's t-distribution: a small sample size and an unknown standard deviation for the population. The distribution looks a bit like a normal distribution, but with heavier tails on either side.
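As a rough illustration of Fisher's four steps applied to the licorice example, here is a small Python sketch (my own, not from the original paper or the video). The counts, 50 cancers out of 1,000 control rats versus 60 out of 1,000 licorice-fed rats, are the hypothetical numbers from the example above; for a yes/no outcome like this, the two-sample t-test behaves essentially like a large-sample test of proportions.

```python
# Hypothetical sketch of the licorice example: is 60/1000 vs 50/1000
# a "significant" difference under a two-sample t-test?
import numpy as np
from scipy import stats

n = 1000                                             # rats per group (made-up numbers)
control  = np.r_[np.ones(50), np.zeros(n - 50)]      # 50 cancers in the control group
licorice = np.r_[np.ones(60), np.zeros(n - 60)]      # 60 cancers in the licorice group

t_stat, p_value = stats.ttest_ind(licorice, control)  # step 3: apply the test statistic
alpha = 0.05                                          # Fisher's conventional 5% level

print(f"t = {t_stat:.2f}, p = {p_value:.3f}")
if p_value < alpha:                                   # step 4: decide
    print("Reject the null: the difference is 'significant' at the 5% level.")
else:
    print("Fail to reject the null: chance alone can explain a gap this size.")
```

With these particular made-up counts, the gap of 10 rats in 1,000 is well within what chance alone produces (the p-value comes out around 0.3), so we would fail to reject the null, which is exactly the judgment call the test statistic is meant to formalize.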
We take our population of rats and apply the test statistic. Now we need to decide at what level we will consider the results significant. Fisher proposed a standard 5%, or 0.05, significance level, a value denoted as alpha. What does that mean? It means there is a 5% chance that we will falsely conclude that there is a difference between the groups by chance alone. Take our example with 50 rats in the control group and 68 rats in the licorice group getting cancer. Suppose we conclude that the difference is significant; at this level there is always a 5% chance that we are falsely concluding there is a difference when no difference really exists. Out of every 100 such studies, we would expect a false conclusion in about 5. We call this the false positive, or Type I, error.

There is also the opposite error: failing to reject the null, and thereby rejecting the alternative hypothesis, even though there really is a difference. We conclude the drug is not safe, or not effective, or that the licorice is not a carcinogen, when it really is. This is a Type II error, or false negative. The false negative rate is strongly affected by the power of a test. In our example, suppose we only had 10 rats in each group and exactly 3 in each group got cancer. We might conclude that there is no significant difference between the two groups, but that's because the noise level contributed by randomness is so high. If we repeated this 100 times, we might never be able to detect the very real 20% difference in cancer rates between these two populations. So there are two ways we can be wrong in every test of a hypothesis, and two ways we can be right.

What can we say generally about the numbers of false and true alternative hypotheses? How many false hypotheses exist, and how many true? There is a very large number of hypotheses that can be false, but only a relatively small number that can be true; we can hypothesize a lot more things than can actually be demonstrated to be true. So in our four-quadrant display, the majority of null hypotheses are likely to be true. That makes the small 5% false positive error a pretty major contributor to coming to the wrong conclusions.

I'm going to borrow an example from an excellent article by economist Alex Tabarrok on his blog, Marginal Revolution. Suppose we take 1,000 hypotheses. Assume that most of them will be false, for the reasons we just discussed: 200 are true and 800 are false for our purposes. Our level of significance allows 5% of the 800 false hypotheses to test positive, so about 40 will be false positives. From the true set, let's assume the power of our tests allows us to pick up 60% of the 200 true hypotheses, or 120 true positives. That gives us 160 total cases where we reject the null, or if you prefer, where the results support the alternative hypothesis. But only 120 of those 160 are true positives, or 75%. The other 25% of the results are giving us the wrong answer without any need for publication bias or researchers with conflicts of interest.

There's more to it, of course; this is just where the objective numbers get us. Bad research can also bias results, and so can bad researchers: our expectations, our conflicts of interest, or simply bad statistical assumptions. One of the most important elements is the economics of sample size. It's cost-prohibitive to run trials with thousands of animal models or enrolled patients, yet that's what's needed when the effect size is relatively small. We often do "good enough" research and add caveats to the conclusion saying that further study is warranted, or calling the study a pilot study.
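To make the bookkeeping in Tabarrok's example concrete, here is a small sketch of the arithmetic (my own, using the same assumed numbers: 1,000 hypotheses, 200 of them true, alpha of 0.05, and 60% power) that computes what fraction of "significant" results are actually true.

```python
# Sketch of the 1,000-hypotheses arithmetic described above (assumed numbers).
total      = 1000
true_hyps  = 200                 # hypotheses that are actually true
false_hyps = total - true_hyps   # 800 that are actually false
alpha      = 0.05                # Type I (false positive) rate
power      = 0.60                # 1 - beta: chance of detecting a true effect

false_positives = alpha * false_hyps    # 0.05 * 800 = 40
true_positives  = power * true_hyps     # 0.60 * 200 = 120
all_positives   = false_positives + true_positives   # 160 "significant" findings

ppv = true_positives / all_positives    # fraction of positive results that are real
print(f"{all_positives:.0f} positive results, of which {true_positives:.0f} are true "
      f"-- positive predictive value = {ppv:.0%}")   # 75%
```

Lowering the power, or lowering the share of hypotheses that are true to begin with, drives that 75% down further, which is the core of the arithmetic behind the headline claim.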
So what should we do? Should I go get fitted for my minion uniform, a spandex jumpsuit and a hard hat with a stripe down the middle? No, and none of this is surprising to me. I think most young grad students learn this intuitively by conducting their own research and following the literature. Results are variable, and what starts as a very interesting outcome becomes less interesting when it's repeated or when a larger sample size is used. Engaging in research has a way of grinding down bright-eyed optimism about surprising results; even if your understanding of statistics isn't very deep, you learn to live with disappointment. So when someone sends me a single study in support of some surprising, contradictory claim, I'm always very skeptical, even a little cynical, that this single outcome has overturned our prior knowledge on the topic. Likewise, numerous small studies, each of them inappropriately small for the effect size, do not add up to a strong conclusion.

What is needed is some way of applying a little extra caution to the really outrageous conclusions, and some have proposed just such a mechanism: Bayesian prior probability. This is a slightly subjective way of measuring how likely an alternative hypothesis is. Hypotheses that represent major departures from existing models need a stronger effect, a larger population, or less contribution from random noise. It's a way of codifying the basic skeptical, rational principle expounded by Carl Sagan: extraordinary claims require extraordinary evidence. When we apply this to concepts like psi or precognition, which frequently find support for their hypotheses but would require a new theoretical model in multiple fields of science, it allows us to cut through the 90% of such research published with small sample sizes or very small but significant differences.

Scientists learn very early in their careers that individual papers are interesting and can lead to marvelous discoveries, but cherry-picking single studies is never as reliable as looking at the entire body of literature on a topic. A single result, unless it is a truly definitive study, is not very persuasive when stacked against dozens of other papers with contradictory findings. In spite of what you may have been taught, a single documented exception to a theory is not sufficient evidence to overturn it. What is needed, most of all, is repetition of good research, skepticism about weak or small studies, skepticism about very small effect sizes, and a good theoretical underpinning for the finding. We also need to learn the difference between evidence-based medicine, where a difference can be demonstrated between treatment and controls, and science-based medicine, where not only is there evidential support, but it also agrees with a theoretical or mechanistic understanding of the disease, condition, or pathway. Surprising results whose mechanism we don't understand should be regarded with more skepticism.

There are many documented cases where a researcher made bad choices in experimental design, either intentionally or through simple bad judgment. I think the effect of researcher bias is a lot less prevalent than most people think, and researcher incompetence, or simply sloppy design, is a lot more common. We scientists need to be a bit more strict about the misuse of the p-value, or of measurements of significance as the sole determinant of whether a result is real or not. There is more to research than simply observing statistical differences.
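One way to see how the Bayesian prior probability idea above operationalizes "extraordinary claims require extraordinary evidence" is to ask: given a significant result, how likely is the hypothesis to actually be true under different priors? The sketch below is my own; the alpha, power, and example prior values are illustrative assumptions, not figures from the paper.

```python
# Sketch: posterior probability that a hypothesis is true after a
# "significant" result, under different prior probabilities (assumed values).
alpha = 0.05   # false positive rate
power = 0.60   # probability of a significant result if the hypothesis is true

def posterior(prior: float) -> float:
    """P(hypothesis true | significant result), via Bayes' theorem."""
    p_significant = power * prior + alpha * (1 - prior)   # total prob. of a positive
    return power * prior / p_significant

# From a plausible, well-motivated claim down to a psi-level claim.
for prior in (0.5, 0.2, 0.01, 0.001):
    print(f"prior {prior:>6.3f} -> posterior {posterior(prior):.2f}")
```

Under these assumptions, a hypothesis with even odds going in ends up over 90% likely to be true after one significant result, while a hypothesis with a one-in-a-thousand prior, the kind a precognition claim deserves, remains almost certainly false despite its textbook-significant p-value. Only stronger effects, larger samples, or independent replication move it meaningfully.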
An observed statistical difference should certainly inform further research, but it should not be the end of it. We also need to get a lot better at communicating these concepts to journalists, policy makers, and the general public.

So how about our poor rats fed that nasty licorice? Do you feel a little differently about the importance of the outcome of our study? Even if we find a significant result, that result on its own is probably not enough to ban licorice. Politicians and journalists, like most of the general public, don't get this, and so policy choices are often made on a poor understanding of research outcomes. I hope that when you're presented with some similar finding, a new cure for cancer, or a newly reported association between watching YouTube videos and IQ, you'll apply a little extra skepticism.

Yes, many or most research findings are false, but the more good studies there are on a topic, the better the chance that we'll converge on something like the correct result. Also important is prior probability, the theoretical basis or mechanism by which something might have occurred: extraordinary claims require extraordinary evidence. And if you find someone quoting Ioannidis, check to see whether they even understand why findings are false. If they yammer on about corrupted researchers, industry meddling, political bias, and global conspiracies, you know they never read the paper. The real reason studies are false is much simpler: scientific testing is not perfect. It is a human endeavor carried out in the darkness of ignorance, building a useful tool to light our way through numerous missteps, fruitless pursuits, and false confidence. Ultimately, though, it works better than any other method we have so far discovered. We just have to learn to manage it better, to compensate for our failings, and to accept that it's not a simple way to comprehend the world around us. Thanks for watching.