Hi everyone, today I'm going to be talking about issues of statistical power, specifically what can happen when studies are underpowered. Just to make sure everybody's on the same page: statistical power is the probability of detecting an effect when that effect truly exists in the population. So for example, let's say I'm running a study to see whether men and women differ in their average height. The statistical power of my study tells me how likely I am to find that effect. If I have 80% power and I run 100 studies, so I draw 100 random samples from the world's population, then on average 80 of those 100 studies will come back as statistically significant. In many disciplines, 80% power is considered the lower bound of acceptable power; 90% or 95% power is more ideal, but anything at 80% or above is deemed acceptable.

I'm going to talk about three issues that can arise with low power: false negatives, inflated effect size estimates, and low positive predictive value.

I'll start with false negatives. This is probably what most people think of when they think of issues of statistical power. A false negative is when there is an effect out there in the population, but my particular study fails to reject the null hypothesis, so I do not find a statistically significant result. For example, if I ran my men-versus-women height study and did not find a significant difference, that would be a false negative. The lower the power of my study, the more likely I am to get a false negative. Because power tells me how likely I am to find an effect when the effect is really there, one minus my power is the probability that I fail to find it. So if I have 80% power, 80 out of 100 studies will come back significant and 20 will come back non-significant, as false negatives. If I go down to 30% power, 70% of my studies will be false negatives. That's a problem, because I may conclude the effect isn't there and abandon the line of research unnecessarily, when there really was an effect I could have followed up on.

The next thing I'm going to talk about is inflated effect sizes. For those of you who don't know, an effect size is just a measure of magnitude. In the study comparing men's and women's heights, the effect size would be the actual difference in average height between men and women. If I'm running a treatment-versus-control study, the effect size would be some measure of the difference in effectiveness between my treatment and my control. There is one effect size out there for any effect in the population, the true effect size. However, every sample I draw in a particular study is going to produce a slightly different effect size, and those sample effect sizes form a distribution centered around that true population effect size. To give you an illustration of what this looks like, here's a graph of the distribution of sample effect sizes around a Cohen's d of 0.5. A Cohen's d is just a standardized effect size for mean differences, and a Cohen's d of 0.5 is a medium-sized effect. You'll notice a couple of things. First, the light blue bars show the frequency of each different effect size. This was done by running 10,000 simulations from a population with a true effect size of 0.5, and what you'll notice is that these form a nice normal curve.
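If you'd like a quick sense of what that simulation looks like before downloading the OSF materials, here is a minimal R sketch of the same idea; the per-group sample size of 20 and the random seed are my own assumptions, not the values used for the slides.

```r
# Minimal sketch: draw many two-group samples from a population where the
# true standardized mean difference is d = 0.5, compute the sample Cohen's d
# each time, and look at the resulting sampling distribution.
# (n_group = 20 is an assumed per-group sample size, not the slide's value.)

set.seed(1)
true_d  <- 0.5
n_sims  <- 10000
n_group <- 20

sample_d <- replicate(n_sims, {
  x <- rnorm(n_group, mean = 0)        # "control" group
  y <- rnorm(n_group, mean = true_d)   # "treatment" group shifted by the true d
  pooled_sd <- sqrt(((n_group - 1) * var(x) + (n_group - 1) * var(y)) /
                      (2 * n_group - 2))
  (mean(y) - mean(x)) / pooled_sd      # sample Cohen's d
})

hist(sample_d, breaks = 50,
     main = "Sampling distribution of Cohen's d (true d = 0.5)",
     xlab = "Observed d")
abline(v = true_d, col = "red", lwd = 2)  # the distribution centers on the true d
```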
And the center of that normal curve, the mean of the distribution, is right at the true population effect size of 0.5. This is basically what the sampling distribution for a Cohen's d looks like.

Now, the power of a study is not going to affect the mean of that distribution. No matter what the power, the mean should always be around 0.5, because that's my true population value. However, power is going to affect two very important things: the shape of the distribution, and the areas of the distribution where the Cohen's d values would be statistically significantly different from 0.

To give you an idea of how it affects shape, here are two distributions. The one at the top is the one you saw before; that's the distribution corresponding to samples with 30% power, meaning I was drawing samples of a size that gave 30% power each time. The one on the bottom is the distribution if I were drawing samples with 90% power. You'll notice a couple of things. One, the red line at the mean is in basically the same place for both; it's right at that true effect size of 0.5, and both follow a normal distribution. But the spread is very different. When I have less power, the normal curve is much wider, so the range of possible effect sizes is much greater and I can get much more extreme effect sizes. When I have more power, the distribution narrows, the range of effect sizes shrinks, and I get more effect sizes clustered around that true effect size.

I also mentioned that power changes the portion of the curve that would be statistically significant. The shaded dark blue sections correspond to all the effect sizes in these curves that go with a statistical test that had a p-value of less than 0.05, so the test came back statistically significant. What you'll notice is that when I have lower power, at 30% power, much less of the curve is shaded in; much less of it passes that statistical significance mark, and what does pass is shunted off into the one tail, toward more extreme effect size values. In fact, at 30% power, even the true effect size of d = 0.5 would not be statistically significant. If we go up to 90% power, a much larger proportion of the curve is shaded in, which means many more effect sizes from this curve will come back statistically significant. We still get effect sizes from the tail end of the curve, but many of the significant effect sizes are right around the midsection of the curve, so effect sizes close to the true effect size reach statistical significance.

Why is this important? Well, from previous research we know that many fields have a bias toward publishing only positive results. This graph shows the proportion of statistically significant results in the published literature across a wide variety of disciplines. You can see that even space science, which has the lowest proportion, is right around 65-70% positive results, all the way up to psychiatry and psychology, where about 90% of published studies show statistical significance. What that means is that effect sizes from the shaded areas of the curve are far more likely to get published.
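To make the shaded-region point concrete, here is a small sketch, not the OSF code itself, of how the significance cutoff for the observed Cohen's d depends on power. It uses the standard relationship between d and the two-sample t statistic for equal group sizes, d = t * sqrt(2/n), and assumes an alpha of 0.05 and a true d of 0.5.

```r
# Sketch: the smallest observed Cohen's d that reaches p < .05 in a two-sample
# t test is roughly t_crit * sqrt(2 / n), so at low power (small n) the
# significance cutoff sits above the true effect size of 0.5.

crit_d <- function(power, true_d = 0.5, alpha = 0.05) {
  n  <- power.t.test(power = power, delta = true_d, sig.level = alpha)$n  # per-group n
  df <- 2 * n - 2
  qt(1 - alpha / 2, df) * sqrt(2 / n)   # approximate minimum significant d
}

crit_d(0.30)  # roughly 0.69: even the true d of 0.5 would not be significant
crit_d(0.90)  # roughly 0.30: effect sizes near the true d of 0.5 do reach significance
```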
And so what that does is this: if my studies have 30% power, I may run a study and get an effect size that falls in the non-shaded area, but if that result doesn't get published, it's going to be very hard for anybody to ever find it. The published studies, because they're more likely to show significant results, are mostly going to come from the shaded section of the curve, so the individual effect size estimates are going to be greater than the true value; they're going to be inflated. With 90% power, I can still get statistically significant effect sizes that are greater than 0.5, but I at least have the ability to publish effect sizes that are at, or even below, the mean, the true effect size out there in the population.

This does two things. It means that in the 30% power case, the statistically significant effect sizes are going to be greater than the true value, whereas in the 90% power case they might be above the true value, at the true value, or a little bit lower. It also tells us that if you were to average across all the published effect sizes, assuming we're in a field where the vast majority of published results are positive, the average effect size from the low-power distribution is going to be much greater than the true effect size, whereas the average published effect size from the high-power distribution is going to be much closer to the true effect size.

To give you an idea of what this looks like in graph form, this graph is based on a few different simulations. The true effect size out there in the population was again a Cohen's d of 0.5. For each of these different statistical power levels, I ran 10,000 simulations, pulling out 10,000 effect sizes, and then averaged across all the effect sizes within those 10,000 simulations that were statistically significant. So these averages represent the average reported Cohen's d, where by "reported" I mean, in this case, Cohen's d values associated with a statistical test where p was less than 0.05. What you notice is that when statistical power is high, at that 90% level, the average effect size is very close to the true effect size, almost at 0.5. But as statistical power goes down, the average effect size gets larger and larger, until it's all the way up over 1.75. So there's a huge inflation of the effect size as statistical power decreases.

Why is this a problem? It's a problem for two reasons. The first is that we'll tend to overestimate the effectiveness of our treatments or the differences between groups. Sometimes we're not that interested in the magnitude of the effect size, but at other times the magnitude is exactly what we care about, and it may change our behavior or influence decision-making; at that point it's very problematic if we're getting an overestimate of that effect size.
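Here is a minimal sketch of that kind of simulation, not the actual OSF script; the chosen power levels and the use of power.t.test to pick the per-group sample size are my own assumptions.

```r
# Sketch: at each power level, simulate many studies with a true d of 0.5,
# keep only the "publishable" (p < .05) results, and average their observed
# effect sizes. The average of the significant ds is inflated at low power.

set.seed(2)
true_d <- 0.5

avg_significant_d <- function(power, n_sims = 10000) {
  n <- ceiling(power.t.test(power = power, delta = true_d)$n)  # per-group n for this power
  ds <- replicate(n_sims, {
    x <- rnorm(n)
    y <- rnorm(n, mean = true_d)
    d <- (mean(y) - mean(x)) /
      sqrt(((n - 1) * var(x) + (n - 1) * var(y)) / (2 * n - 2))
    if (t.test(y, x, var.equal = TRUE)$p.value < 0.05) d else NA_real_  # keep significant only
  })
  mean(ds, na.rm = TRUE)
}

sapply(c(0.20, 0.30, 0.50, 0.90), avg_significant_d)
# the average significant d shrinks toward the true 0.5 as power increases
```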
Additionally, because people often use published studies to power future studies, if the effect sizes in the published literature are inflated, future studies will not actually be properly powered for the effect sizes that exist in the population, the ones they're really searching for. That means low power gets perpetuated across future studies, because the published studies are not a particularly good estimate of the true effect size.

The last thing I'm going to talk about is something called positive predictive value. The positive predictive value is the probability that a positive result, a statistically significant result, actually represents a true positive. If I get a significant result, is that effect really out there in the population? This is the equation for positive predictive value. It looks a little complicated, but basically what you need to know is that the one-minus-beta term is the statistical power of my study, alpha is my significance level, which in many fields is set by default to 0.05, and the odds term is the pre-study odds: the odds that my hypothesis was true to begin with, before I ever ran the study. Often we don't know the odds of our hypothesis being true when we do our study, but what we can do is compute the positive predictive value for a range of pre-study odds and look at how power affects it across that range.

That's what this graph shows. Across the whole range of pre-study odds, the lower your statistical power, the lower the probability that a statistically significant result represents a true effect out there in the population. Basically, this is telling you that if you run underpowered studies, it's difficult to draw conclusions even when your study finds statistically significant results, because low power makes it less likely that your significant result actually maps onto a true effect in the population. This can lead to wasted resources over time from following up on false positive studies: you get a significant result, you think, great, there's something really there, and you run a bunch of studies to follow it up. The lower the power, the more likely it is that you're just chasing noise rather than following up on an effect that really exists in the population.

Just to recap, I talked about three issues related to low statistical power. The first was the increased likelihood of false negatives: lower-powered studies mean you're going to miss more true effects. The second was inflated effect sizes: because of that positivity bias, the effect sizes reported in the published literature are likely to be larger than the effect really is. And the third was decreased positive predictive value: a statistically significant result from a low-powered study is less likely to reflect a true effect. All together, this means that low-powered studies make it harder to draw conclusions from positive or negative results, because it's difficult to figure out whether a positive result is a true positive or a false positive, and whether a negative result is a false negative or a true negative. So you really don't get all that much conclusive information out of a low-powered study taken on its own.
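In its standard form, the equation on the slide works out to PPV = (1 - beta) * R / ((1 - beta) * R + alpha), where R is the pre-study odds. A tiny R sketch of that calculation, where the example odds of 1:4 are my own arbitrary illustration:

```r
# Sketch of the positive predictive value calculation described above:
# PPV = (power * R) / (power * R + alpha), where R is the pre-study odds
# that the hypothesis is true and alpha is the significance level.

ppv <- function(power, odds, alpha = 0.05) {
  (power * odds) / (power * odds + alpha)
}

# Example (1:4 pre-study odds, i.e. odds = 0.25, chosen only for illustration):
ppv(power = 0.90, odds = 0.25)  # ~0.82: at high power, a significant result is fairly trustworthy
ppv(power = 0.30, odds = 0.25)  # ~0.60: at low power, the same significant result is much weaker evidence
```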
As with all our videos, the slide deck for this presentation, as well as the R code used to create some of the graphs I showed, is available on the OSF, so you can download the materials and play around with them. For this particular project, if you go to our main website and click on "Consequences of underpowered studies," you can get to the section with all the materials. I'll also be adding a section with suggestions for further reading if this is something you're interested in and want more information from articles that are particularly informative on this topic. And as always, if you have any questions, comments, or ideas for videos that you think would be particularly helpful, please don't hesitate to contact us. You can email us at stats-consulting@cos.io or contact@cos.io, or feel free to comment on the YouTube videos or in the comments section on the OSF. Thanks.