Hello and welcome back to the Sports Biomechanics lecture series, as always supported by the International Society of Biomechanics in Sports and sponsored by Vicon. I'm Stuart McErlain-Naylor from the University of Suffolk, and today I'm really lucky to be joined by Kristin Sainani, who is an associate professor at Stanford University. She's also a statistical editor for the journal PM&R (Physical Medicine and Rehabilitation), and she authors a statistical column for that journal called Statistically Speaking, for which the link is down below in the description. When I was planning this series, Kristin was part of my dream list of speakers, so I was really pleased that she agreed to do a talk for us on statistics within sports science. I've heard a little bit about what she plans to present and I'm really looking forward to it; hopefully it's something we can all benefit from, given how important this topic is for everything involved in reading and conducting research. If you've got any questions, please pop them in the live chat and I'll direct them to Kristin at the end. Without further ado, thanks very much for joining us Kristin, over to you.

Thanks, Stuart. Thanks so much for inviting me to speak. Today I'm going to do a nuts-and-bolts lecture: I'm going to review some fundamental concepts in statistics that I'm sure you've all seen before, but I'm hoping to present them in a slightly different way than you may have seen them, in order to help you get a deeper understanding of these concepts. For today's lecture, I'm going to be using simulated data from a hypothetical trial. Imagine that we have a randomized, placebo-controlled trial in competitive male distance runners. We randomly assign 10 of them to drink a placebo drink for four weeks, and 10 of them to drink cherry juice as a recovery drink for four weeks. The outcome is their performance in a 5K time trial. And imagine we had them run the time trial before the intervention as well, so we can adjust for any baseline differences that might arise in a small trial. I'm showing you the results here; these are simulated results, not real data, and you can see them plotted on the right. You can see the treatment group did a bit better than the control group: imagine the mean of the treatment group was a time of 14:45 and the mean of the control group was 15:05, and those again are adjusted for any baseline differences between the groups. That means there is a difference in means of negative 20 seconds, and this is the statistic we're interested in. Negative numbers mean that the treatment group ran faster than the control group, and in running you want to be negative, meaning a lower time. The standard deviation in running times within the groups here is 31 seconds, which means that, if you look at the treatment group, runners were on average roughly 30 seconds away from the mean of 14:45. The other three numbers here, the standard error, the confidence interval and the p-value, are inferential statistics, and those are the numbers I'm going to try to unpack for you today. So let's step back for a minute and talk about what we even mean by statistical inference: statistical inference is the process of making guesses about the truth from a sample.
So we have some unknowable truth, some number in a large population that we want to know, and we call that number the population parameter. That is the true number in the whole population. For example, in the US we have an upcoming presidential election in November, and we'd all really like to know what percentage of US voters plan to vote for Joe Biden. Unfortunately we can't actually know that until election day, because we can't ask everybody until election day. So what do we do instead? Well, we take a sample of US voters, maybe 1,000 of them, and we calculate the number we're interested in within that sample: we figure out the percent of our sample that's planning to vote for Joe Biden. We then use that number in the sample, which is what we call a sample statistic, to make some guesses about the population parameter. We know of course that our guess is going to have some uncertainty because we haven't measured everybody. Even if we do the perfect study, with great random sampling, no measurement error, and nobody lying to us, our estimate is still going to have some uncertainty. That's just because we haven't measured everybody, and so we're going to get some sampling error, which just means that in a given sample we might get a few extra Joe Biden voters or a few fewer. A lot of the business of statistics is trying to quantify that uncertainty in our guess.

A key measure that we use to quantify that uncertainty is something called the standard error, and I'm sure you've all heard that term before. But when I talk with young scientists and students who may have had a few courses in statistics, if I ask them what standard error is, I find that most people don't really have a good concept of what standard error actually means. So I thought it might be useful today, because this is such a fundamental and important concept in statistics, to go into a little bit of depth and unpack what standard error is for you.

If we're being very precise, the standard error is a measure of the variability of a sample statistic. And this is actually a really hard concept to get your head around. Think back to that cherry juice trial I was talking about. In that trial, the statistic we were interested in is the difference in means, and we did the trial and we got a value of negative 20 seconds. And now I'm telling you that that sample statistic has variability. Well, what the heck do I mean by that? There's one number. It's negative 20. It has no variability. So what do I mean when I say that that sample statistic has variability? What you have to understand is that a standard error is a completely theoretical construct. It's something we get from statisticians sitting around doing a thought experiment. They ask: how much would the value of my statistic fluctuate if I could somehow repeat my study over and over again with different samples of the same size? If you could somehow repeat your cherry juice trial with different samples of 20 male competitive distance runners, many, many times, and you could look and see how much that difference in means bounces around from sample to sample, well, then you would know how much uncertainty there was in your guess.
Unfortunately, of course, we don't have the resources to actually go out and repeat our experiment many, many times. So what we do is figure out what the standard error would be in a theoretical way. We use math, or computer simulation, as I'll show you in a minute.

So there are two ways to calculate the standard error. Commonly, we just use math: statisticians have derived formulas for the standard errors of common statistics, theoretically. For example, if you were interested in a proportion, like querying US voters about who they plan to vote for, you might calculate a proportion at the end of the day. The formula I'm showing you here is for the standard error of a proportion. Statisticians have already figured out that formula, so we don't need to re-derive it every time; we can just use it. Or if you're talking about an odds ratio, the standard error of an odds ratio is a different formula. Or the difference in means, that's another formula again. So there isn't just one standard error; there's a different standard error formula for every different type of sample statistic. Sometimes people think standard error is just the standard deviation divided by the square root of n. Well, that's the standard error for a mean, but there are many, many standard errors. We already have these formulas for all the common statistics we would use, so when we go to do data analysis, we just rely on these formulas.

However, there is another way to calculate standard error, which is computer simulation. And when I teach about standard error, I like to rely on computer simulation because I think it's a lot more intuitive. If I show you pictures from a computer simulation, that's a lot more intuitive than showing you a bunch of mathematical equations. So I'm going to illustrate everything today using computer simulation. In a computer simulation, we can literally repeat the experiment over and over again; we just do it virtually. We can actually see what happens if we run our cherry juice trial many, many times, and actually observe the fluctuation in the statistic.

So what I did was set up a simulation to mimic our hypothetical cherry juice trial. In fact, I used the simulation to generate the simulated data for that trial. And it goes like this. In my computer, I set up a virtual world and assume there's an unlimited population of male, competitive distance runners. For the purposes of this particular simulation, I'm going to assume that there is no effect of cherry juice: it does nothing to running performance; it doesn't help you and it doesn't hurt you. So I have an unlimited population of treated runners, who have all drunk the cherry juice, and an unlimited population of control runners, who haven't, but it doesn't actually matter: their running times have the same distribution. I'm going to assume that 5K times in this population follow a normal distribution with a mean of 15 minutes and a standard deviation of 30 seconds, and that's true for both groups. I'm then going to run a virtual trial in my computer: I'm going to sample 10 treated runners and 10 control runners.
And what that means in the computer is that I'm going to randomly generate 10 times from the distribution I just described; those become my treated runners. Then I'm going to randomly generate another 10 times for my control runners. On average, treated and control runners have the same time. But in any given sample, when I'm only sampling 10 and 10, I might get a few really fast runners in my treated group just by chance, and a few really slow runners in my control group just by chance. So I might actually observe a pretty big difference in the means of the groups in any given sample. So I do this, I calculate the statistic I'm interested in, which here is the difference in means between the groups. I do that once in my computer, but because this is virtual, I can do it as many times as I want. So I then repeat this a large number of times: 100,000 times is what I chose for this particular simulation. That's a totally arbitrary number; I could have done it a million times, or a thousand times. I chose 100,000 just because it's a big number that still runs quickly on my computer, in only a few seconds. So I can have the results of 100,000 virtual trials and observe how much my difference in means actually bounces around from trial to trial.

To make this a little more concrete, I'm going to start by showing you the results of the first 10 virtual trials, the first 10 runs of my simulation, and I've graphed them over here so that you have a visual. What we're looking at is the difference in means from each one of these virtual trials. For example, in the first trial I get a difference in means of negative 9.1 seconds; negative numbers mean that the treatment group did better than the placebo group. So I go over here and plot this little dot at negative 9.1. In the second trial I got negative 10.7, which ends up here. In the third trial I got positive 2.3, meaning the placebo group did a little better. In the fourth trial I get negative 22, which is very similar to what we saw in that hypothetical cherry juice trial. And you can see that the red dots bounce around zero. Zero is the true value here, because I built my virtual world so that cherry juice does nothing, so zero is the true difference in means. And you can see that these red dots actually bounce around a lot. The amount that they bounce around zero, well, that is what we mean by standard error. Standard error is a measure of how much these red dots fluctuate around zero; roughly, it's the average distance of those red dots from zero. You can kind of eyeball it here: if you had to guess the average distance of the red dots from zero, it's probably somewhere around 10 to 15. That's roughly what our standard error is going to be, somewhere between 10 and 15; that's just a rough estimate. I then ran this simulation 100,000 times, giving 100,000 red dots. Now, if you plot 100,000 red dots on this particular plot, it's a little hard to see much because there's a lot of red overlapping. One thing you can pick up from this plot, though, is that very occasionally we got differences between the groups of 50 or 60 seconds, but that was actually pretty rare, and we get a lot of smushed-together red in the middle.
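(If you want to try this yourself, here is a minimal sketch in Python with numpy of the kind of simulation being described. The speaker's own code isn't shown in the talk, so the variable names and details here are illustrative, not her actual implementation.)

```python
import numpy as np

rng = np.random.default_rng(1)          # arbitrary seed, just for reproducibility
n_trials, n_per_group = 100_000, 10
mu, sd = 900.0, 30.0                    # 5K times: mean 15:00 = 900 s, SD 30 s, same in both groups

# each row is one virtual trial: 10 treated and 10 control times drawn from the same distribution
treated = rng.normal(mu, sd, size=(n_trials, n_per_group))
control = rng.normal(mu, sd, size=(n_trials, n_per_group))

# the sample statistic of interest: the difference in group means (treated minus control)
diff_means = treated.mean(axis=1) - control.mean(axis=1)

print(np.round(diff_means[:10], 1))      # the "first 10 virtual trials"
print(round(diff_means.std(ddof=1), 1))  # how much the statistic bounces around zero (~13.4 s)
```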
You can do a little better in the visualization with what's called a density plot. So I've changed this to a density plot, where the hot areas in the middle represent where we've got lots of dots, and you can see that, because there's actually no difference between the groups, we cluster around zero. This is probably better visualized, though, in a histogram, so what I'm going to do now is show you these data in a histogram. This is a histogram of the 100,000 differences in means that I got from my 100,000 virtual trials. The x-axis here is again the difference in means: negative numbers mean the treatment group ran faster, positive means the control group ran faster. What a histogram shows you is just the frequency with which different values of the difference in means occurred across all of my trials. And you can see that we land on a beautiful normal curve: by taking a big number of repeats for my simulation, I get this nice normal distribution. It's centered around zero, and that's expected, because of course the true value is zero, so we get a lot of results clustered around zero. However, you'll notice that just by chance, sometimes I got differences as big as 25 seconds between the groups, 30 seconds, even 50 seconds. These arise just due to chance fluctuation. And remember, in my cherry juice trial I got a difference of 20 seconds between the groups. Well, actually, that's not that hard; that can happen just by chance fluctuation pretty easily.

If we want to calculate the standard error here, we have to remember that the standard error is the average distance of these statistics from zero. Well, that's just the standard deviation of this normal curve. So I took my 100,000 values in a data set in my computer and calculated the standard deviation of those 100,000 values; it comes out to be precisely 13.4. That's the average distance from zero here. We are now talking about the variability, the standard deviation, of a statistic. Remember, the difference in means is a sample statistic, and when we talk about the variability of a statistic, it gets its own special name. We're not talking about the standard deviation of a trait, like how variable running performances are; we're talking about the variability of a statistic, and that gets its own name: a standard error. So that's what the name standard error means; it's specifically the variability of a statistic. The standard error here comes out to be 13.4. I've calculated that through my simulation, but I could also calculate it mathematically, so let me show you the mathematical formula, which somebody has already come up with. When I apply it here, I take the standard deviation in running times, which was 30 seconds, and square it to get 900. I divide by the sample size, and I have to do this for both groups because I have two groups. When I calculate this number, I also get 13.4. That's good: we hope we get the same answer from both approaches, and indeed we get 13.4.

What I did next was change my simulation just slightly. I'm using the same simulation as before, but now I'm changing one parameter: I'm increasing the sample sizes in the groups, going from 10 per group, a pretty small study, to 50 per group.
I'm sure you can all guess what's going to happen to the standard error now, right? I'm sampling more people, which means there's going to be less uncertainty in my guess, so I would expect the standard error to become smaller. And that's exactly what happens. Here is the result of the first 10 virtual trials of this simulation, and what you'll notice immediately is that the red dots are much more tightly clustered around zero. We still bounce around zero, but with a lot less fluctuation. That means the standard error is now smaller, as we would expect. I get values like three, two, six, four, tightly clustered around zero; the average distance from zero is going to be smaller. We can compare this visually to when we had 10 per group, and with 50 per group you can see the difference: it's much more tightly clustered around zero. I then plotted this in a histogram so we can look at all 100,000 trials, and what you'll notice is that we're still on a beautiful normal curve, still centered around zero, but the curve is now much, much narrower. The standard deviation of this normal curve is 6. It's a much narrower normal curve, reflecting the fact that our average distance from zero is smaller, that there's less fluctuation in our difference in means. Our standard error is now 6. We can also calculate this mathematically, just to show you that you get the same answer if you use the formula.

All right, I then made one more little tweak to my simulation. I'm going back to the simulation where I had just 10 per group, and I am changing just the standard deviation in running times, that is, how variable running times are in this population. I started with a standard deviation of 30 seconds; I'm doubling it to 60. You can imagine that if we're talking about the entire population of male competitive distance runners, that's actually a really broad population, and so the standard deviation might be quite a bit bigger than 30 seconds. So let's say it was 60 seconds. What do we think is going to happen to the standard error? You can probably guess that if you've got more variability in the thing you're measuring, it bounces around more, you're going to get more chance fluctuation, and so your standard error is going to go up. And indeed, that's exactly what happens. As before, we're on a normal curve centered around zero, but you can see this is a much wider normal curve. Just by chance in these virtual trials, we sometimes saw differences as big as a minute between the groups, even 80 seconds, even 100 seconds. This is purely due to chance fluctuation, getting a few fast runners in one group or a few slow runners in the other. The width of this normal curve, its standard deviation, is 26.8; we've actually doubled the standard error. And again, you get the same answer if you use the mathematical formula.

All right, so we've learned a few things now. Standard error is a measure of the variability of our sample statistic: how much would the sample statistic bounce around if we could somehow repeat our experiment many times? And standard error depends on two components: it depends on sample size, and it depends on the variability of our outcome.
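(Those three scenarios are quick to check with the formula for the standard error of a difference in means. This little helper assumes two equal-sized groups, as in the examples.)

```python
import math

def se_diff_means(sd, n_per_group):
    """Standard error of a difference in means for two equal-sized groups."""
    return math.sqrt(2 * sd**2 / n_per_group)

print(round(se_diff_means(30, 10), 1))   # 13.4 — the original scenario
print(round(se_diff_means(30, 50), 1))   # 6.0  — bigger sample, smaller standard error
print(round(se_diff_means(60, 10), 1))   # 26.8 — more variable outcome, bigger standard error
```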
When sample size is smaller, or when variability is higher, those two things drive a bigger standard error, reflecting the fact that they cause more fluctuation, more uncertainty, in our sample statistic.

I'm now going to make a slight technical diversion here, if you'll permit me three slides. If you've ever wondered, or been a little confused, about the difference between a t-curve and a normal curve, I'm going to throw that in here, because later in my talk I'm going to start talking about t-distributions, and you might wonder how we got from a normal distribution to a t-distribution. So bear with me for three slides on a little technical diversion. Up until now, when I've been calculating standard error in my simulation and in my math formulas, I have plugged in one value for the standard deviation. I have known, because I set up the simulation, what the true standard deviation in running times was; I used a value of 30 in all of my calculations. Imagine, though, that when we do a study we don't know the standard deviation ahead of time. We just get some data, and we don't know what the standard deviation in running times is, so we calculate it from our sample. Of course, because we're calculating it from our sample, it's only an estimate of the true standard deviation. And in order to calculate the standard error, we need that standard deviation. Because we are estimating the standard deviation from our sample, that means we are also only getting an estimate of the standard error. This is going to add a little bit of uncertainty to our inference about the means. For example, remember back to that cherry juice trial; I actually generated the data for that trial from the simulation I just described to you. I happen to know, because I set up the virtual world, that the exact standard deviation in running times is 30. But when I ran this one virtual trial, I saw in my sample a standard deviation of 31. I'm a little bit off, and when I plug in my 31 to calculate the standard error, I'm estimating the standard error, and it comes out to be 14, not 13.4. So it's always going to be a little bit off. That adds just a little bit of uncertainty to our inference, and we're going to have to do something to capture that added uncertainty.

To illustrate this a little further, I'm now showing you the results of 20 virtual trials from my simulation, for the case with 10 per group and also the one with 50 per group. I'm not showing you the difference in means anymore; I'm now showing you the standard error as estimated just from the data within each individual virtual trial. For example, in the first trial I estimated the standard error to be 10, which means my standard deviation came out a little lower than 30 in that particular sample. You can see there's some uncertainty in my estimate of standard error. The true standard error here was 13.4, and I'm bouncing around that. This adds a little bit of uncertainty to my inference, and it's particularly important when I have a small sample size. What I want to point out, though, is that if I have a larger sample size like 50 per group, when I estimate the standard error in each of these individual virtual trials, it doesn't bounce around much; it's really, really close to the true standard error here of 6.
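(Here is a small sketch, again in Python with numpy and with illustrative names, of that per-trial estimated standard error; the point is how much more it wobbles with 10 per group than with 50 per group.)

```python
import numpy as np

rng = np.random.default_rng(2)          # arbitrary seed
mu, sd, n_trials = 900.0, 30.0, 100_000

def estimated_se(n_per_group):
    treated = rng.normal(mu, sd, size=(n_trials, n_per_group))
    control = rng.normal(mu, sd, size=(n_trials, n_per_group))
    # plug the *sample* standard deviations into the standard-error formula, trial by trial
    return np.sqrt(treated.var(axis=1, ddof=1) / n_per_group +
                   control.var(axis=1, ddof=1) / n_per_group)

print(np.round(estimated_se(10)[:10], 1))   # bounces noticeably around the true value of 13.4
print(np.round(estimated_se(50)[:10], 1))   # stays very close to the true value of 6
```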
And so it actually doesn't make a big difference when we're talking about sample sizes of 100 or so; this really matters in small sample sizes. How does this affect my inference? Up until now I've been showing you that when I look at the distribution of the difference in means, I'm on a perfect, beautiful normal curve with a standard deviation of 13.4. If I take my difference in means and divide by a constant, I stay on the exact same shape as before; I'm still on a perfect normal curve. However, if instead of dividing by 13.4 I divide by the estimated standard error from each sample, I don't stay on a perfect normal curve; I end up on a different shape. So in my simulation, I divided the difference in means by the standard error I saw in each virtual trial, and I want to flash back and forth here to see if you can spot the difference. It's quite subtle; you've got to look at the tails of the distribution. If I go back and forth and you focus your eyes on the tails, what you'll notice is that you end up with a little more area in the tails of the distribution when you're using the estimated standard error. What's actually happening is that we're ending up on something called a t-distribution. A t-distribution looks a lot like a normal distribution, but it has slightly fatter tails, reflecting that added uncertainty. The shape of the t-distribution changes depending on your sample size, and as your sample size gets bigger, it looks more and more like a normal curve. But when you have a small sample size, like just 10 per group, you end up with a noticeable difference in the tails. What that means is that when we go to do inference, we're going to need to increase the width of our confidence intervals a little bit, and increase our p-values a little bit, to reflect this added uncertainty due to the fact that we've only estimated the standard error. So when I start talking about t-distributions later, this is where that comes from.

All right, so now we understand standard error; we understand the uncertainty in our guess. How can we use that value? Well, if we understand the uncertainty in our guess, that allows us to put an appropriate margin of error around our guess, and that's the high-level goal of a confidence interval. I want to talk now about the precise definition of a confidence interval, which is this: if you could again somehow repeat your experiment many, many times, and you calculated a confidence interval in each one of those experiments, say a 95% confidence interval, we want to build it in such a way that in 95% of our repeated experiments we overlap the true value, we catch it in our confidence interval, but we allow that we can miss it 5% of the time. This is again a hard concept to get your head around without a picture, so I really like this dynamic simulation that was built by Kristoffer Magnusson. I'm actually going to switch over to the internet now to show you this dynamic simulation; it's a bit cooler than the static simulations I've been showing you. It'll take me a minute to switch my screen share, but I'm going to switch over to the internet now and hope that you can all see this dynamic simulation.
This is just a wonderful simulation, which I recommend you all go and play with after this talk. I use it a lot in teaching because, as you can see, it's super cool: it is running a simulation in real time. So unlike my static pictures, you can actually see each virtual trial being run here. What we are looking at are virtual trials in which we're calculating a sample mean. The true mean is zero, this dotted orange line. Every time you see a blue dot thrown down, that's a new virtual trial, and the blue dot is the mean in that sample; the line through it is the confidence interval. I want you to focus your eyes for a minute just on how much these blue dots are bouncing around zero. As we talked about before, that's an illustration of the standard error. Now look at the confidence intervals and notice what's happening with them: most of the confidence intervals are crossing zero, our true value; they contain the true value. If we have properly built a 95% confidence interval, that means that 95% of the time, about 19 out of 20 of these simulations, we're going to cross the true mean, and about one out of 20 we're going to miss it. The simulation colors the confidence intervals red when you miss. So roughly one in 20 times we're missing zero, and 19 out of 20 times we're catching zero. And there's a counter over here that shows you our coverage in the long run; right now we're at a perfect 95% coverage, and in the long run that's going to be 95%.

This simulation is great because you can play with some of the parameters in real time. I can increase the sample size of my samples: I'm going to crank it up from five to 20. What you can see happening now is that those blue dots cluster more tightly around zero, which means the standard error has shrunk. You'll also notice that the confidence intervals are now much narrower. Because there's less uncertainty in our guess, I can put a smaller margin of error around my guess and still catch the true mean 95% of the time and miss it 5%. All right, now I'm going to go back to my original sample size. The other thing you can alter in this simulation is the confidence level. For example, we can go down to a 90% confidence interval rather than a 95% confidence interval, and what you'll notice is that all the confidence intervals shrink; they've gotten narrower. That's because now I just want to capture the true mean 90% of the time, and I'm okay with missing it one in 10 times, so I can make them narrower and still accomplish that. However, I can also crank this up to a 99% confidence interval, and you see that the confidence intervals get really wide. A 99% confidence interval means I want to catch the true mean 99% of the time and miss it only one out of 100 times, which means it's going to take a while before we see one of these red ones, and so I have to make the intervals much wider to accomplish that goal. All right, I'm going to switch back to my slideshow now; hopefully that worked. Anyway, that's just a really cool dynamic simulation, and it's a lot of fun to play with.
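(If you'd like to recreate the gist of that demo offline, here is a rough sketch in Python, assuming numpy and scipy; it uses the demo's starting sample size of 5 and an arbitrary standard normal population, builds a t-based 95% confidence interval in each virtual sample, and counts how often it covers the true mean.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)                 # arbitrary seed
n, n_trials, mu, sd = 5, 10_000, 0.0, 1.0      # small samples from a population whose true mean is 0
t_crit = stats.t.ppf(0.975, df=n - 1)          # ~2.78 for n = 5

covered = 0
for _ in range(n_trials):
    sample = rng.normal(mu, sd, n)
    se = sample.std(ddof=1) / np.sqrt(n)       # estimated standard error of the sample mean
    lo, hi = sample.mean() - t_crit * se, sample.mean() + t_crit * se
    covered += (lo <= mu <= hi)

print(covered / n_trials)                      # long-run coverage, close to 0.95
```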
So how do we actually calculate confidence intervals, in order to accomplish that goal of catching the true mean the right proportion of the time? We take the point estimate, the value we see in the sample, like the negative 20 seconds in the cherry juice trial, and we add a margin of error that's composed of the standard error times a value that determines the confidence level. For a 95% confidence interval, that value is usually about two; we're usually just multiplying the standard error by about two. More precisely, the values we use for building the confidence interval correspond to the confidence levels: typically we use 1.96 for a 95% confidence interval, 1.64 for 90%, where we get to be narrower, and 2.58 for 99%. Where do these values come from? They just come off of a standard normal curve. So if your statistic is normally distributed, you get to use the values from a standard normal curve. As I mentioned earlier, though, if we're talking about statistics about means and we have a small sample, there's the twist that we actually end up on a t-distribution and not a normal curve, and so we have to use slightly different bounds for our confidence intervals. Instead of using 1.96 to build a 95% confidence interval, I'm using 2.1. I have to make my confidence interval just a hair wider in order to make sure that I catch the true mean 95% of the time, because of the added uncertainty due to the fact that I only estimated the standard error. So we have to be just a little bit wider; it makes a small difference in small samples.

What I did next was take my simulation, where I showed you before the difference in means from each virtual trial, and also calculate a confidence interval using the data in each virtual trial. So in each virtual trial, I calculated a confidence interval: I take the difference in means from the virtual trial, which again is bouncing around a bit, and I do plus or minus 2.1 times the standard error estimated in that virtual trial. Those standard errors are also going to bounce around a bit, because we're only estimating them from each trial, which means the widths of the confidence intervals are going to vary, as you can see in the next picture. I'm going to show you 20 confidence intervals from 20 virtual trials, plotted here. The blue ones overlap zero; the red one missed zero. The blue dot represents the difference in means I saw in the trial, which I've shown you before, but now I've added the confidence interval from that trial. And I missed the true value one in 20 times. Well, that's good; that's exactly what we want to see with a 95% confidence interval. I want to catch the true mean 95% of the time, but I get to miss it one in 20 times. Just for fun, because it makes a pretty picture, I then plotted 100 confidence intervals; it's starting to get a little cramped on the graph, but you can see that I missed the true mean 1, 2, 3, 4, 5, 6 times out of 100, so I get 94% coverage here. In the long run, this will be 95% coverage.

I then took the data from my hypothetical cherry juice trial and calculated the confidence interval for that specific trial. I saw a difference in means of negative 20, and I do plus or minus 2.1 times the standard error I saw in that trial, which was 14. That gives me a confidence interval of negative 49 to positive 9.
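(In code, with scipy's t-distribution doing the work of the t-chart, and assuming the usual 10 + 10 − 2 = 18 degrees of freedom for a two-sample comparison:)

```python
from scipy import stats

diff, se, df = -20.0, 14.0, 18                   # observed difference, estimated SE, 10 + 10 - 2 df
t_crit = stats.t.ppf(0.975, df)                  # about 2.10, a hair wider than the normal-curve 1.96
lo, hi = diff - t_crit * se, diff + t_crit * se
print(round(t_crit, 2), round(lo), round(hi))    # roughly -49 to +9 seconds
```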
So how do I interpret those results? What does that confidence interval actually tell us? It tells us a number of things; you can get a lot of pieces of information out of a confidence interval. For one thing, it tells us that effects anywhere from a 49-second improvement in running times all the way up to a 9-second worsening in running times are all plausible effects. It also tells us something about how precisely we've estimated our effect. We have a really wide confidence interval here, which reflects the fact that this is an imprecise estimate, and you're going to get imprecise estimates when the standard error is high: when sample size is small or variability is high.

Confidence intervals also tell us something about hypothesis testing. Just by looking at a confidence interval, there's a limited amount of information you can get about hypothesis testing, because hypothesis tests and confidence intervals have a one-to-one correspondence. So just by looking at the confidence interval, I can do a quick test of a null hypothesis. For example, what if I wanted to test the null hypothesis that the effect size is zero, meaning that cherry juice does absolutely nothing to running times? Zero is contained in that 95% confidence interval, so I know that if I run that null hypothesis test with a null of zero, I am not going to be able to reject that null at a p of less than 0.05. The 0.05 corresponds to the fact that this is a 95% confidence interval. We're not limited, though, to testing a null value of zero; there's no reason we can't test other null values. So let's say I wanted to test the null value of positive five, a 5-second worsening in running times, because for some reason that's the question I'm interested in. I know just by looking at the confidence interval that, since 5 is within that interval, I'm not going to be able to reject that null value at a significance level of 0.05 either. If I run that hypothesis test, my p-value is going to be something higher than 0.05. I can't, just by looking at the confidence interval, say what the exact p-value is; I could with some more math, but at least I know the p-value is going to be higher than 0.05. Similarly, if I wanted to test the null that there's a 5-second improvement in running times, an effect size of negative 5: well, negative 5 is also in the confidence interval, so I know I'm not going to be able to reject at a level of 0.05, and my p-value is going to be higher than 0.05. So there's some limited information about hypothesis testing that you get just from looking at the confidence interval.

All right, so that brings us to our discussion of p-values and hypothesis testing. When we were building confidence intervals, we used the standard error to help us build an appropriate margin of error around our guess for the true effect size. When we're talking about p-values, we're also going to use the standard error, but we're going to use it to help us distinguish real effects from chance fluctuation. And I want you to recognize that the goal of a p-value is a much narrower goal; it does a lot fewer things than a confidence interval does. Precisely, the meaning of a p-value is: it's the probability of the observed outcome, and everything more extreme, if the null hypothesis is true. Now that is a very wordy description, and it's really hard to get your head around just by listening to those words.
So I'm going to try to illustrate this in a minute in pictures, but focus on the fact that it's a probability about our data, about what we see in our data. We're going to actually calculate a p-value. To calculate a p-value, we're going to use all the same elements that we used in calculating a confidence interval; that's why there's a one-to-one correspondence between them. We take the sample statistic, the value that we observed in our sample, like the negative 20 seconds in the cherry juice trial. We're then going to subtract the null value. Now, when we built confidence intervals, we did not have to specify a null value, but to calculate a p-value, we do. We then divide by the same standard error as before. Remember, this standard error represents the amount of fluctuation that we expect in our statistic just due to chance. This gives us a t-value; high t-values correspond to small p-values. Conceptually, what we're doing is taking the value we saw in our sample and seeing how far it is from the null value. If that distance is large enough, if it exceeds the amount of fluctuation we expect just due to chance by enough, then we're going to get a high t-value, and that's going to correspond to a small p-value. What that means is we're going to reject our null value. We're going to look at our data and say, gee, our data are really discordant with the null value, really far from what we'd expect even accounting for the chance fluctuation. Our data don't seem to jibe with the null value, so we're not going to reject our data; we're going to reject the null value, and say that we have evidence against it because it doesn't match our data, it's discordant with our data. That's the general idea.

Let's put some numbers to that. Going back to the hypothetical cherry juice trial, a common null hypothesis we might want to test is that the effect size is zero; this null hypothesis means that cherry juice does absolutely nothing. So let's say we set up our trial to test this null hypothesis: the effect size is zero, cherry juice does nothing. To test this hypothesis using our data, we're going to run just a standard t-test; you're all probably familiar with t-tests. And again, all that we're doing is taking the observed minus the null, to see how far apart they are, and dividing by the standard error, our amount of chance fluctuation. When we do that here, we take negative 20, subtract the null value of zero, and divide by 14. I get a t-value of negative 1.4, and I'll show you in a minute that this corresponds to a p-value of 18%. What does that p-value tell us? It tells us that if cherry juice does absolutely nothing, with no effect on running, then the probability that we would see what we saw in our cherry juice trial, or something more extreme, farther from zero, a bigger difference between the groups, is actually 18%. In other words, it's totally plausible that we would have seen an effect like this. It's not that surprising, not that exciting, that we would have seen a 20-second difference between our groups; it would happen 18% of the time if we could repeat our experiment over and over again under the null.
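(The same arithmetic in a few lines, again assuming 18 degrees of freedom for a standard two-sample t-test with 10 per group:)

```python
from scipy import stats

diff, null, se, df = -20.0, 0.0, 14.0, 18       # observed difference, null value, estimated SE, 10+10-2 df
t = (diff - null) / se                          # about -1.4
p_two_sided = 2 * stats.t.cdf(-abs(t), df)      # both tails: about 0.17-0.18
print(round(t, 2), round(p_two_sided, 2))
```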
So let me now illustrate that visually. I ran my simulation again, and I've got my 100,000 virtual trials. I took the data from each trial, ran a t-test in each one, and calculated a t-statistic, and I've plotted the histogram of those t-statistics here. In my simulation, I set it up so that the null was true; remember, in my simulation cherry juice does nothing, the true value is zero. So this shows us how much of a difference between the groups we expect to see just due to chance fluctuation. Well, how often in my trials did I get t-values of negative 1.4 or smaller, farther from zero? It actually happened 9% of the time. So it wasn't that surprising. And in fact, I also got t-values of positive 1.4 or bigger, farther from zero, 9% of the time; those are cases where the control group beat the treatment group by that amount. When we're calculating the p-value, we're looking at our observed outcome and everything more extreme, and that includes more extreme in both directions. So this is what we call a two-sided p-value, and it comes out to be 18%. In other words, our data are totally consistent, concordant, compatible, however you want to express it, with the null hypothesis. It's not surprising that we would see a difference this big.

Formally, if we write out the steps of a hypothesis test, the null is that the effect is zero, and our alternative here is that the effect is not equal to zero, just that cherry juice does something to running performance in either direction. When we do a formal hypothesis test, we have to set an alpha level, or significance level. A commonly chosen alpha level is 5%. You are not stuck with that level, but I'm going to go with it for the moment because it's familiar to people. Clearly in this trial we're not going to meet that threshold; our p-value is much higher than 5%. So what does that mean? Well, it means we're not going to reject the null hypothesis. We are not convinced that our data are incompatible with the null hypothesis. We don't have evidence that cherry juice affects performance. That's not a very satisfying result, because all we're saying is that we don't have evidence that cherry juice affects performance. We're not going to conclude anything about whether the null hypothesis is true or false; we're just going to say we don't have evidence that cherry juice affects performance. But that's all we can get out of these results.

I want to say a little more now about the choice of that alpha level, the significance level. P less than 0.05 is the common threshold that lots and lots of people choose, and it's probably overused. I want you to understand that nothing magical happens at 0.05. When you go from 0.049 to 0.051, nothing magical happens; 0.049 and 0.051 are virtually identical. We use this as a threshold, but realize that there's nothing magical at 0.05. Using a threshold of 0.05 actually gives us a fairly weak level of evidence, too. I think people think the result must be really robust if you get p less than 0.05; really, it's not a very high level of evidence. And p less than 0.05 is not the only choice either; you can set your alpha level at other things, you could choose 0.01, for example. Let's talk just a little bit more, since we've got the simulation going, about what an alpha level of 5% actually means. What does that alpha choice actually get me? What does it tell you?
What I've done now is I've taken my simulation, with its 100,000 virtual trials, run 100,000 t-tests, and generated the p-values from those virtual trials, and I'm showing you the histogram of them here. When I looked at the differences in means, they followed a nice normal distribution. When I looked at the t-statistics, they followed a t-distribution. When I generate the p-values, however, I get a uniform distribution. If the null hypothesis is true, as in my simulation, when there's no difference between the groups, it's equally likely that I will get p-values anywhere from 0 to 1; we call this a uniform distribution. What that means is that if I choose an alpha level of 5%, and the null hypothesis is true, then just by chance I'm going to get p-values between 0 and 0.05 five percent of the time. So if the null hypothesis is true, I'm going to have a 5% chance of making a false positive: of incorrectly rejecting the null when I shouldn't and concluding that cherry juice has an effect when it doesn't. Whatever you set that alpha level to, that determines what we call your type 1 error rate. That is, if the null hypothesis is true, I'm going to have a 5% chance of making a false positive error.

Up until now, we've been focusing on simulations where there was no effect of cherry juice, where the null hypothesis is true. Of course, I can also run a simulation where cherry juice actually does something, where the null hypothesis is false. So I went back to my simulation, and I'm just reminding you here what that simulation looks like, and I made a change so that the effect size is equal to negative 10 seconds. In other words, cherry juice truly improves running times by 10 seconds. The way I do that in the simulation is I set the mean running time for the control group at 15 minutes, but I set it at 14:50 for treated runners, so there's a 10-second improvement due to the cherry juice. I then run my 100,000 trials, and this time I'm outputting the p-value from each trial, not the difference in means. Let's look at the resulting distribution of p-values. Here are 100,000 p-values from my 100,000 virtual trials. What you'll notice right away is that my p-values are now shifted toward zero: I get a lot more small p-values, and that's because, of course, there's now a real difference to detect. How often did I get p-values between 0 and 0.05? How often am I going to reject my null hypothesis and conclude that cherry juice has an effect, using an alpha level of 5%? Well, it was now 18% of the time. So when I run these trials using an alpha level of 0.05, 18% of the time I'm going to get a significant p-value, that is, a p-value under 0.05. Well, that's my statistical power here. Statistical power is the probability of seeing a significant result, of correctly rejecting the null hypothesis when I should: when there is actually an effect to find, how often do I find it? Not surprisingly, since this is a study with only 10 people per group, we have really poor statistical power to detect an effect of this size; I only have 18% power here.

All right, just for fun, since I've got this simulation set up, the next thing I did was to change the simulation so that I had different true effect sizes. I'm altering the true effect size from negative 5 seconds, a 5-second improvement in running times, all the way up to a 100-second difference between the groups.
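(Here is a sketch of that kind of power-by-simulation calculation in Python with scipy. Note that this uses a plain two-sample t-test with no baseline adjustment, unlike the trials described in the talk, so the exact percentages it produces won't match the figures quoted next; the points to take from it are the pattern across effect sizes and the check that the rejection rate under a zero effect sits at the alpha level.)

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)                 # arbitrary seed
n_trials, n_per_group, sd = 20_000, 10, 30.0

def rejection_rate(true_effect, alpha=0.05):
    """Fraction of virtual trials with p < alpha from a plain two-sample t-test."""
    treated = rng.normal(900.0 + true_effect, sd, (n_trials, n_per_group))  # negative effect = faster
    control = rng.normal(900.0, sd, (n_trials, n_per_group))
    _, p = stats.ttest_ind(treated, control, axis=1)
    return np.mean(p < alpha)

print(round(rejection_rate(0.0), 3))            # true effect zero: the type 1 error rate, ~ alpha
for effect in (-5, -10, -20, -35, -50):         # nonzero true effects: statistical power at each size
    print(effect, round(rejection_rate(effect), 2))
```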
And I tabulated my power in each one of these simulations by looking at how often I correctly rejected the null hypothesis, that is, how often I got p-values under 0.05, using an alpha level of 5%. We already saw that if a 10-second improvement in running times is the true effect of cherry juice, I only have an 18% chance of detecting it. In order to get up to 80% power, the effect of cherry juice would actually have to be a 35-second improvement in running times. Now, that's an implausibly large effect size, I think, for cherry juice. A 35-second improvement in your 5K time is huge; I think there's very little chance that cherry juice has that effect on your running times. So in other words, this study is not adequately powered to detect any realistic effect size that cherry juice might actually have on running performance. That's totally unsurprising, because it's a study with 10 per group; I think we can all guess that this is going to be an underpowered study.

All right, another thing I want to bring up about hypothesis tests: up until now, we've been using a null value of zero. But there's nothing that says you have to use a null value of zero when you're running a hypothesis test. You can choose a different null value, and everything works just the same. So let's look at a slightly different trial with a slightly different goal, a slightly different research question. What if we set up our trial to do what's called a minimal effects test? Maybe instead of wanting to test against zero, we want to test whether or not cherry juice improves running performance by some minimal amount, say five seconds. Maybe you're a coach, and you're thinking about making all of your runners drink cherry juice. Now, cherry juice is costly, right? To me, it's not particularly palatable. Maybe it gives people cramps. It's kind of a burden to have to drink this every day. So you don't want to force all your runners to do this if cherry juice doesn't have at least some minimal effect on running performance, and we're setting that minimal effect here at five seconds. If you want to do this test, your null hypothesis is actually different. Before, we tested a null hypothesis that the effect is zero; the null hypothesis for what we're calling a minimal effects test is that the effect is greater than or equal to negative five seconds. This might take a minute to get your head around, but what we're saying is: I want to test the null hypothesis that cherry juice does not improve running performance by at least five seconds. Remember, the null hypothesis is usually the opposite of what we're trying to prove, so we're going to start by assuming that cherry juice does not have a meaningful effect on running performance. Let me write out the null hypothesis and the alternative a little more formally. We start by assuming that cherry juice has no clinically meaningful benefit; that means it has an effect higher than negative five, where negative five means a five-second improvement in running performance. This is a one-sided test, and the alternative is that it improves running times by more than five seconds, that is, that the effect is smaller than negative five: more negative, a bigger improvement in running times. So in other words, if we can reject the null here, we would conclude that cherry juice has a clinically meaningful benefit.
Now, of course, you don't decide what your hypothesis is and what your smallest effect size of interest is after the fact, after doing the study; you would set up your study with a minimal effect of interest in mind, with a minimal effects test in mind, before you do your study. But let's say that we actually had set up our cherry juice trial with a minimal effects test in mind, and we had decided ahead of time that we wanted to see at least a five-second improvement in running times. How would I actually analyze the data from that trial? Well, it's not really any different from when we choose a null effect of zero. We're just doing another hypothesis test; the only difference is that we've chosen a different null value. Otherwise, there's nothing profound about doing this hypothesis test instead of the common one testing a null effect of zero. My null hypothesis here is that my effect is greater than or equal to negative five. That's actually a range of values, but we're just going to choose the boundary case, because choosing the boundary case, negative five, as your null value gives you the smallest possible p-value. In other words, if you fail to reject your null hypothesis with negative five, you're also going to fail to reject it with any other null value that's bigger. So we just plug in the boundary case for the null value. Everything else is the same. I'm calculating my t as my observed minus my null, divided by my standard error; nothing changes. So I take my negative 20, I subtract negative five, meaning I'm actually adding five there, and divide by my standard error. I get a t-value of negative 1.07. Because I have set this up as a one-sided test, I only care about effects in one direction. I go to my t-chart and get my p-value, and it turns out to be 15%; this is a one-sided p-value. What does that tell us? It tells us that if cherry juice has no meaningful benefit, the probability of seeing an effect this big, negative 1.07 or smaller for our t-statistic, is 15%. In other words, it's totally plausible that we would have seen this just due to chance. I then took my simulation, set the effect size to be exactly negative five, re-ran the simulation, and plotted the t-statistic from each virtual trial. I got the exact same t-distribution. So how often did I get values of negative 1.07 or smaller? Well, I got them 15% of the time, which means our p-value is 15%. Again, I'm not interested in the other tail; I'm not interested in distinguishing between cherry juice having a negligible effect versus a harmful effect, so I'm only setting this up as a one-sided p-value. I only want to look at whether it's beneficial or not beneficial. So my p-value comes out to be 15%. Well, that's pretty high; in other words, it's not that surprising that I would see a statistic this big. So if we run the formal hypothesis test here, again using an alpha level of 5% (we don't have to choose that, but let's say we do), we are not going to reject our null hypothesis. We do not have evidence that cherry juice improves performance by a meaningful amount.
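(And the arithmetic of that one-sided, minimal-effects version of the test, with the same assumed 18 degrees of freedom:)

```python
from scipy import stats

diff, null, se, df = -20.0, -5.0, 14.0, 18    # null: cherry juice improves times by 5 s or less
t = (diff - null) / se                        # (-20 - (-5)) / 14, about -1.07
p_one_sided = stats.t.cdf(t, df)              # lower tail only, about 0.15
print(round(t, 2), round(p_one_sided, 2))
```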
Now, the problem with p-values is that they're often misinterpreted. It's easy to misinterpret p-values because the concept is a little bit tricky. So I've listed here a whole bunch of misinterpretations of p-values. What is a p-value not? A p-value is not the probability that your result arose due to chance. It's not the probability that your result is a false positive. It's not the probability that the null hypothesis is true. All of those statements are just slightly different ways of saying the same thing in slightly different words. That means if you get a p of 0.05, this does not mean that there's a 5% chance that your finding is a false positive, and a p of 0.15 does not mean that there's a 15% chance that the finding is a false positive. Why is that? It's because the p-value gives us the probability of our outcome, the probability of our data and everything more extreme; it doesn't give us the probability of the null hypothesis. A p-value is what we call a conditional probability. It's the probability of our data (and I'm using "data" here as shorthand for our observed outcome and everything more extreme) given that the null hypothesis is true. That is not the same thing as the probability of the null hypothesis given the data; those are two different conditional probabilities. The only way to switch between those two conditional probabilities is to apply something called Bayes' Rule. Bayes' Rule is super cool because it allows us to flip conditional probabilities. So if I have the p-value, which is the probability of the data given the null, and I want to get to the probability of the null given the data, I have to apply Bayes' Rule to make that switch. Applying Bayes' Rule requires me to incorporate some additional information, and that additional information is called a prior distribution. So you need more information, information that's not available in the study, in order to make this switch and talk about the probability of the null hypothesis.

Just as an example, for that cherry juice study, remember that we tested the null hypothesis that cherry juice does not have a meaningful benefit, and we got a p-value of 15%. So does that mean that there's a 15% chance that cherry juice does not have a meaningful benefit, and thus that there's an 85% chance that cherry juice is beneficial? No, that is not what that means; that is an incorrect interpretation of a p-value. In fact, if we were to do a proper Bayesian analysis here and calculate the probability that cherry juice has a beneficial effect using Bayes' Rule, we would get a much lower probability than an 85% chance. And that's because if I were to incorporate a prior here, at least for me, my prior on cherry juice is going to put a lot of weight on effect sizes that are very close to zero. It's very, very unlikely that cherry juice is going to have a huge effect on running performance. As a former runner, I can tell you that nothing like cherry juice is going to make you run 30 seconds faster in the 5K. And so my prior is going to have a lot of weight on null and negligible or very small effects. That means that when I do apply Bayes' Rule here, I'm going to get a probability of cherry juice having a meaningful benefit that's much, much lower than 85%.
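(For reference, Bayes' Rule, which is what lets you flip between those two conditional probabilities, can be written as:

P(null | data) = P(data | null) × P(null) / P(data)

where P(null) is the prior probability of the null hypothesis, the extra ingredient the speaker is referring to.)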
So the value of a p-value is that it helps us to not fool ourselves. The p-value does have a job that can be very useful. Take this cherry juice trial that I've set up, for example. In that trial, we saw a 20-second improvement in 5K running times. Again, as a former runner, a 20-second improvement for a competitive runner in 5K times, that's huge. So you might look at those data and you might say to yourself, wow, that's an impressive effect. If we were to calculate the Cohen's D here, which I see is often used in a lot of sports science studies, that's the standardized mean difference, we get a Cohen's D of 0.65. That would be considered a moderate to large effect. So it looks really impressive. That point estimate looks impressive. And you might be tempted by looking at that to say, wow, I'm going to start drinking cherry juice, because why not, if you might have this big effect? But you have to be careful not to fool yourself because, in fact, when you have a small study in particular, like 10 per group, it's actually really easy to get big effect sizes, big point estimates, in your trial just by chance. So what I'm showing you here is I went ahead and did my simulation again, but I calculated the Cohen's D. This is the standardized mean difference between the groups. And I'm showing you the results of the first 5,000 trials, and these red-starred areas, those are where I got Cohen's Ds of 0.5 or greater. That would be considered sort of a moderate effect size. What you can see is that with only 10 people per group, I can get really large effects just by chance. Sometimes I got Cohen's Ds of 1, even 1.4, purely just due to chance fluctuation, getting a few extra fast runners in one group versus the other. Now, in contrast, if I used a bigger study, like if I have 50 per group, it's actually really hard to get moderate to large effect sizes just by chance. So chance fluctuation plays a smaller role when I have an adequate sample size. What this means is, if I do especially a small study, or a study with a lot of variability, and I see this impressive point estimate, the p-value is handy because I can look at the effect estimate and then I can also look at the p-value and I can go, oh yeah, that high p-value reminds me that this is a result that could easily have arisen by chance. So it helps me to not fool myself. This is particularly important, again, in small to moderate size studies. Of course, p-values have tons of limitations. I'm not going to go over them all today, but I'll just point out a few. So a p-value depends on the effect size and the standard error, and as I've told you, the standard error depends on the variability of the outcome and on the sample size. Any one of those three things (effect size, variability, sample size) can drive a very small p-value, or a large one. So for example, if I have a really tiny effect size, but I have a huge sample size, like hundreds of thousands of people or tens of thousands of people, I might get a highly, highly significant p-value even though the effect size is minuscule. And that's just because a very big sample size can drive a small p-value. So you always have to interpret a p-value in context. You've got to look at the effect size, too. One thing I'd like to point out is that now that we're entering the era of big data, maybe not so much in sports science, but in a lot of areas, we have these huge data sets of 10,000 or 100,000 people.
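A simulation along the same lines is easy to reproduce. This sketch, under the assumption of normally distributed running times with a 31-second standard deviation and no true effect at all, counts how often a "moderate" Cohen's D of 0.5 or more appears purely by chance with 10 versus 50 runners per group:

```python
import numpy as np

rng = np.random.default_rng(0)

def chance_of_moderate_d(n_per_group, n_trials=10_000, threshold=0.5):
    """Fraction of null trials (no true effect at all) where |Cohen's d| comes out >= threshold."""
    hits = 0
    for _ in range(n_trials):
        a = rng.normal(0.0, 31.0, n_per_group)   # both groups drawn from the same distribution
        b = rng.normal(0.0, 31.0, n_per_group)
        pooled_sd = np.sqrt((a.var(ddof=1) + b.var(ddof=1)) / 2)
        d = (a.mean() - b.mean()) / pooled_sd
        hits += abs(d) >= threshold
    return hits / n_trials

print(chance_of_moderate_d(10))   # roughly a quarter of null trials look "moderate" with 10 per group
print(chance_of_moderate_d(50))   # only around 1-2% of null trials with 50 per group
```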
p-values actually are largely irrelevant once we start talking about data sets of that size, unless you're dealing with rare events or a small subgroup. Remember, p-values help us when it comes to figuring out when chance fluctuation is at work. Well, if I have a huge sample size, 50,000 people, there's not a lot of chance fluctuation going on. I'm going to have extremely narrow confidence intervals. So I'm not going to be seeing big effects just through chance fluctuation. So p-values don't really help me in that situation. I can just look at the confidence interval. p-values also don't tell us anything about the quality of a study. You can have a garbage study that gives you a small p-value. So what? It's still a garbage study, right? All of these limitations and misinterpretations of p-values have led to a lot of consternation within the statistical community. And there are a lot of debates going on that you might be aware of about significance testing and p-values. There's a recent Nature article, and lots of other articles have come out. Now you might think this is a new debate, but actually, if you look back, these debates have been going on a while. Here's one from 1997. I think that sometimes when statisticians have these conversations, statisticians are not the best communicators. They end up giving some garbled messages to applied scientists, and applied scientists end up walking out of these conversations feeling a little bit confused about what statisticians are saying. Sometimes I think applied scientists think that statisticians are saying, well, p-values are bad, you should never use a p-value, you should never do a hypothesis test. There might be a few statisticians who are saying they really want us to move to a Bayesian world, but for the most part that's not what statisticians are saying. It's much more nuanced than that. We're not saying p-values are bad, never use them. We're just saying that p-values have a very limited role in statistics and you don't want to over-interpret them. And so I really like the statement that the American Statistical Association put out a few years back on p-values. They highlight things like, what does a p-value do? It just tells us how incompatible our data are with a specified statistical model, with our null value, for example. They highlight that p-values are not the probability of the null hypothesis, that very important misinterpretation. P-values shouldn't be used on their own to make a decision about policy. You shouldn't decide just that, oh, the p-value is less than 0.05, I'm going to do something. You have to look at the context. The p-value is just one thing. You have to look at whether the study was any good. A p-value doesn't tell you about the size of the effect or the importance of a result. So a p-value by itself doesn't really do a lot. My bottom line, my summary of all this, is that the p-value is one of many tools in your statistical toolkit. It does one specific job very well when used correctly, but it's only designed for that one job. So we can't attribute to the p-value a lot of other capabilities that it doesn't have. All right, so I am going to end there. I know that I've thrown a lot at you in about an hour here. The good news is that this video is recorded, so if there's stuff you want to go back and review, you can do that later. I'll just point out a few other statistical resources. I do write a statistics column for a journal, the journal PM&R. You can link to those here.
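A quick simulated example of that big-sample point, with purely hypothetical numbers: give two groups of 50,000 running times a true difference of just one second, a Cohen's D of only about 0.03, and the p-value still comes out tiny, which is exactly why the effect size and confidence interval, not the p-value, carry the useful information in very large samples.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n = 50_000                               # per group: a "big data" setting, hypothetical numbers
control = rng.normal(900.0, 31.0, n)     # simulated running times in seconds
treat   = rng.normal(899.0, 31.0, n)     # a trivial 1-second true benefit

t, p = stats.ttest_ind(treat, control)
pooled_sd = np.sqrt((control.var(ddof=1) + treat.var(ddof=1)) / 2)
d = (treat.mean() - control.mean()) / pooled_sd

print(p, round(d, 3))   # p is typically far below 0.001, yet Cohen's d is only about -0.03
```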
I have lots of different topics, like linear regression, principal components analysis, confidence intervals, scatter plots. I try to write those in a readable way for clinicians and applied scientists. And then I recently launched a medical statistics certificate program through the Stanford Center for Professional Development. That's a three-course series that will teach you about basic statistics, probability, and data analysis. And with that, I am now going to pass this back to you, Stuart. Yeah, I really liked that. I really liked the kind of simulation approach, being able to almost live talk us through the actual effects of what happens if your sample size is slightly bigger or smaller, what happens if you increase or decrease the variance, etc. Yeah, I think that was a really good way of demonstrating some of the possibly more tricky concepts, as you said, to understand when you just see it written down or when you just see or hear somebody talking about it. Yeah, so I really enjoyed that. I've got quite a few comments as well on YouTube. So I'll go for probably the first one. There are quite a few people asking similar questions, which you might expect in sports science, around small sample sizes. So there are a few variants of this question, but in general people are saying with elite athletes, or one person said, especially with Paralympic athletes, it's very hard or unlikely to get large sample sizes. And given everything you've shown about, yeah, how difficult it is with small sample sizes, what would your advice be for people conducting their studies? Yeah, I mean, so there are a couple of different things. Yes, it may be difficult to get large samples. I would certainly encourage scientists to try, when possible, to get large samples. And there are strategies like collaborating across multiple institutions. Those are the kinds of strategies that might help you to increase your sample size. So my first answer would be, don't just assume that it's impossible to get a large sample size. See what you can do to increase that sample size. The second point there would be, let's say you do just want to do a small study. Maybe then your goal in that study is not to do a hypothesis test. Maybe you're not after inferential statistics. Maybe you're just after descriptive statistics, and that's perfectly fine. It's okay to publish descriptive results. Not every study has to be inferential and has to test a hypothesis. So think about what the goals in a study are. You might just want to describe. It's almost like, if you have a 12-person study, it's almost like a case study, right? It's almost like... You froze for a minute. You're back. Perfect. I think we got most of that answered. Brilliant. Thank you. So then the kind of follow-on question from that is... Internet might have blipped out there. Brilliant. Right. You're back. Yeah, it's the same. We got most of that answer. The follow-on, or similar, question was about case studies, really. So if you're working with one single athlete, for example, how does that affect a lot of the assumptions you talked about? The question mentioned that each of your cases or trials are then dependent on each other because they're all from the same athlete. Yeah. We would have a totally different approach if you're looking at one athlete. There is something called an N of one trial.
And from conversations with sports scientists, I often start thinking, oh, I think what they're after is this thing called an N of one trial. That is a methodology that is well developed. So I would recommend going and reading about N of one trials, because maybe really what you're after is that one individual athlete. You're not after making inferences to the larger population, you're just testing different interventions in that one athlete, and there's an established methodology for doing that. It looks different than what I talked about today, obviously, because you're not trying to make inferences to the larger population. Yeah, I completely agree. And yeah, just going on with this kind of theme of things typically observed in sports science, another question was around multiple comparisons. So saying, as well as small samples, we often measure lots of different variables, and we say which ones have significantly improved or which ones have significantly changed between groups, for example. Thank you for asking that question. That was the last slide that I cut out. Obviously my talk was a little bit too long, and so I had a whole slide on that, and at the end I decided, okay, I'm going to cut that one out just because I already have a lot to say. But I have lots of thoughts on multiple testing, and having looked through the sports science literature a little bit recently, what I've noticed is that, yeah, multiple testing is a huge issue, because the typical sports science paper that I've seen tends to test 8, 9, 10 dependent variables, and then in a lot of papers I've seen they look at multiple time points and they don't correctly analyze them in a repeated measures framework. So you end up with a 12-person study where, and this is sort of the prototypical paper I would say, you have 50 tests that you've run. Well, if you run 50 tests, each with an alpha level of 0.05, or whatever alpha level you choose, you are going to get p-values under 0.05 just by chance. If your alpha level is 0.05 and your treatment does nothing, you have a 5% chance of a false positive on each test, and so if you run 50 tests, then you're going to get p-values under 0.05, you know, two, three times at least. And so it becomes this really big problem that you're going to find things, and you have to either interpret those in the context of, hey, I'm expecting three chance findings, so let me put that as the caveat here: if I find three significant things, or whatever you want to call them, I then say, well, this is consistent with chance, actually, so this is probably due to chance. Or you can do formal corrections for multiple comparisons. I feel that it's a little bit hard in a lot of studies to do those formal corrections. I think interpreting in context is fine, or, even better, design your study with a primary hypothesis and time point in mind. Set up your study with a primary hypothesis, and if you've got multiple time points, do some kind of repeated measures approach. Yeah, thank you. The second part of that question, when you mentioned formal corrections, they asked, let me find where it's gone, basically about Bonferroni corrections, and whether there are any better alternatives than just kind of making it very unlikely to find anything. Right, so there are certainly other alternatives to Bonferroni. Bonferroni would be considered the most conservative, because it assumes that everything you're looking at is independent, and of course, if your outcomes are jumping, running, throwing, whatever it might be, those are likely connected to one another, so they're not independent, and therefore Bonferroni is going to be overly conservative.
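To put a number on that conservativeness, here is a small simulation sketch assuming, purely for illustration, ten correlated outcome measures with no true effects and a pairwise correlation of 0.7; it compares the family-wise error rate with and without a Bonferroni correction:

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
n_sims, n_tests, alpha, rho = 20_000, 10, 0.05, 0.7   # ten correlated outcomes, all with no true effect

# Simulate z-statistics for ten outcomes with pairwise correlation 0.7 (compound symmetry)
cov = np.full((n_tests, n_tests), rho) + (1 - rho) * np.eye(n_tests)
z = rng.multivariate_normal(np.zeros(n_tests), cov, size=n_sims)
p = 2 * norm.sf(np.abs(z))                             # two-sided p-values

fwer_uncorrected = (p < alpha).any(axis=1).mean()          # chance of at least one "significant" result
fwer_bonferroni  = (p < alpha / n_tests).any(axis=1).mean()

print(fwer_uncorrected)   # well above 5%: the multiple-testing problem at work
print(fwer_bonferroni)    # below the nominal 5%, because Bonferroni ignores the correlation and over-corrects
```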
So there are certainly other procedures, like the Hochberg and the Holm procedures. They're similar to the Bonferroni in that they're easy to apply, but they're less conservative, so those would be things to look at. I also think that we don't always need to do formal corrections for multiple comparisons, but if you don't, then you'd better interpret, again, in context, or set a primary hypothesis at the beginning of the study, so you're keeping your control on that one hypothesis and everything else is recognized as exploratory. Yeah, thank you, that definitely makes sense. Just looking now, I think we've roughly covered, at least in broad conversations, most of the topics that people have asked about. There's one more question now, the last question that's just come through. It says, great talk, thanks Kristin, I agree. Is there any merit in the use of the simulation technique demonstrated, using sample statistics derived from a conducted experiment as the input? Yeah, so I mean, if you're just kind of using off-the-shelf statistics, we already have the standard errors derived, so you don't need to do this to, say, get what the standard error is. However, if you're going to be using a statistic that's a little bit different than our off-the-shelf statistics, you could use a simulation approach. Or, I don't know if you've heard about the bootstrap: that's essentially what a bootstrap is doing, it's doing repeated sampling from within your own data. So something like a bootstrap is following the same general idea, that you would actually empirically calculate that standard error. So you'll see a lot of studies do a bootstrap. But, you know, if you're doing kind of off-the-shelf statistics, you don't need to do that. However, you can use a simulation to help you understand those off-the-shelf statistics. So you can set up simulations to help you understand statistics, and when I am evaluating new statistics, either because somebody asked me to look at one or because I'm going to use some statistic I've never used before, as often happens, some new model that I've just now implemented, the first thing I'm going to do is run a simulation. That simulation allows me to understand how that statistic works, right? It's going to allow me to get a feel for how that statistic works, and so that gives me confidence in my use of that statistic, or, if I've been asked to evaluate it, it helps me understand it and to evaluate it. Yeah, thanks for that. Yeah, I think that's a really good answer. And yeah, I think we've covered most of that. If anybody has or wants to find out more about some of these topics, or has any kind of follow-up questions if they're watching it in the week to come, are there any recommendations of how people can either get in touch or find more follow-up resources? Sure, yeah, I gave a few links in my slides. Feel free to email me if you have additional questions, I'm kcovidstanford.edu, and as long as your email isn't too long, I usually respond quickly. And yeah, you know, and then just the links that I already shared. Thank you. And then, yeah, before we go, one last question, with a laughing emoji at the end of it: so, can you ask Kristin, does she recommend the use of magnitude-based inferences? I hope she's left when you ask that. I think people probably already know the answer to that. Yeah, I don't recommend the use of magnitude-based inference, for a variety of reasons that I don't have time to go into here, but I can refer you to papers I've written about it that explain why that's not a good choice for your data analysis.
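Picking up the bootstrap point from the answer above: a minimal sketch of what resampling your own data looks like in practice, using made-up raw running times in place of the real trial data, would be something like this.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical raw data standing in for the cherry juice trial (seconds); made up for illustration
treat   = rng.normal(885.0, 31.0, 10)   # around 14:45
control = rng.normal(905.0, 31.0, 10)   # around 15:05

n_boot = 10_000
diffs = np.empty(n_boot)
for b in range(n_boot):
    t = rng.choice(treat,   size=treat.size,   replace=True)   # resample within each group
    c = rng.choice(control, size=control.size, replace=True)
    diffs[b] = t.mean() - c.mean()

print(diffs.std(ddof=1))                   # bootstrap standard error of the difference in means
print(np.percentile(diffs, [2.5, 97.5]))   # simple percentile 95% confidence interval
```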
Yeah, and for anyone who is interested, I'd strongly recommend that they do read some of Kristin's papers on that topic, which are really good. And even if you do prefer to watch or listen to things, there's an excellent episode on the Everything Hurts podcast. I think the first time I listened to that podcast was actually when Kristin was on it, and then I kind of got hooked from there. And yeah, it's a really good podcast for anyone who's interested in science and scientific methods. I know Kristin got into a lot more detail on that as well. Okay, well, yeah, thank you ever so much for giving up your time and for such a really insightful presentation, Kristin, and yeah, thank you very much. Thanks to you for having me. Thank you.