This video is about understanding p-values and confidence intervals through resampling from the data that we have for a research project. My name is Dr. Juan Klopper, and I'm a data scientist and research fellow at the School for Data Science and Computational Thinking at Stellenbosch University.
So imagine that we have subjects taken at random from a population. That gives us a sample, and we're going to gather data for two variables from our sample. The one is going to be a nominal categorical variable, and it only has two sample space elements, A and B. Knowing whether a subject is in group A or group B allows us to create these two groups. Our other variable is going to be a continuous numerical variable, and what this allows us to do is compare the value of this continuous numerical variable between the two groups. One way to do that is to compare the means, so our test statistic is going to be a difference in means.
Now, for our subjects we are going to find a difference in means, and we want to know how likely it was to have found this difference. That's what we're going to do through resampling. If we combine all our subjects, we have a continuous numerical variable value for each of them, and we can express a mean for them as well. Now think about the population from which our subjects were taken. Each member of the population has a value for that variable as well. Of course, if we knew all of those values we could calculate the mean in the population, and we would call that a parameter. Somehow our statistic has to be compared to what this parameter might possibly be in the population. So we have to express uncertainty in our statistic, and the way to do that is to calculate confidence intervals. We're going to do this through bootstrap resampling.
Of course, to do all of this we can use a computer language, and in this video we're going to make use of the Wolfram Language.
We've opened our Wolfram notebook. I've called it Test Statistic Estimations, and in it we're going to do some bootstrap resampling to calculate confidence intervals. Those are, of course, a measure of uncertainty in our results, seeing that we only have a sample from the population and not the data from the whole population. We're then also going to simulate the probability of having found the test statistic that we found.
So let's just simulate some data. We're going to imagine that we have a categorical variable with a two-element sample space, and from those two unique elements we form two groups. Then we have a single continuous numerical variable that is collected for each group. Let's simulate that the number of subjects in group 1 is 100 and in group 2 it is 105, and let's assign those to the computer variable names n1 and n2.
Now let's simulate some random values and assign them to the computer variables var1 and var2. As you can see there, both are taken from a normal distribution. The first has a mean of 100 and a standard deviation of 10, and we want n1 of those, so that's 100. The second comes from a normal distribution with a mean of 103 and a standard deviation of 13, and we want 105 of those. We also seed the pseudo-random number generator. If you seed it with 12, you're going to get exactly the same results.
The next thing I want to do, because under the null hypothesis we're going to reassign each one of these values to a random group, is to combine the two lists. We take var1, append var2 to it, and pass that as an argument to the Flatten function.
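The simulation just described can be sketched outside the notebook as well. The video works in the Wolfram Language; what follows is only a Python standard-library analogue, not the notebook's code. The name var_all is an assumption, and although the seed 12 mirrors the narration, Python's generator is a different one, so the numbers will not match the video.

```python
import random
import statistics

# Assumption: seeding with 12 mirrors the narration, but Python's generator is
# not the Wolfram Language's, so these values differ from those in the video.
random.seed(12)

n1, n2 = 100, 105  # group sizes given in the narration

# Each group is drawn from a normal distribution (mean, standard deviation).
var1 = [random.gauss(100, 10) for _ in range(n1)]
var2 = [random.gauss(103, 13) for _ in range(n2)]

# Combine both groups into one flat list, ready for random reassignment later.
var_all = var1 + var2

print(len(var_all))  # number of pooled values
print(statistics.mean(var1), statistics.mean(var2))  # the two sample means
```

The two printed means will sit near 100 and 103 but not on them, which is the point the narration makes next: with only 100 and 105 values, the sample means differ from the distribution means.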
So what we have here is 205 values as a single list object.
Now, summary statistics and data visualization: remember that those come first, so that we can get some idea of what is going on. We know what distributions we took the random values from, but let's see what the sample means were, because we only took 100 and 105 values. We see 101.05 and 103.919. Those are the two means, and the test statistic that we're going to use, both for estimating our uncertainty and for calculating the probability of having found it, is in our instance the difference between these two means.
Data visualization is always very important. So let's do a box-and-whisker plot of these two groups and start to see whether we think there will be a statistically significant difference between them, if we can use that term. We see the box-and-whisker plot there.
So let's talk about the uncertainty in our results. Let's concentrate on group 1. For group 1 we have a mean; let's recall what that mean is, and we see it's 101.05. Now, those subjects came from a population, and we need to express our uncertainty about what this statistic is in the population. So what is the parameter for this continuous numerical variable in the population from which the group 1 sample was taken? We have to set a confidence level, and as is common in many fields, we'll use a 95 percent confidence level.
So how do we go about this? We don't know what the variable values are for the whole population, but we want to put some bounds, a lower and an upper bound, on an interval that expresses a 95 percent confidence level for where we think the population parameter falls. We're going to make use of bootstrap resampling. So what is bootstrap resampling?
It says: take all the values that we have, so for group 1 all those 100 values for our continuous numerical variable, and select one subject from them. We throw them all in a hat, we draw one, and we write down the number. Then we throw that number back in the hat, mix it around, draw another one, capture that value, throw it back in the hat, and draw another one. So that's resampling with replacement. That means some of those 100 values in the hat are almost certainly going to be drawn more than once, and that's exactly what we want. If we didn't do replacement, of course, if we just took one out and didn't throw it back, the resample would be exactly the same as the original, and that's not what we want. And because we have computers, we can do this many times over.
So how do we do that? Well, we use the RandomChoice function, as you can see there. We pass var1, which contains all our values. And how many do we want? Well, we want the
length of var1, so exactly the same number of subjects. If we give RandomChoice this list and the length of the list, the Wolfram Language does do the resampling with replacement, so no concerns there. Then we calculate the mean. So we do that resampling so that we have exactly the same sample size as the original, and we calculate the mean. We pass all of this to the Table function and do it a thousand times over, which means we're going to have a thousand means. They're all going to look slightly different, because resampling with replacement means that every time we do this we're going to get a different mean. RandomChoice is going to draw exactly 100 values for us, exactly the same length as the original, and we calculate the mean from that. The next time we do this we're going to have different values, and different ones are going to be repeated.
So let's do that a thousand times over. We now have these means, so let's plot a histogram of them. That's the distribution of the bootstrap means for group 1. We found a specific mean, of course, 101.05, but because we've done this bootstrap resampling, in other words resampling with replacement, we get all these other means as well.
What we want to do now is put them all in order, as you can see here on the x-axis of the histogram. If we're dealing with a 95 percent confidence level, we want to know the values among these thousand means that represent the 2.5th percentile and the 97.5th percentile, because that will give us the 95 percent in the middle. So we put all these values in order, and we have a thousand of them. So we
have a little equation in here. It's 2.5 divided by 100, so that's 0.025 for the 2.5th percentile, times the number of bootstrap means, which is 1,000, and that gives us a value k. If we put the values in order and look at the k-th value, that is going to represent the 2.5th percentile value. We do the same for 97.5. If we do that, we soon find out that what we are looking for is value number 25 and value number 975, so we use those indices once we've put them all in order. So we sort all those thousand means and we want the 25th one, which represents the 2.5th percentile value. Let's do that, and we get 99.039. We want the upper one as well, and that's about 102.7.
What we can do now is plot this, and there you can see I've plotted it. The dark black line in the middle is the actual mean for that group, and we see the 95 percent confidence interval around it. So we can say that the mean was about 101.1, with a 95 percent confidence interval of about 99.0 to 102.7.
Now, that does not mean there is a 95 percent probability that the population from which the sample was taken has its parameter, the mean for this continuous numerical variable, between these two values. What it really means is this. Suppose we were to repeat our study a hundred times over, which of course is impossible to do in real life, because we don't have that kind of money, time, or resources. Every time we'd get a slightly different mean, because we'd have new samples. If we start our study a week later, we'll have new subjects in it. Each of those hundred studies would have slightly different confidence intervals, and about 95 of them out of a hundred would have the population parameter for that variable within their confidence intervals. We don't know if this is one of those 95, but we can certainly now express our uncertainty in our results.
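The resample, sort, and pick-the-k-th-value steps above can be sketched as follows. This is a Python standard-library analogue rather than the notebook's Wolfram code: random.choices plays the role of RandomChoice (it samples with replacement), and the function names bootstrap_means and percentile_interval are assumptions made for illustration. The data are a simulated stand-in for group 1, so the bounds will not match the values in the video.

```python
import random
import statistics

random.seed(12)
# Stand-in for group 1; values differ from the video's Wolfram output.
var1 = [random.gauss(100, 10) for _ in range(100)]


def bootstrap_means(values, n_resamples=1000):
    """Means of n_resamples resamples, each drawn with replacement
    and each the same size as the original sample."""
    size = len(values)
    return [
        statistics.mean(random.choices(values, k=size))
        for _ in range(n_resamples)
    ]


def percentile_interval(stats, level=0.95):
    """Pick the k-th ordered values that bracket the middle `level` share."""
    ordered = sorted(stats)
    alpha = (1 - level) / 2                      # 0.025 for a 95 percent level
    k_low = max(round(alpha * len(ordered)), 1)  # 25 for 1000 means
    k_high = round((1 - alpha) * len(ordered))   # 975 for 1000 means
    # The narration counts positions from 1; Python indexes from 0.
    return ordered[k_low - 1], ordered[k_high - 1]


boot = bootstrap_means(var1)
lower, upper = percentile_interval(boot)
print(lower, statistics.mean(var1), upper)
```

Taking the k-th ordered value is the simple percentile method from the narration. Other bootstrap interval methods exist, but they are outside what the video covers.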
It was just over 101, but we see this 99 to 102 or 103, thereabouts. We see the uncertainty in our results expressed as a confidence interval.
So let's now calculate the probability of our test statistic. Under the null hypothesis we would say that there is no difference in means between these two groups. We see our null hypothesis up here, and our alternative hypothesis: there is a difference between the two means. We've written the mean for group 1 as x sub 1 with a little bar over it, and the mean for group 2 as x sub 2 with a little bar over it.
Under this null hypothesis, which says there's no difference in the means, we can take all those 205 values and randomly reassign them: group 1, group 2, group 1, group 1, group 2, and so on. You can just dish them out at random. It doesn't matter what group they belonged to, because we just said under our null hypothesis that the two groups are the same.
We're going to do that by creating a little function, and we have it here: a user-defined function, randomGroups. It takes an argument that we call var_, and we assign to it a Module. A Module just gives us variables that are local to our function and that we can work with, and we'll call them g1 and g2. And what do we do? Well, we use the TakeDrop function on a RandomSample of the argument, which will be varAll. Remember, that is where we put all the values together in one flattened list. RandomSample mixes them all up, and TakeDrop takes the first n1 of them. Remember, n1 is 100, so 100 random values go to g1 and the rest go to g2. Then comes a semicolon, because thereafter, what do we want this function to return? Well, we want the mean of g1 minus the mean of g2.
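A function along the lines just described might look like this in Python. It is an illustrative analogue, not the notebook's Wolfram definition: random.sample stands in for RandomSample, list slicing stands in for TakeDrop, and the name random_groups and its n_first parameter are assumptions. Unlike the narrated version, which reads n1 from the notebook, this sketch takes the group size as an argument.

```python
import random
import statistics


def random_groups(values, n_first):
    """Randomly reassign values to two groups and return the difference
    between the group means (first group minus second group)."""
    # Sampling every element without replacement is just a random reordering.
    shuffled = random.sample(values, k=len(values))
    g1, g2 = shuffled[:n_first], shuffled[n_first:]
    return statistics.mean(g1) - statistics.mean(g2)


random.seed(12)
# Simulated stand-in for the pooled list of 205 values.
var_all = ([random.gauss(100, 10) for _ in range(100)]
           + [random.gauss(103, 13) for _ in range(105)])

print(random_groups(var_all, 100))  # one difference in means under the null
```

Note that this reassignment is done without replacement, so every value lands in exactly one group. That is what distinguishes it from the bootstrap resampling used for the confidence interval.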
So that's our function, and now we're just going to use this function a thousand times over. I'm going to say means1000; that's going to be my computer variable name. I create a table of a thousand values, as you can see there. And what do I want in it? I'm calling my function with my flattened, combined list. So a thousand times over it's going to randomly reassign subjects to group 1 and group 2 and calculate the new difference of means, because the difference of means is my test statistic.
Let's run that, and let's create a histogram of what we would call a sampling distribution of the test statistic. In most cases we see there'll be little or no difference between the two, and as we go further out it becomes less likely to have such a value.
Now let's capture our test statistic in a computer variable, actualMeanDiff. We just take the difference between those two means, and we see it was minus 2.86952. We do remember, though, that our null hypothesis is that they are equal and our alternative hypothesis is that they are different. It doesn't say one is more than the other. It's not a one-tailed hypothesis; it's a two-tailed alternative hypothesis. So we could also have said the mean of var2 minus the mean of var1, and we would get positive 2.87. We just have to remember that fact.
So if we look at minus 2.86, it's way out here, and positive 2.86 will be way out there. All we want to know now is, if we put all these differences in means that we've just calculated, those in means1000, in order, what fraction was less than this value, and we add to that the fraction that was more than the positive of it, positive 2.86. That's exactly what we do here. Fraction lower and fraction higher are what we're going to calculate. We do a numerical computation there, because we want a numerical value.
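Building the thousand reassigned differences and capturing the observed difference can be sketched like this. Again this is a Python standard-library analogue of the narrated steps, with assumed names (means_1000, actual_mean_diff) and simulated stand-in data, so the observed difference will not equal the minus 2.86952 seen in the video. A rough text histogram replaces the notebook's plot.

```python
import random
import statistics
from collections import Counter

random.seed(12)
n1, n2 = 100, 105
var1 = [random.gauss(100, 10) for _ in range(n1)]
var2 = [random.gauss(103, 13) for _ in range(n2)]
var_all = var1 + var2


def random_groups(values, n_first):
    """Difference in means after one random reassignment to two groups."""
    shuffled = random.sample(values, k=len(values))
    return statistics.mean(shuffled[:n_first]) - statistics.mean(shuffled[n_first:])


# One thousand differences in means under the null hypothesis.
means_1000 = [random_groups(var_all, n1) for _ in range(1000)]

# The difference actually observed with the original group labels.
actual_mean_diff = statistics.mean(var1) - statistics.mean(var2)
print("observed difference:", actual_mean_diff)

# Crude text histogram: one '#' per 10 simulated differences in each unit bin.
bins = Counter(round(d) for d in means_1000)
for b in sorted(bins):
    print(f"{b:>3} {'#' * (bins[b] // 10)}")
```

The bars should pile up around zero and thin out toward the edges, which is the shape the narration describes.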
We say the length of the following selection. We use the Select function on means1000, and then this placeholder. Remember, the placeholder is simply the hash symbol, or pound symbol. We select the ones that are less than actualMeanDiff. Don't forget your ampersand there, because it marks the end of the function that Select applies to each element. So it's going to select, from all my simulated differences under the null hypothesis, the ones that are less than this difference, and because we divide by a thousand we get the fraction of them. Then we do the same for the fraction that is more than the positive of that difference. That gives us 0.028 and 0.023, and if we add them together, that gives us the probability of having found this difference under the null hypothesis. We are now simulating what fraction would be the difference that we found or more extreme, on either side.
So let's do a little histogram of that, and there we have it. We have these values here; that is our test statistic on either side, and we now have a simulated p-value from our resampling. Let's use Student's t-test to get a p-value for these two groups as well. We see 0.051, and we calculated 0.051. These two values are very, very close.
So that's it. It's very simple to do bootstrap resampling so that you can express the uncertainty in your statistic. In this case we only looked at the mean for group 1, but you can also look at the standard deviation, the variance, or whatever statistic you're interested in. Then, for our actual research question here, we had a null hypothesis and an alternative hypothesis, and we built a sampling distribution of possible differences in means under the null hypothesis, where we just reassign our values at random to the two groups because, under the null hypothesis,
we say that the means are equal in both groups.
So I hope this gives you a good understanding of how we can simulate the uncertainty in our results, and how we can use a sampling distribution of our test statistic to determine how likely it was to find the test statistic that we did, or one more extreme on either side, under the null hypothesis.
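For reference, the two-tailed fraction counting described earlier, the fraction below the negative difference plus the fraction above the positive difference, can be sketched as below. This is a Python standard-library analogue, not the Wolfram Select expressions from the notebook. The function name two_sided_p_value is an assumption, the data are simulated stand-ins, and the resulting p-value will differ from the 0.051 reported in the video.

```python
import random
import statistics


def two_sided_p_value(null_diffs, observed):
    """Share of simulated differences more extreme than the observed one,
    counted in both tails, as in the narration."""
    cutoff = abs(observed)
    fraction_lower = sum(d < -cutoff for d in null_diffs) / len(null_diffs)
    fraction_higher = sum(d > cutoff for d in null_diffs) / len(null_diffs)
    return fraction_lower + fraction_higher


random.seed(12)
var1 = [random.gauss(100, 10) for _ in range(100)]
var2 = [random.gauss(103, 13) for _ in range(105)]
observed = statistics.mean(var1) - statistics.mean(var2)

# Null distribution: shuffle the pooled values and split them 100 / 105.
pooled = var1 + var2
null_diffs = []
for _ in range(1000):
    random.shuffle(pooled)  # shuffles in place
    null_diffs.append(statistics.mean(pooled[:100]) - statistics.mean(pooled[100:]))

print("simulated p-value:", two_sided_p_value(null_diffs, observed))
```

Using the absolute value of the observed difference means it does not matter whether the difference was taken as group 1 minus group 2 or the other way round, which matches the two-tailed alternative hypothesis in the video.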