 This video is about understanding p-values and confidence intervals through re-sampling from the data that we have for a research project. My name is Dr. Jean Clapper and I'm a data scientist and research fellow at the School for Data Science and Computational Thinking at Stellenbosch University. So imagine then we have subjects taken at random from a population. That gives us a sample. And we're going to gather data for two variables from our sample. The one is going to be a nominal categorical variable and it only has two sample space elements A and B. And that's going to allow us, you know, whether a subject is in group A or group B to create these two groups. And then our other variable is going to be a continuous numerical variable. And what this is going to allow us is to compare this continuous numerical variable value between the two groups. And one way to do it is to compare the means. So our test statistic is going to be a difference in means. Now for our subjects, we are going to find a difference in means and we want to know how likely was it to have found this difference. And that's what we're going to do through re-sampling. If we combine all our subjects, we have a continuous numerical variable value for each of these and we can express a mean for them as well. Now if we think about the population from which our subjects were taken, they have that variable and a value for each member of the population as well. And of course if we knew all of that, we can calculate the mean in the population and we would say that that is a parameter. And somehow our statistic has to be compared to what it might possibly be, what this parameter might possibly be in the population. So we have to express uncertainty in our statistic. And the way to do that is to calculate confidence intervals. And we're going to do this through bootstrap re-sampling. Now to do all of this, we're going to use a computer language and in the case of this video it's going to be the Julia language for scientific computing. And we're going to do our coding here in Microsoft Visual Studio and we're using the Julia plugin. So I've created an environment specifically for this project. So if we look down here, it's in a folder that I've called Data Science Julia. And if we look what's inside of that folder, we see that there is a project.toml file and also a manifest.toml file. So if we say Julia and dash dash project equals dot, that is going to just activate for us the project that's in there. If I hit right click, we'll see it's Data Science Julia. That's the environment that we're in. And if we look at just that ST, very small down there, but you can see all the packages that I have installed. We hit the back space and we are out of the package module there. So let's have a look at the libraries that we're going to use. So we're going to import data frames. What I'm going to do to execute this line of code is hold down option and hit return so that it'll be alt enter on a Windows machine. We see the little tick mark there to show us that that package has been imported. The next one that we're going to use is distributions because we're going to take from a normal distribution. We're going to do some plotting with the stats plots library. And we're going to set our back end at least to plotly. That gives us a bit of interactive plotting there. We're also going to use plot themes because I'm using a dark theme here in VS Studio, which means that we want to plot our, we want a female plot so that they also dark. We're going to use some functions from stats base. And then from lathe.preprocess, we're going to import the train test split function. And then lastly, we're going to just compare this resampling that we're going to do with students t-test and see if we get the same results. So we're also going to import hypothesis tests. So the theme, the theme comes from this plot themes library and we're just going to use the symbol there, solarize. So colon, solarized. And that's going to theme our plots for us with a dark background. So let's generate some data. As mentioned, we're going to take our values from a normal distribution. And we then going to generate two groups so that we will assign some of the values to one group and some of the values to another group. So we're going to use the ran function just to take random values. And then we're using the normal function here from the distributions package. So distributions.normal. So I put this dot notation there so that we can see where this normal function comes from. If you're not used to Julian, you haven't seen these functions before. You're not quite sure from what package they come. So when I've used the function for the first time, usually, I'll put the package that it is from. So we're going to use distributions.normal, takes two arguments, the mean and the standard deviation. And so we want, from a mean of 100 standard deviation of 10, close those parentheses, that's the normal function. And then the second argument for the ran function would be 1000, because we want 1000 values of a hold down option and return. We'll see that I've set in the settings that we get this in line solution, we see the results in line. So if I hover over there, we can see the values there. There's a thousand of them, of course, we won't see any. And then we'll use the sample function from stats base. So it's stats-based.sample. We give it this vector to choose from, this A and B. And I could put single quotation marks as well, because these are just single characters. But we'll have them as strings and we want a thousand of them. So there'll be A, A, B, A, B, little sample with replacement. So let's run that, and we have another thousand elements. So let's build that into a data frame. So data frame from the data frames package, that builds something that looks very similar to a spreadsheet file. So we can create the column headers. So if you think of a spreadsheet with the first row and that first row at the top of each column, we'll give that the variable name. So we're going to call our first one var, var with the uppercase v and then group with uppercase g. And to that we assign these thousand values var and these thousand values group. And that's going to give us a data frame. And if we just wait for that to execute and we hover over that, we'll see that we now have a thousand by two data frame. So we can see there the rows on the left hand side. Julia is one index, so it starts at one. And so we see all the row values there. That will be all our thousand observations. We see the variable var and the variable group. We see that variable var was noted by data frames to be 64, but floating values and group that was interpreted as a string. Next thing that we want to do is we want to create two subgroups. And we're going to group by the sample space elements in the group variable. And we know that A and B, so we're going to create two sub data frames. And I'm going to call them GD for group data frames. And that's dataframes.groupby, the group by function from the data frames package. First argument is the data frame itself. And the second one in symbol notation, so it has to have a colon in front of it, is group. So we're going to group by the sample space elements of that group variable. And if we do so, we now have two data frames. As you can see there, you'll see the first one. And of course, the group will all be just A's. And in the second one, they'll all just be B's. And you can see how many rows were emitted there. So summary statistics as always first. So let's have a look at what the mean is for all the subgroup that just contain group A. So the way that we do that is we're going to see GD. And GD has got two sub data frames, one containing all the A's and one containing all the B's. So we use index notation. So we're going to say GD and then index, that's square brackets, one, that'll be the group A. And GD, two, that will be all the group B's. And we're passing all of that to the mean function. By the way, we put dot var, because we're looking down the variable column. So for the data frame, it's just A's. Now to look down the var column, we calculate the mean and the same for the group B. And we just assign that to computer variables mean underscore A and mean underscore B. And then we're also going to have a mean underscore all because we're just thinking about all the subjects in our data set here, and we calculate the mean of that whole original data frame down the var column. So let's do those three. I'll hold down option and hit return. So we see in my instance, remember this, we did not see the C to random number generator. So every time we run this code, we're going to see different values. So in this instance, we have a mean of 100.8 and 101.1, just rounding off a little bit there. And let's look at the overall mean. And we see that's 100.95. Now I just want to save the difference in means between group A and group B. But remember, the order of subtraction matters here. If group A minus group B, so the mean underscore A minus mean underscore B is going to be different from mean underscore B minus mean underscore A. One will be negative and one will be positive, depending on the order of that subtraction. So we pass that difference to the ABS absolute value function so that we just have the positive of that difference. And we see the difference is 0.22. Let's visualize this. And what we're going to do is we are using stat splots. So we can make use of this macro at DF. That's going to be a macro generating code that's specific to using a data frame object in our plotting. So at DF, and then we have to use the name of our data frame, so that was DF original. And then we're going to do a box plot. And in the box plot, with a box plot, we want on the x-axis all the sample space elements of our group, so A and B, so we have two box plots. So we're going to say string dot. So we're calling the string of all the values that are in there, they're already strings. But just to make doubly sure you can do that because if they encoded a zero and one and they look like integers, we're just going to make sure that they are strings. So the dot notation there, so for every element, and we're going down that group column. So we have to use symbol notation, colon group. And then on the y-axis, we just want the var variable, that column, so colon var. We put a title and a label and let's hit option and return. Now we know we're still going to have to wait a little bit for our first plot here in Julia. By the way, while we wait, the storm is just really getting worse and worse. Wintertime in Cape Town, plenty storms. And there we see we have our first box plot here on the top right-hand side. It's a box and whisker plot. It's plotly, so it's interactive. If I hover over these elements, we'll see some of the values there for us. We see the quartile values, we'll see the fence values, et cetera. And we'll see there that we have some suspected outliers as well. So now what we want to do is what the probability of our test statistic was. Remember, our test statistic is this mean difference. We've done some research, we've generated two groups, hopefully by unbiased sampling from a population and we capture value for a certain variable and we're interested in our test statistic, which is the difference in means. We have a difference of 0.22 and it might be negative 0.22, depending on the order of subtraction. So we just want to know how likely was it to find that difference, the probability of our test statistic. And what we're going to do is we're going to build a sampling distribution of means. So what we need to know just for this exercise is how many subjects were in group one and how many were in group two. So the group A and the group B, what were the sample sizes. We're just going to assign that to computer variables N underscore one and N underscore two and in my case I see 505 and 495. And what I want to do is to capture the fraction of the group ones over the sum total. So we see the group one made up 50.5% of the whole data set because what we want to do now is consider our null hypothesis and for two-tailed null hypothesis, we're going to state that there is no difference in means between the two groups. Now if that null hypothesis is true, it really does not matter to what group any subject was randomly assigned. Under the null hypothesis, it should make no difference because we're saying that the means are equal in both groups. And that's what we're going to simulate. What we're going to do is create a little function to split the data. So we use the function keyword here in Julia, give it a name, sampling underscore IDX and it's going to take two arguments. Now one is going to be our set of values and the second one is going to be the fraction. So what we do with train test split, remember that's from the latest package, we need two arguments there for train test split, this list of values and the fraction. And what it's going to do, remember any vector inside of Julia has an index and Julia is one index. So the first element would be one, have an index value of one, second one, two, three, et cetera. So what the train test split is going to do is it's going to generate a list of two indices, so two lists of indices. I'm going to call them IDX one and IDX two and the fraction, so the 50.5% at random, so it knows how long this vector is that you're going to pass it. And so it knows from the fraction how many it has to extract as random and that will be the indexes one and the IDX two will be all the remaining ones that were not chosen. So that I can do the following neat trick. I can say var, remember that was my variable that contains all thousand values for our continuous to make a variable. I just want the ones that are in this index. So this index is going to be random values making up 50.5% of the total. And G2 is going to be the IDX two, the ones that were not chosen. And what I want to return is the mean of now this reassigned group minus the mean of this reassigned group. And that's exactly what's happening. I've put all the subjects together in one bowl and we're just randomly reassigning them to one of two groups. So again, I say under the null hypothesis it shouldn't make any difference. So I'm just going to hold option and hit return and we have our genetic function now with a single method. Now, what are we going to do is to do 2000 of these reassignments under the null hypothesis. So all the subjects go into a head and we just draw them out at random and some of them go into group A and some go into group B or one and two. Doesn't matter what it is, G1 and G2 as we've labeled them here. And we calculate a mean so that we build up 2000 possible means by reassignment. So I'm going to use a computer variable reassign and set that to 2000. And then we're going to use a little bit of list comprehension and we're going to build up this value statistic and to that I'm assigning this list of values. So sampling underscore IDX var comma frack. So var we know is that our function here takes two arguments. So the var is that 1000 values and the fraction is 50.5. So it's going to split them up, assign them, do the means, subtract them and I have a statistic and I'm going to do that for I in one to reassign. So for one to 2000. So we do that 2000 times. So let's hold down option, it enter and we now have our 2000 test statistics. So it's a sampling distribution of test statistics under the null hypothesis. So let's do a histogram of that sampling distribution and there we see we only did 2000 but it's approximating a normal distribution here. So if we look at the X axis of the histogram we see all the values for our test statistic, our difference in means and our difference in means is right up here. So from here in the Julia, here we are in Julia workspace here and we see there was mean diff, it's a bit small but it was 0.22. So our difference was 0.22 which is somewhere on the X axis. Remember we've also got to have negative 0.22. It's a two-tailed alternative hypothesis that the mean means are not equal to each other. So I've got to reflect it on either side. So what we now want to know from these 2000 possible values, what fraction of that was less than negative 0.22 and what fraction was more than 0.22? So if we go to 0.22 and we look at the fraction of values below that and the positive side, the fraction of values above that, that will be, give us an indication of the likelihood of having found the test statistic that we did or more extreme. So to negative infinity and positive infinity side. So let's do the less the one and we're gonna call that less. So we're gonna sum over, we take our statistic, remember that is now all these values, these 2000 values, dot less than minus the mean difference. Remember we took the absolute value so it's gonna be a positive value. So we want to know dot less than. In other words, we do it element wise. So how many of them, it's gonna return true or false by the way and because true is stored internally as a one, we can sum over all those ones. It'll give us all the values in the set of 2000 that are less than negative 0.22 and we sum over all of those and then we divide by the total number, they were 2000. So if we do that, we see 35.5% of values in the sampling distribution was less than negative our test statistic. And if we look at the more than, we see also about 36%, about 36%. So we can simulate our p-value by just adding those and we see a p-value of 0.72, thereabouts. So it was quite likely to have, under the null hypothesis, it was quite likely to have found this result that we have found. Let's just compare that to a proper test that we're gonna use from hypothesis test, we're gonna use equal variance t-test. So we're just going to ignore the assumptions for the use of parametric tests here. We're just gonna use students t-test and we pass these two values, all the values in group A and all the values in group B and let's have a look. If we run the students t-test, if we hover over there, we see under the null hypothesis, we said that the difference should be zero, the difference that we found was negative 0.22, we took the absolute value of that and then look at that, a two-sided p-value of 0.71, 0.71, very close to, just through simulation for only 2000 values, our sampling distribution there gives us that probability. So let's look at uncertainty in a statistic. So the statistic that we're going to be interested in here is just the mean of all the subjects in our study. Because we collected 2000 subjects, we at random, randomly divided them into two groups and that was what we're simulating here and overall there'll be a mean for all these subjects in our study and we want to infer that statistic, the mean of our variable to the population from which it was taken because we do not know the parameter in the mean, in the population, this mean for the population for this variable, we only know the values for our sample. So we've got to express some uncertainty. We cannot say that the mean in our sample is exactly equal to the mean in the population. There is uncertainty in that because we don't have the whole population and we do that by the method of bootstrap resampling. So again, that's resampling with replacement. So what we're gonna do is the following. You're gonna do it 5,000 times over this time, just a choice. So what we're going to do is we're gonna sample from DFVAR, so all my values from both groups, all the subjects and if I say the length from the sample should also be the length how many they were originally, it sets resampling with replacement. If we didn't do replacement, I'm just going to get back exactly the same set of values that we had before. We don't want that, so every time a value is taken, it's thrown back into the head so it can be chosen again at random. So we're gonna build up this resampled set of values exactly equal to what the original sample size was and we calculate the mean of that and we do that a thousand times and again, we put that all in square brackets because we're using a bit of less comprehension there. And we do that, we now have a thousand possible means through bootstrap resampling. It's gotta be the same size as the original. Let's have a look at a histogram of this and we can see a histogram of all possible means. Now we have to express or decide on a confidence level. And we usually would choose 95% for our confidence level because from that we can calculate 95% confidence intervals and the interval is the lower bound and the upper bound. But here is a distribution of possible means just given our sample and we would suggest that the parameter that is that same mean for the same variable in the population at large will be one of these. We found one, the mean. So we know what the mean is for all of these and we see it way up here, mean was 100.9. We know what the mean is for all our subjects and we want to place these 95% confidence intervals. So somewhere from this side all the way over to that side that we think the true population parameter would fall under. And that becomes very easy if we use the technique of percentiles because if we want a 95% confidence interval that means we need 2.5% on the lower end and 97.5% on the top end. And we know that we have 5,000 values and if we put them all in order in ascending order, all these bootstrap means that we just need to know the index for the value that would represent the 2.5 percentile, 2.5 percentile and the 97.5 percentile. And that's all that we do here is we say 2.5 divided by 100. So right there, 2.5 divided by 100 that gives us 0.25, exactly 0.025, I should say exactly what we want times 5,000 and that's going to give us the index value of these sorted 5,000 values that would represent the 2.5 percentile. And we do the same at the top end, 97.5. So if we do that, we get back the index value. So if we have 5,000 values and they are sorted in ascending order, the one with index 125 would represent 2.5% and the index number 4,875 would represent the 97.5% line. And what we're now going to do is to sort all these means as we can see there and then we take that index, sort all the means and take that index and we're just going to round them to two digits so that we just have these lower bound and upper bound for the 95% confidence interval. So in this last statement, you can just put that in a little print function there but we can see the mean, the overall mean and the 95% confidence interval around that mean. That doesn't mean we are 95% confident that the population parameter will be within these bounds. Remember what it really means is if we could do our study a hundred times over which we can't do, it's expensive, takes time, money, personnel, so usually we can only do a study once but if we could do it a hundred times over, every time we'll have a random new subject so we're going to have new means and we're going to have new confidence intervals that in 95 of them, the population parameter would be within the bounds that are set and in 5% they won't be, so we don't know which one we have here but we can still express this as our 95% confidence intervals. So that's it, how we can use resampling, both to estimate the probability of having found a test statistic that we did for a study and also how to express uncertainty in one of our statistics and we don't necessarily just have to have the 95% confidence intervals in the mean but also for median or for standard deviation or for variance, we can express through bootstrap resampling that means resampling with replacement and of the same size as the original sample we can express this uncertainty in the statistic.