 So we've learned so much by now, we can really explain where our results fit in in a bigger picture. What we're now going to do is express uncertainty in our results and in data science and any kind of science, uncertainty abounds. It's just part of the nature of the scientific process. Caleb notebook number 10, uncertainty. So when we make measurements, we can never be precise. Think of just measuring how tall you are. There's some uncertainty in that measurement when data is captured. When we generate this random variable, remember that's a function that generates a value from a measurement. There can be some error. There can be human error just entering the data and reading the data. There's uncertainties really everywhere. And most of all, we only take a sample from a population. And by the law of averages, we will approximate the results of the population if our sample size is large. But usually our sample size is a fraction, a tiny fraction of the population. So we do not approach the true picture in the population. So we've got to somehow express that uncertainty in our results. And we're going to do that in this notebook through a method called bootstrapping. And that'll help us to determine confidence intervals. And I'm sure you've heard of the term confidence intervals. So a quick look at the packages that we're going to use in this notebook. Of course, NumPy, SciPy, and Pandas. And then from Pandas, we're just going to use the data frame function. And then we're going to import the stats module from SciPy. And then when it comes to Plotly, you see I'm just importing. Most of the modules there in Plotly, no problems there. And just for a bit of fun, we're going to set a different theme today. So pio.templates.default. We're going to set that to ggplot2. Why not? That is a plotting library in the R language. So bootstrapping is this technique of multiple resampling for our given sample. I just want to pause a little bit. That's not what we did in the previous notebooks. There we just reassigned subjects to different groups. Because under our null hypothesis, we had equal means or whatever our statistic was, we said that that was equal. And so we didn't care in which one of the two, which one of the groups there were. Bootstrapping is something different. We're going to continuously resample also only from our samples. But we're not going to, we're going to do that with replacement. So that's a bit different. So what I want to do here, I want to generate a population. So let's set the random.seed method there. Let's pass an integer of 10. I'm going to create this computer variable called population. And to that, I'm going to assign an umpy array. And that array comes from the norm.rbs function inside of the stats module. Remember of scipy. So it's stats.norm.rbs. And so it's going to take values from a normal distribution with a mean of 100. And a standard deviation of 10. And I want 20,000 subjects in my population. So that subject, that the population can be anything. So I'm just trying to be abstract and generic. Now we have our population. In other words, if we calculate a point estimate or a measure of dispersion, that is going to be a parameter of the population. So let's look at the mean. The mean parameter in the population is 99.99. And the standard deviation of this variable in our population is 9.98. So we know that. So we have knowledge of that at this time. So for our study, we randomly select 15 individuals from the population. And we're going to set the replace equals false. We don't want the subject, a certain subject to appear in our dataset twice. So we're going out to do a study. We only have the resources, the money, the permissions from the Research Ethics Committee, whatever the case might be. We can only get data on 50 individuals. So we go out and in an unbiased form, we're going to select 50 individuals or subjects from our population. And so we're going to use the numpy.random.choice function. So we choose from the population 50. And we set replacement equals false. And of course, we're going to see that pseudo-random number generator so that we all get the same values. Now let's look at our sample. Now what I want you to imagine here, we don't know the population. Of course, we know this on the back end here, but imagine that we don't know. So all we know is the data of our 50 subjects. And we see our mean is 97.8. And our standard deviation for this continuous numerical variable in our sample is 10.84. Now obviously, these statistics, remember this is a sample. So a point estimate or measure of dispersion in a sample is called a statistic, not a parameter. So the statistics are different from the parameters. And we somehow want to tell the world that if you work with this population, if you work with individuals from this population, this is what you should expect based on our results, referring our results to the rest of the population. But how certain are we about this 97.8 as far as the mean is concerned or 10.8 as far as the standard deviation is concerned? Or you can think of other point estimates or measures of dispersion. We need to understand the uncertainty in this result. So what we're going to do now is this process of bootstrapping, the bootstrap resampling. And what that means is we're going to create new units of 50 subjects, but we're going to select them at random with replacement. That means any subject in our 50 from our sample can appear two times, three times, four times, five times in this new batch of 50. And we're going to do this over and over and over again. It's important that we still have 50 in each little new resample of ours. But I think it's very important for you to understand that some subjects can appear twice or three or four times now. And necessarily they will because if they don't, we just have the original sample again. And we don't want that original sample again. We're just resampling from our 50 over and over again. So look at this. Random.choice. So the choice function there from my sample 50 replace equals true. So if a subject is chosen, that subject is back into the hat and that subject can be chosen again at random to make up this 50. And I'm going to do that a thousand times. And what I'm going to do is calculate the mean. So I'm going to have this list of 1000 means through bootstrap resampling. And I'm going to assign that to the computer variable means. So now let's look at a distribution of these possible means. And I'm using the create underscore this plot function from the figure factory module and plotly. And you can see all the possible means through bootstrap resampling. Same sample size, but I'm doing it with replacement every time. And I've done it a thousand times. And you can see our little kernel density estimate there. Beautiful nice curve there. And you can see the histogram and a rug plot at the bottom of all the values. And all we have to do now is to determine we've got many, many, many, many here. Let's see, we've got 1000 values here. And what we're going to do is we're going to place them in order. So from the smallest mean to the largest mean, and we're going to determine some of the percentiles. So imagine we considering the 95% in the middle of all the values. So we're going to cut it off somewhere here. And you're going to cut it off somewhere here so that they're 2.5% at the bottom and 2.5% at the top and the 95% in the middle. So that means I need to know the 2.5% percentile value and the 97.5% percentile value. Now, it's not as easy as it sounds. There's a little formula attached to this little equation because we're obviously going to have some values that are repeated. And we're dealing with a continuous numerical variable here. So it can be slightly different. So we have this little approach that we do. So it is one to four. Let's read that sort the collection in ascending order or my means from my bootstrap re-samples. I'm going to calculate this value K for which is 2.5% of N. So 2.5%, that's 2.5 over 100 times N, my sample size. And if K is a whole number, then we're interested in the Kth value in that order and that's going to represent the value of all of those possible means that is at the 2.5% percentile. And if K is not round number, a whole number, we're just going to round it up to the nearest whole number. And that gives me K. And then that's at the bottom bit. Of course, we've got to do that at the top bit as well. So that's 97.5. So that's what we do here. And because we've got 1,000, because we have 1,000, the calculation is very easy. So I'm going to call it K underscore 2 underscore 5 or K 2.5%. And that's the value is 2.5. So our K here is 25. And at 97.5, our K is going to be 975. So if I have all my means in order, the 25th and the 975th one, that's going to represent the 2.5% percentile value and the 97.5% value. So let's sort these means. So I'm going to create this sorted underscore means and that's np.sort. Use the sort function from NumPy and that's going to sort them for me. And by default, that's going to be ascending. And now I want the 25th value and the 975th value. But remember, Python is zero indexed. So I actually want the 24th index and the 974th index. So let's do that. So the sorted means the 24th index. That's the 25th one is 94.4. And when we look at the top one, that's going to be 100.9. So these are the two means of all the possible means that we did from the bootstrap resampling. All the possible means that lie at the 2.5% percentile and at the 97.5% percentile. So let's plot all of this with the create underscore doesplot. And this is what it looks like. There's all our possible means, a thousand means through resampling. There's the population mean. Remember, we know what the population was. So there it is. And these two yellow lines, of course, are these values that would represent 95% of all our resampled means here in the middle. Right there in the middle. And look at that. The real population mean which we do not know. Of course we know this here because this is a contrived example. But the real population mean are within those two bounds. So we had a mean for our study. We were not sure, you know, how accurate that was a reflection of the actual population mean. And so we have got to express this uncertainty in our result. And what we do is we do this bootstrap resampling and we decide on what we want the middle 95%. So we get these percentile values and all these bootstrap means and we put these bounds of uncertainty there. And we say, well, we think that the true population mean is somewhere between these two. Beautiful. Now that's a choice of 95%. Why do we choose 95%? We perhaps want the middle 80%. So that's going to narrow our bounds a little bit. So let's have a look if we do the same calculation and we do an 80%. So now these lines now represent 80%. And the actual population mean is now outside of our little estimate here. Our lower and upper bounds. And that's a choice we make. And these things are called confidence levels. We decide on a confidence level. So in our first example, we had a 95% confidence levels and here we have 80% confidence levels. Our confidence has dropped to 80% by only representing or selecting the middle 80% of all these bootstrap possible means that we calculated only from our sample. And the actual values for that confidence interval for those confidence levels, I should say, we call the confidence intervals. So the lower bound and the upper bound are our confidence intervals. So let's have a look at that again. So there was our sample mean. We only had 50 subjects to work with. We did this bootstrap resampling and we calculated our lower bound for a 95% confidence level. We calculated our upper bound for a 95% confidence interval. And this is what we report. Read the sentence. When expressing the uncertainty in our sample statistic, we would then state the variable mean in the sample was 97.8. And then 95% confidence intervals, 94.4 to 100.9. We set these bounds and we say, this was what we found. But we think that the true population parameter was somewhere between these two bounds. And you can see with 95% that's quite broad. We push our shoulders out to the side. If we become less and less confident, those bounds shrink a little bit. And so what does this all mean? Does it really mean that we are 95% confidence that the population parameter is within these bounds? Unfortunately not. It sort of makes sense to say that, but that's not quite the truth. If you think about what we did here, the actual truth was if we could do our study 100 times over. Now I'm not talking the resampling here, I'm talking the actual study. We go out, we select 50 individuals and we do our study. And next week we go out and we select 50 and we redo our study. Impossible to do, but imagine we could and we did that 100 times. Now every time, most likely every time, we're going to get a slightly different sample mean in reality. And we're going to do bootstrap resampling and we're going to get slightly different values for the low and upper bounds. And all we can say that if we do this 100 times, if we could do our study over 100 times, in 95% of those, if we choose a confidence level of 95%, in 95% of those we will capture the population parameter within our bounds. And in five cases we will not. Which one we have here we don't know. We only have this one, we only do the study once. It's likely that it is, I mean it is 95 to 5. But we're not sure whether this one that we have here is one of the 95 or one of the 5. We can only set these confidence bounds and confidence intervals based on our confidence level for this study that we have. And that is how we express uncertainty in our results by confidence intervals. And we did that here for the mean. But you can imagine any other statistic that we work with, you can do the exact same thing with. And you can express the confidence in your result. And in the next notebook that we're going to talk about correlation and linear regression and modelling, we're going to talk about the uncertainty there as well. How certain are we in these calculations in these statistics that we find. Because we only have a sample, we do not have access to the population. In this instance we did because I wanted to show you when it fell in and when it fell out. But in reality remember we never know the whole population, hardly ever know the whole population. So we have to express some uncertainty in our results. Now instead of doing all of that, doing the resampling of course there's the mathematical way. So in SciPy we can calculate the confidence intervals. So in the stats module if we use the t-distribution, which we will often use, there are some assumptions that must be met but let's just use it and we have the interval function. We can set our alpha value 0.95. So our alpha value, remember we actually mean that to be 0.5. So this is slightly different but that's the keyword argument that is used here. The degrees of freedom is very important. Here we only have one sample so it's the sample size minus 1. And our LOC there is the mean of our sample. So that means the mean of our sample is the mean of our sample and the scale at standard deviation is the standard deviation. And if we do that there's going to be this calculation and we get our confidence intervals for a 95% confidence level. So we need to do resampling every time. I just want to show you that this exists but it's much more intuitive and you have a much deeper understanding of what is going on here when we do the bootstrap resampling and that's what I do quite often. So let's just have an example of the median because this is going to work out slightly different. So what we're going to do below is we're going to have 200, the sample size of 200. And this time around though we're not going to take it from a normal distribution. We're going to take it from a chi-square distribution. So what we're imagining here is that we have a sample of 200 so we don't even know the population here. I'm just simulating some real-world example here. We have 200 subjects in our dataset. And what we don't know beforehand is that we're taking it from a chi-square distribution but we have to simulate these subjects so we're going to use that. And by the way the parameter for chi-square distribution is degrees of freedom as well and we set that to be 10. So let's have a look at the median for our variable in our sample of 200, 9.38. Let's visualize that with a boxplot and we look at that and we see, well that doesn't, obviously it doesn't look like a normal distribution. There's quite a bit of outliers there and as far as away from the median there's quite a lot of values. So the median there is 9 and the minimum is 2 but the maximum goes up way more up to 26. Definitely not a normal distribution there. So we have our population. We can see it's not normally distributed, I should say our sample. Let's be careful here. We have our sample. We can see the data is not normally distributed. So instead of doing the mean we're going to do the median and we have here median of 9.38 as we saw but we have to express our concerns here, our uncertainty in what the true population median would be. So let's go and do that. So once again I'm going to calculate a bunch of medians doing list comprehension. What am I doing? I'm doing random.choice from my var and it's got to be 200 at a time. It's got to be the same sample size but replacement equals true. So once again some of my subjects are going to appear and each one of my 10,000 re-samples is going to appear more than once. And I'm going to assign that to the computer variable medians and let's have a look at the distribution of the medians. Now you're going to see it's a bit skew. We don't have that nice little distribution because our samples, it really wasn't a nice normal distribution from our samples. But we're going to do exactly the same thing. So we're going to determine these two values 4k at the 2.5% percentile and 97.5% percentile because we're going to stick to a 95% confidence level that we're interested in. Remember those will be then different if we choose 80% for instance. So I've got my two values there. Let's have a look at what they are from my little formula. So I want the 250th value of my 10,000 re-sampled medians and I want the value an index 9750, Python is zero index. So of course we want one less than all of those. So we're going to sort them all in order and then we're going to get the 249th indexed value and the 9749th valued one. So those are going to be our confidence levels. So the lower bound is 8.52 and the upper bound is 9.97 and looks like. So let's plot this. There we go together with the median that we found for our study. And this is what it looks like. And you can see it's no longer symmetric. The bounds are, you know, it's not symmetric around the median that we have. And that is what we had before. So there's a bigger gap on this side than on that side. But we're still going to express this in exactly the same way. We're going to say according to our study we found a median of, what was that? Let's scroll back up. We found a median of, we can just go to the plot there, 9.38, 95% confidence intervals. And there we go, 8.522 to 9.97. Exactly the same thing. So I think you should now have a very good understanding of expressing the uncertainty of the results that you found in your study.