In this lecture, we're going to talk about the very important topic of confidence intervals. You see them all over the show in published research. Let's start having a look. As usual, I'm just going to import my style sheet. We'll run that, just to make things look pretty.

Let's look at what we're going to import in setting up our Python environment. I'm going to use pandas, which we know well by now. Then there's SciPy, remember that's scientific Python, and its submodule stats. From that, I'm going to import bayes_mvs, a function there. This is another new one: import scikits.bootstrap. There are a lot of scikits modules, scientific kits. Bootstrap is the submodule we're interested in, and we're going to import it under the abbreviation bs. We're going to import matplotlib.pyplot as plt, which we know well, and import seaborn as sns. And as per usual, from warnings we're going to import filterwarnings so that we can run our command there and ignore the warnings. That's just my standard. You need not do that. You'll see here for sns, for seaborn, I'm just setting the styles as well. And you can really look at this website here. There we go, beautiful tutorials on the use of seaborn and how to style your plots or your figures. Let's just run that to make sure that all of those things are imported. Our Python environment is now set up properly.

We're going to import again our mock data, which is a CSV spreadsheet file. I'm going to read it with pd.read_csv, so the pandas module, the read_csv function. That's going to read the file and put it inside this computer variable data, making data a pandas data frame. Just to be sure that things imported correctly, I'm going to say data.head, the head method there, and call up the first three rows. There's our first three rows, no problems there. So let's start discussing confidence intervals.
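The loading step above can be sketched as follows. This is only a minimal illustration: the real file name is not given in the lecture, so the CSV text, the column names (Age, HISTO, RVD, WCC) and the values below are made up, and the text is read from memory rather than from disk.

```python
from io import StringIO

import pandas as pd

# Hypothetical stand-in for the mock CSV file; columns and values are invented.
csv_text = """Age,HISTO,RVD,WCC
34,Yes,No,13.2
27,No,No,9.1
45,Yes,Yes,17.5
19,Yes,No,14.8
52,No,Yes,10.3
"""

# read_csv accepts a file path or any file-like object and returns a DataFrame.
data = pd.read_csv(StringIO(csv_text))

# head(3) returns the first three rows, a quick check that the import worked.
first_rows = data.head(3)
print(first_rows)
```

With a real file you would pass its path to pd.read_csv instead of the StringIO object.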
What is this confidence interval all about? Well, what we're dealing with in statistics, specifically inferential statistics, is that we cannot analyze the whole population. So we're going to take some sample data from a population, whether that be in the lab or actual human beings. We're going to do analysis on some variables. And based on the results that we get, we're going to infer those results to the general population out there. Now, because we're just dealing with one sample, we can't be very sure, for whatever statistic we calculate for that variable, what the corresponding parameter would look like in the population out there. Therefore, we construct confidence intervals, and 95% is the usual one. So for instance, we're dealing with a mean. We can have a sample group. We can calculate the mean for a variable. And then below and above that mean, we can calculate two values, which will be the lower bound and the upper bound of our interval. Now, I say 95%, as I said, that is the norm in healthcare research, but you can do 80%, 90%, 99%.

Now, quite commonly you come across the following: we are 95% confident that the population mean value for variable so and so is between this and that. Now, that is unfortunately incorrect. Things are a bit more subtle than that when it comes to confidence intervals. What a confidence interval means is this. Suppose you were to repeatedly take independent samples from the very same population, and you calculate the confidence interval for each sample, each time you do this. Let's make it the 95% confidence interval. If you could do this many, many, many times, then in 95% of cases the interval that you calculated would contain the true population mean. So 95% of those ranges will have the true population parameter somewhere between their lower and upper bounds.
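The repeated-sampling idea above can be checked with a small simulation. This is a sketch under stated assumptions, not part of the lecture's notebook: the population is taken to be normal with a made-up mean of 14 and standard deviation of 4, and each interval is the usual t-based interval for a mean.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)

# Assumed (invented) population parameters and study design.
true_mean, true_sd = 14.0, 4.0
n, n_repeats = 30, 5000

# Critical t value for a two-sided 95% interval with n - 1 degrees of freedom.
t_crit = stats.t.ppf(0.975, df=n - 1)

# Each row is one independent sample of size n from the same population.
samples = rng.normal(true_mean, true_sd, size=(n_repeats, n))
means = samples.mean(axis=1)
sems = samples.std(axis=1, ddof=1) / np.sqrt(n)

lower = means - t_crit * sems
upper = means + t_crit * sems

# Fraction of the intervals that contain the true population mean.
covered = (lower <= true_mean) & (true_mean <= upper)
coverage = covered.mean()
print(f"Share of intervals containing the true mean: {coverage:.3f}")
```

The share should come out close to 0.95. Any single interval either contains the true mean or it does not; the 95% describes the procedure over many repeats.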
So it is not about just your interval, the lower and the upper value of the 95% confidence interval that you calculate. It does not mean that the true population mean is between those and you're 95% sure of that happening. No, I reiterate: if you were to repeat taking independent samples, calculating the mean, calculating the confidence interval, and you do this a million times, then in 95% of the cases the true population mean will be inside the confidence interval range that you've calculated.

So let's just get some examples done, now that you understand what the confidence interval really means. From our mock data set, let's extract only the patients that actually had acute appendicitis. Some of the patients were operated on and their appendixes were found to be normal on histological evaluation. So what I'm going to do is take my data frame, data, and from that data frame look at the HISTO column. So remember that's how you have to do it. You have to call the data frame, square brackets, and then to get the column you have to write the whole thing out again, data and the column with its name there in quotation marks, equals equals yes. I'm asking a question: look at that column, and because it's going to run through every row, does that column have a yes in that row? If so, if my Boolean value is true, then please include that row in this new data frame that I'm making. I'm calling it acute_app, you can call it whatever you want. So I've got this new data frame that contains only those people who actually had appendicitis. From this new data frame of mine, I want to look in the RVD column, and I want to extract all the nos and all the yeses. So I'm making two new data frames, an RVD negative and an RVD positive.
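The filtering just described can be sketched like this. The column names HISTO and RVD come from the lecture; the column name WCC, the exact spelling of the Yes/No values, the variable names rvd_neg and rvd_pos, and all the numbers are assumptions made for illustration.

```python
import pandas as pd

# Small invented data set standing in for the mock data.
data = pd.DataFrame({
    "HISTO": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "RVD":   ["No", "No", "Yes", "No", "Yes", "Yes"],
    "WCC":   [13.2, 9.1, 17.5, 14.8, 10.3, 16.1],
})

# data["HISTO"] == "Yes" is a Boolean Series, one True/False per row.
# Indexing the data frame with it keeps only the rows where it is True.
acute_app = data[data["HISTO"] == "Yes"]

# Split the appendicitis patients by their RVD status.
rvd_neg = acute_app[acute_app["RVD"] == "No"]
rvd_pos = acute_app[acute_app["RVD"] == "Yes"]

# The groupby alternative: one mean per RVD group without new data frames.
group_means = acute_app.groupby("RVD")["WCC"].mean()

print(len(acute_app), len(rvd_neg), len(rvd_pos))
print(group_means)
```

Separate data frames are convenient when you want to pass one group at a time to another function; groupby is shorter when you only need a summary per group.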
Out of this new data frame of mine, the one that only contains the appendicitis patients, I want to look at the RVD column, and take only the rows that contain no and only the rows that contain yes. So I've got two new data frames here that I want to play with. Let's just look at the first five rows for the RVD negative. First of all, you'll notice that in the HISTO column, it'll only be yes patients. And then in the RVD column there's only no. If we look at the positive, again, the HISTO will just be yes, but the RVD column will contain only yeses as well. So this is one way to do it. I've constructed new data frames. As I said, the other way to do it is by the groupby method, which is more powerful, but I want to show you how to think about creating new data frames and just extracting the rows that you are interested in. Please take cognizance of exactly how the syntax works here for extracting or making new data frames.

Let's just describe our data. From that RVD negative data frame of ours, let's describe the white cell count. We see there were 78 patients. The mean white cell count was 14, and we see the standard deviation there, and we can do the same for the RVD positive patients. There we see there were 40, and they actually had a higher admission white cell count.

Let's now construct 95% confidence intervals, and this is done with the function bayes_mvs. Remember, we imported that from scipy.stats. Because we imported bayes_mvs directly, we can use it directly. We don't have to say scipy.stats.bayes_mvs. What do I want the 95% confidence intervals of? Well, the RVD negative data frame, its white cell count column. Now you see this method, dot dropna, open-close parentheses. dropna just means ignore all rows where the value is missing, where it is NaN, not a number. So if it finds a missing value in that row for this column, it will just ignore it.
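Here is a minimal sketch of what describe and dropna do on a column with missing entries. The Series name WCC and the values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Invented white cell counts; np.nan marks entries that are missing.
wcc = pd.Series([12.1, np.nan, 15.4, 9.8, np.nan, 17.0], name="WCC")

# dropna returns a new Series without the missing (NaN) entries.
clean = wcc.dropna()

# describe reports count, mean, std, min, quartiles and max.
summary = clean.describe()

print(len(wcc), len(clean))
print(summary[["count", "mean", "std"]])
```

Note that dropna only removes values pandas already treats as missing. A stray text entry in a column would first have to be converted to a missing value before dropna could remove it.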
It's a very nice thing to use because if you were to have thousands and thousands of rows, you can't run through all of them by eye and see where data is missing from the data set. You just want to ignore those rows, otherwise you're going to have problems. So this bayes_mvs takes two arguments. There's my first argument: look in this data frame, this column, and please ignore all missing values. And what do I want? 0.95, the 95% confidence interval.

Now this bayes_mvs is going to give you nine values. Let's run it now and I'll take you through what you get. You get three rows of three values each. This first row is the mean. First of all, you're going to get the mean, 14.044, which you can also see in the describe output up there, 14.044. That was quite correct. And then it gives the lower and upper bounds of the 95% confidence interval, 13.06 to 15.02. Then it gives you the variance and its 95% confidence interval bounds, and then it gives the standard deviation there and its 95% confidence interval. So you can get the confidence intervals for the mean, the variance, and the standard deviation. Let's look at the 95% confidence intervals for the RVD positive patients. You'll see for the mean they're 14.19 to 17.35, and again we get mean, variance, and standard deviation.

Now let's look at something else. Let's look at the 80% confidence interval. Now do you think the values will be closer together? We've taken the positive patients, so we had 14.19 and 17.35. Is that range going to be smaller or larger? Well, think about it. Think about area under the curve. Think about a bell-shaped curve. If I only want to include 80% of the area under the curve, spread over both sides of the midline at the mean, then of course 80% is smaller than 95%, so those two cutoffs are going to sit closer to the midline. So our range is going to be slightly smaller, and indeed we've gone from 14.19 to 14.75 and from 17.35 to 16.79.
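The bayes_mvs call and the comparison of the 95% and 80% levels can be sketched as below. The data are randomly generated stand-ins, not the lecture's patients. As a side note, SciPy's documentation describes the intervals from bayes_mvs as Bayesian confidence intervals; the lecture treats them as ordinary confidence intervals.

```python
import numpy as np
from scipy.stats import bayes_mvs

rng = np.random.default_rng(0)

# Invented white cell counts for 40 patients (not the lecture's data).
wcc = rng.normal(loc=15.0, scale=4.0, size=40)

# bayes_mvs returns three results: mean, variance and standard deviation.
# Each result has a .statistic (the estimate) and a .minmax (lower, upper).
mean_95, var_95, std_95 = bayes_mvs(wcc, 0.95)
mean_80, var_80, std_80 = bayes_mvs(wcc, 0.80)

low_95, high_95 = mean_95.minmax
low_80, high_80 = mean_80.minmax

width_95 = high_95 - low_95
width_80 = high_80 - low_80

print(f"mean estimate: {mean_95.statistic:.2f}")
print(f"95% interval: {low_95:.2f} to {high_95:.2f} (width {width_95:.2f})")
print(f"80% interval: {low_80:.2f} to {high_80:.2f} (width {width_80:.2f})")
```

The 80% interval sits inside the 95% interval: lowering the level pulls both bounds toward the mean.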
So if your confidence level goes down, your range becomes narrower. Once again, just to repeat the explanation: we've got a certain sample set of patients in this study. If I were to do this experiment over and over and over again, independently taking new samples from the same population, and I were to calculate confidence intervals again and again and again, let's say at 95%, I could then say that out of the million times that I repeated this, 95% of those repeats will have the true population white cell count inside of their intervals. It is not that I'm 95% sure that the population white cell count out there for appendicitis patients who are RVD negative is between those two values. That is not what the confidence interval says.

Lastly, I want to introduce you to what I think is a very nice way to do confidence intervals, and that's by bootstrapping. Now think about bootstrapping this way. There are billions of people on the face of the planet. We just took a sample from them and we're dealing with that little sample. What we actually wanted to do is to measure all of them. Fortunately, the central limit theorem says that if I take samples enough times, the means of those samples will be approximately bell shaped, and I can do my calculations on that. But the central limit theorem has a little bit of a suggestion in it there: imagine I could do this many, many times. Well, I can't. What I can do is play with the data that I do have. I do have a set of data.

So what does the bootstrap method do? Imagine you have 30 patients or samples or whatever, so your sample size is 30. And imagine you have one variable and you write down the value for that variable for all 30 cases, each on a small piece of paper. So you're going to have 30 small pieces of paper, with one value on each. So you have all the values and you throw them all in a bucket.
And what you're going to do now, because you have 30, is draw 30 values from that bucket of 30. Of course, if you draw them one by one without putting them back, you're just going to end up with the same 30 values all over again. So what you do, very cleverly, is you draw one, you write down the value that you got on a separate piece of paper, and you throw it back. Mix it up again and draw again. So there is a chance that you'll redraw that same value. And most definitely, if you do this 30 times, because you've got to stick to how many you originally had, you are going to redraw some of them, but that is a random sample. And you could do that a couple of thousand times, calculating the mean each time, and from all of those means you can construct a 95% confidence interval, because now you have many, many, many sample sets.

Now in the days before computers that was very difficult, but now look at that line of code. Remember bs was my abbreviation for scikits.bootstrap, and I'm using its ci function, ci for confidence interval. I'm going to use the RVD positive data frame, the white cell count column. I'm going to use the dropna, open-close parentheses, method there, so I just ignore the missing values, and now I'm going to say comma alpha equals 0.05. So this one doesn't take the straightforward 0.95, it takes the alpha value. The alpha value is 1 minus the confidence level, or 100% minus the confidence level, so it's 1 minus 0.95, which is 0.05. The third argument that it takes is n_samples equals, and I can now say 10,000, do this 10,000 times, or do this 20,000 times.

This is going to happen at random, so every time you run this you're going to get slightly different values. So let's run it once and we see 14.2 to 17.27. Let's just run it again. You'll see I'm going to get slightly different values, 14.21 to 17.25. The upper one was very nearly the same. These are also going to differ ever so slightly from the ones that we got up here, because this is a calculation that resamples 10,000 times, and it's done in an instant.
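The bucket-and-paper procedure above can be sketched by hand with NumPy. This is a simple percentile bootstrap on invented data, offered as an illustration of the idea rather than a reproduction of what scikits.bootstrap does internally, which may use a different interval method and so give slightly different bounds.

```python
import numpy as np

rng = np.random.default_rng(1)

# Invented sample of 30 white cell counts (the 30 pieces of paper).
wcc = rng.normal(loc=15.0, scale=4.0, size=30)

n_resamples = 10_000
alpha = 0.05  # 1 minus the 0.95 confidence level

# Each row holds 30 random positions drawn WITH replacement, so the same
# value can be picked more than once within one resample.
idx = rng.integers(0, len(wcc), size=(n_resamples, len(wcc)))

# Mean of each resample: one bootstrap mean per row.
boot_means = wcc[idx].mean(axis=1)

# Percentile interval: cut off alpha/2 of the bootstrap means on each side.
lower, upper = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

print(f"sample mean: {wcc.mean():.2f}")
print(f"95% bootstrap interval: {lower:.2f} to {upper:.2f}")
```

Fixing the seed, as done here, makes the result repeatable. Without a fixed seed the bounds change slightly from run to run, as in the lecture.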
I hope you now have a good understanding of what confidence intervals are, and of how easy it is to take your sample data set, make it into a data frame, create some new data frames containing only the data that you want, and calculate your confidence intervals by either of these two methods. You write a few lines of code and hit ctrl enter, or shift enter, or shift return. Easy to do.