The central limit theorem. Now, this is a very important theorem: it is what all our statistical analysis, at least our inferential statistical analysis, is built around. You don't have to have a deep understanding of it, just an appreciation of it. Sit back and watch; it's quite exciting stuff. As usual, I'm just setting up my CSS style sheet and the environment that I want to run in. I'm going to import numerical Python (numpy) again as np. From the math library I'm going to import factorial; that's the one extra piece of code that I want to use. Again, matplotlib and seaborn, you know them by now, and from warnings, filterwarnings. And I'm going to do my matplotlib inline and my filterwarnings ignore. This is something new, though: sns.set_style. This is what sns is all about. To some extent you can do this with matplotlib as well, but with seaborn you can actually choose from various styles. So if you Google seaborn, have a look at setting the style, setting the context, changing the font scale and the default figure sizes. You can do a lot just by setting some defaults initially, and you can also pass those defaults to sns as arguments. I'm not going to go into it; this is not a course on seaborn. Google it, play around with seaborn; you can draw some fantastic plots. So, the central limit theorem. What is it all about? You'll remember from the previous lecture, looking at probability and area, I said that this curve is always normally distributed. But you're going to gather some data, you draw a plot of the data, and you say: well, my data is very skew, and the variable I'm comparing between my two groups is very skewed. What is happening here? How can I draw this nice, beautiful graph that is symmetrical and bell-shaped and get an area from it, when my data is skew? How is this going to work? Well, the central limit theorem to the rescue. Very simply stated, think about this. Take a very large population.
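As a rough sketch of the setup cell described above (the exact style and context names are assumptions on my part; seaborn offers several built-in options):

```python
# Hypothetical notebook setup cell, assuming numpy, matplotlib and seaborn
# are installed; the specific style/context values here are illustrative.
import numpy as np
from math import factorial          # the one extra function used later
import matplotlib.pyplot as plt
import seaborn as sns
import warnings

warnings.filterwarnings("ignore")   # silence warning output in the notebook
sns.set_style("whitegrid")          # one of seaborn's built-in styles
sns.set_context("notebook", font_scale=1.2)  # scale fonts for readability
# In a Jupyter/IPython notebook you would also run: %matplotlib inline
```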
How many people are on the face of this planet? In excess of six billion. So, large. Wherever you live, there's a large population. Let each member of that population have a random value for a certain variable. Let's take age: everyone is a certain number of months or years old, and as far as we're concerned that's a completely random value. Now, I'm going to repetitively take a random sample from that population. Six billion is a lot, so let's make it much smaller: say there are only 20,000 people in the little town that you live in. It doesn't matter. Let's take a sample of 30. At random, I pick 30 people and I ask each one: what's your age? What's your age? What's your age? I take the mean age of those 30 and I put it in my pocket. The mean age of my first 30 out of the 20,000 people in my little village, I've got that in my pocket. I now chase them out the door and I randomly pick 30 people again, which might include some of the ones I've had before, or might not. There are 20,000 people there and I'm just taking 30. Again: what's your age? What's your age? What's your age? I calculate the mean of that little group, and I've got another mean in my pocket. It's like rolling the dice: every time I roll, I add the two up, and that's another value in my pocket. And I keep on doing this for a long time. Take 30 people, get their ages, calculate the average, put that average in my pocket. If I were now to draw a histogram of all those sample means, the central limit theorem guarantees that it will be normally distributed. Even if the initial data set, the ages of all 20,000 people, is not normally distributed at all, a histogram of the sample means will be normally distributed.
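This thought experiment is easy to act out in code. The sketch below is not from the lecture: I invent a deliberately skewed "age" population (an exponential distribution is just an assumption to get skewness) of 20,000 people, repeatedly draw random samples of 30, and pocket each sample mean. The pile of pocketed means clusters symmetrically around the true population mean even though the raw ages are skew:

```python
# Illustrative sketch (not the lecture's code): sample means from a
# skewed population still centre on the population mean.
import numpy as np

rng = np.random.default_rng(0)
# A skewed "age" distribution for a town of 20,000 (assumed shape).
population = rng.exponential(scale=20.0, size=20_000)

# Repeatedly pick 30 people at random and pocket the mean of their ages.
sample_means = [rng.choice(population, size=30, replace=False).mean()
                for _ in range(5_000)]

# The raw ages are heavily skewed, but a histogram of sample_means
# would be close to bell-shaped, centred near population.mean().
```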
Now think about that. In any trial you do, you've got some subjects in one group and some in another. I'm going to say people all the time, but it can be anything else: laboratory results, it doesn't matter. And you compare the difference in means between the two groups. That difference in means is but one of many possible differences in means you could have obtained if you repetitively ran your trial. So yours is just one of countless others. And if I could do all those countless others, I would see that the means, or the differences in means, are normally distributed. I'm going to show that to you in a very visible way. To do that, though, we first need to understand something called combinations. Suppose you have five patients. To keep them private, we name them A, B, C, D and E. Let me ask you a question: how many distinct combinations can you make choosing only a pair of two patients? For example, I can choose patients A and C. If I choose patients C and A, though, that's the same combination; you can't count it twice. It's not like rolling two dice, where a one and a two is a different outcome from a two and a one. Combinations means that taking patient A and patient C is the same as taking patient C and patient A. Let's have a look. I can draw patients A and B, A and C, A and D, A and E, then B and C, B and D, B and E, C and D, C and E, and D and E. That's all of them: ten. So imagine my whole town consisted of only five people and I drew a random sample of two, took their ages and calculated a mean. From those five people I can get ten different means by taking two people at random every time. There are five people, my experiment only calls for a sample size of two, and yet I could have taken ten different combinations. Now, there's a mathematical equation for this.
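Listing the ten pairs by hand, as above, can be checked with Python's standard library. `itertools.combinations` treats (A, C) and (C, A) as the same selection, exactly the property described in the text:

```python
# Enumerate every unordered pair of the five patients A..E.
from itertools import combinations

patients = ["A", "B", "C", "D", "E"]
pairs = list(combinations(patients, 2))

# Ten distinct pairs; order within a pair does not matter,
# so ('A', 'C') appears but ('C', 'A') does not.
print(len(pairs))   # 10
```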
I promised you no equations, but there's a little one; I'm sneaking it in. You don't have to remember it, it's here for illustrative purposes. It says: n factorial divided by r factorial times n minus r factorial, where that exclamation mark is pronounced factorial. What is a factorial again? Well, the factorial of, say, four: you just start with four and you work backwards. So it's four times three times two times one. Four times three is 12, 12 times two is 24, 24 times one is 24. So factorial of four works out to 24. You just start wherever you are and you come backwards. It actually comes from series expansion, so the factorial of zero is one. If you don't know mathematics that well, that might look a bit funny, but trust me, the factorial of zero is one. No matter. So n is the number of patients you have; I have five patients. I want to choose two, so r is two. Let me just show you: n equals five, a computer variable, a bucket called n, and I'm putting the integer value five into it; r gets two. And I'm going to say combinations is factorial of n divided by factorial of r times factorial of n minus r. Remember, factorial is a function I imported from the math library up there. You're never going to do this on your own; I'm just showing you for illustrative purposes. Then the print command: I'm going to concatenate a whole string together. It prints: the number of possible combinations choosing r, which is two, patients from a possible n, which is five, is, and then does the little calculation. And look at that, I didn't lie to you. The number of possible combinations choosing two patients from a possible five is 10. Let's ramp things up. Let's put 10 patients in there and choose three. Boom. Look at that, things are escalating: choosing three patients from a possible 10 gives 120. Now think about this for a minute.
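The calculation narrated above can be sketched in a few lines. Wrapping it in a small helper function is my addition (the lecture writes it out inline with variables n and r), but the formula n! / (r! × (n − r)!) is exactly the one described:

```python
# Combinations formula n! / (r! * (n - r)!), as narrated in the lecture.
from math import factorial

def n_choose_r(n, r):
    # Integer division is exact here: the formula always yields a whole number.
    return factorial(n) // (factorial(r) * factorial(n - r))

print("The number of possible combinations choosing", 2,
      "patients from a possible", 5, "is", n_choose_r(5, 2))   # 10
print(n_choose_r(10, 3))   # 120
print(n_choose_r(20, 4))   # 4845
```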
Back to the thought experiment. Suppose there are 1,000 patients with acute appendicitis in your town over a period. Now you start your study, and you only include 30, the next 30 that walk through the door. In essence, that was just a random sample of 30. You could have started a day later and you would have had another set of 30. You could have done this last year, or next year, or whenever. The point being: your study is actually just one of many, many, many, many possible ones. See how quickly this escalates. If I take just 20 patients and choose four, I'm not increasing things by much, yet we jump to 4,845 possible combinations of taking four patients from a set of 20. It escalates. Can you imagine 1,000? 10,000? 20,000? And if I were to take the mean of my one little sample, and if I could repeat this, if this were my town and this were the number of people, I could have had 4,845 means. If I were to draw a histogram of all those means, I guarantee you that histogram would be normally distributed. And the mean of the little sample that you've got will fall somewhere on that graph. That graph will have an area under the curve of one, and we can work out, from your value outwards to one of the two sides, what that area would be. And from that we can say whether the finding of your little experiment is statistically significant or not. Don't believe me? Carry on, let's see. I'm going to run a bit of code here, and really, you don't have to learn how to do this; I'll just explain what's going on. This is proper Python coding, and it's fun to do. If you are interested, you are going to learn something from this little block of code, but it's for illustrative purposes only, to show you the central limit theorem at work. I'm going to have a little counter; it's customary to call that counter i, and I'm going to set it to zero.
And I'm going to make an empty list and call it ave. Then comes a loop. You get different types of loops; this is a while loop. It says: while i is less than 10,000, carry out this code. You'll see that when you put a little colon behind something and hit Enter, it creates this little indented space. That's Python syntax: anything with this tab in front of it belongs to the loop, so the code doesn't start directly under the w of while, it starts further in. The IPython notebook will indent for you automatically after a colon. The loop keeps running through this code as long as the Boolean value is true: is i less than 10,000? At the moment it's zero, so it runs. As soon as i gets to 10,000 or more, it will escape this little loop and start executing the next lot of code. So that's a while loop. Inside of this loop I'm going to introduce this computer variable x and give it a value. Forget the times 10 for the moment and concentrate on this little bit: np is numpy, and it's got a submodule called random, which has a function also called random. The argument there is 40. It says: choose random values for me, 40 of them in fact, and random.random chooses values between 0 and 1, so any decimal value between 0 and 1. To be quite correct about it, np.random.random(40) actually creates an array of 40 values, and when I multiply that array by 10, it multiplies each individual entry by 10. So I'm choosing 40 random values between 0 and 1, multiplying each by 10, and getting 40 values between 0 and 10. And then I'm going to append. ave is empty at the moment.
Append means: put a value inside of ave, inside of that empty list. And what do I want to put in? np.mean of my 40 values: it calculates the mean of those 40 values and puts that inside of ave. And I'm going to do this 10,000 times. The last line is just shorthand for i equals i plus 1; let me write it out for you so you can see it: i = i + 1. Remember, we talked about this before: these are computer variables, not algebra. It says: at the moment i is 0, add 1 to it, that's 1, and put it back inside, so i is now 1. I run through my loop again: is 1 less than 10,000? Yes, it is. And it runs again and again and again until i reaches 10,000, and then we escape the loop. What I'm doing, in essence, is taking 40 values between 0 and 10, getting their mean, and chucking it into that bucket, again and again and again, 10,000 times. And now, remember this plot? I'm going to choose 20 bins, and this ave, which is now 10,000 values long, I'm going to distribute into them. And lo and behold, there it is, my kernel density estimate: very nearly, very nearly normally distributed. And just to prove the point: if I were to say np.random.random(10000), taking 10,000 random values between 0 and 1, multiplying by 10, and drawing a histogram of that, you'd see that the actual values, just one sample of 10,000 random values, are not normally distributed at all. They're just uniformly random. But if I repeatedly take a sample, and each and every time I put the mean of that sample into the bag, into the bag, into the bag, and then I count up all the means.
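The loop described across the last few paragraphs can be sketched like this. I've kept the structure the lecture narrates (a counter i, an empty list, a while loop, 40 uniform values scaled to 0–10, the mean appended each pass); the variable casing and the plotting comment are my assumptions:

```python
# Sketch of the lecture's loop: 10,000 means of 40 uniform values in [0, 10).
import numpy as np

i = 0
ave = []                              # the empty list (AVE in the lecture)
while i < 10_000:                     # Boolean test: keep looping while true
    x = np.random.random(40) * 10     # array of 40 values, each scaled to 0-10
    ave.append(np.mean(x))            # pocket the mean of this sample
    i = i + 1                         # written out in full, as in the lecture

# A histogram of ave with 20 bins (e.g. plt.hist(ave, bins=20)) comes out
# very nearly bell-shaped, centred near 5, even though each raw sample
# of 40 values is uniform, not normal.
```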
I was much more likely to find a mean of just over 5 than any other value; that was the most common mean. And this is just to prove the central limit theorem to you: the collection of all the possible sample means will be normally distributed. If you take a random sample of 30 patients and another random sample of 30 patients, one is group A and one is group B, each has a mean value, and you take the difference between those means, that is one difference in the bag. You've got one in the bag. But your two sets of 30 patients are just one tiny little drop in the ocean of billions of other combinations you could have had. And if you could have all of those, which we can never, ever do, no one can run a trial like that, there is a mathematical way, the Z and t distributions, that takes your little sample and estimates that little plot. See that kernel density estimate there? It's got something to do with that. It draws that little graph for us, depending on your specific findings, then sees where your specific finding falls on that graph, works out the area under that curve, and lo and behold, you have a p value. Phenomenal.
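The same simulation idea extends to the difference-of-means case described above. This is my own illustrative sketch, not code from the lecture: two groups of 30 are drawn each pass, and the difference between their means goes in the bag. The bag of differences comes out bell-shaped and centred on zero, which is the shape the Z and t machinery approximates from a single trial:

```python
# Illustrative sketch: the sampling distribution of a *difference* in means
# between two groups of 30 is also approximately normal.
import numpy as np

rng = np.random.default_rng(1)
diffs = []
for _ in range(5_000):
    a = rng.random(30) * 10          # group A: 30 uniform values in [0, 10)
    b = rng.random(30) * 10          # group B: another 30
    diffs.append(a.mean() - b.mean())  # one difference goes in the bag

# A histogram of diffs is bell-shaped and centred on 0, even though
# each raw sample is uniform rather than normal.
```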