So in this next video we're going to look at distributions. When you collect data point values, some values occur more commonly for a variable than others, and there's a pattern to that: a distribution. We've all seen the nice, symmetric bell-shaped curve of the normal distribution, but there are lots of others. More importantly, in this video I'm going to talk to you about sampling distributions, and that is at the crux of the matter: it's what allows us to do inferential statistics.

Here's what that is all about. If you go out and do some research, and for some variable you collect data from a bunch of subjects and calculate a mean, that mean is only one of many possible means. Think about it: if you had started your study one week later, you would have had different subjects, different data point values for that variable, and a different mean. If you had started yet another week later, or run your study in some other place, every repeat of that exact same study would have given you different data point values and a different mean. And if we could run the study millions of times over, all of those means would themselves form a distribution. That really is the heart of the matter: the mean that you find, or the difference between two means, is just one of many, many possible ones. A small p-value corresponds to one of those differences that is very unlikely to occur, one that, on a bell-shaped curve, would fall far out in the tail. You may only do your study once, but there is some beautiful mathematics that allows us to estimate where your one study, your mean or difference in means or whatever your study measures, would fall on that curve. So this is a very important video. I'm going to use the notebook and write some code, and you'll start to see, first of all, plain distributions, the patterns in your data, and then, more importantly, sampling distributions.

I've named this notebook distributions, and if we scroll down, let's look at the libraries we're going to use. We import NumPy as np, as always. From SciPy, scientific Python, we import the stats module; that's a module inside the SciPy library and it contains many, many statistical functions, and when we actually start doing some real statistical analysis, that is the module we'll use. From pandas I'm just going to import DataFrame, and from itertools, which is part of Python itself, nothing you have to install separately, I import chain. We're going to use that to, what I'll call, flatten our arrays; there's more than one way to do it, but I'll show you this one as an extra tool in your arsenal. From the math library we import the factorial function. Then I also import plotly.graph_objects as go, as we've always done, plotly.io as pio, and plotly.express as px, and a new one, plotly.figure_factory as ff; we're going to use at least one of the plots available in that. Finally I set pio.templates.default to plotly_white so that we get nice white figures.
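A sketch of that import cell, as described (the notebook's actual cell should look much the same):

```python
import numpy as np
from scipy import stats             # many, many statistical functions
from pandas import DataFrame
from itertools import chain         # used to flatten arrays
from math import factorial          # for the binomial distribution later

import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff  # new one: the figure factory

pio.templates.default = 'plotly_white'  # nice white figures
```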
Now, the first part of this notebook, remember you can download it from GitHub, and there's a video on this playlist showing how to do that, I'll leave a link in the description down below, is all about probability theory. These are the basics, things I'd assume are just common sense about probability theory: tossing a coin and it landing heads up or tails up, and a bit of code that makes it really easy to simulate flipping the coin many, many times over and counting how often it lands heads up and how often tails up (see the sketch after this section). Just some basics about probability theory that you can go through in your own time, because this video is all about distributions.

So let's start talking about random variables. Although the probability-theory section used the flipping of a coin, what I'm going to talk about here is the rolling of dice, and not just one die but two, a pair of dice. We're going to stick with the term dice because it just rolls off the tongue a little better than die. So we've got our pair of normal six-sided dice: one, two, three, four, five, six. We roll the first one and note what lands face up, we note what lands face up on the second die, and we just add the two. Then you can see the 36 possibilities, and I think it makes intuitive sense: I can roll a one on the one die and a one on the other, add them, and get two, so two is the lowest total; on the other end of the spectrum I can roll double sixes, add those, and get 12. So the possible outcomes here run from two to 12, and this outcome we're going to call a random variable.

I don't want you to get confused by the definitions and terms we're using here. Remember we've used the term statistical variable: age is a variable, your cholesterol is a variable, and you can collect data point values for a bunch of your subjects for that one variable, and that variable has a type. But you'll also come across the term random variable, and the random variable actually refers to the actual data point values; they come at you at random. If you take a bunch of participants in your study, they'll all have different ages, and every age is a random variable, what I like to call, when I explain these things, a data point value. The definition actually goes a bit deeper, because a random variable is really a function, remember y equals x squared, that maps some outcome to a value we can actually jot down. That's exactly what's happening here: I've got some experiment running, I roll one die and then the second die, and the function maps those two values, by adding them, to something that I'm capturing. So if I roll a three and a four, that's a seven, and I capture that seven: that's my random variable. So just mind these terms. The thing that I capture in my spreadsheet file, if that's how I'm collecting my data, is a data point value, that's the way I like to describe it, but in actual fact it is the outcome of an experiment: I take someone, I ask them their age, and the random variable is their age, the outcome of that experiment. Anyway, we'll leave the definitions there.
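Here is that sketch: a minimal, hypothetical version of the kind of coin-flip simulation the probability-theory section runs. The notebook's actual cells may well differ, and the names and seed here are my own:

```python
# hypothetical coin-flip simulation (not the notebook's exact code)
np.random.seed(1)                                     # assumed seed, for reproducibility
flips = np.random.choice(['heads', 'tails'], 10_000)  # flip a fair coin 10,000 times
print((flips == 'heads').mean())                      # proportion of heads, close to 0.5
```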
We needn't allow it to confuse us. So there they are: I can roll a two, a three, a four, all the way up to 12. I'm now going to map this outcome to a variable, and we're going to call that variable X, this uppercase X. I might have used some long name with underscores in between as my column header, the_sum_of_two_dice or whatever, but just to keep things short I'll call it X. So X is our random variable. My sample space elements are anything from two to 12, but we note that certain totals have more than one way of getting to them. There's only one way to roll a two, and that's double ones; there's only one way to roll a 12, and that's double sixes. But there are two ways to get a three: the first die can land on one and the second on two, or the other way around, the first on two and the second on one. Those are two separate things; the individual dice fell on different numbers. So there are two ways to get a three, three ways to get a four, four ways to get a five, and so on, and the most common total is a seven. If there are more ways to get a seven, we'd suggest that, well, yes, you are more likely to roll a seven than you are to roll a 12 or a two, because there are so many more ways in which the dice can fall that give you a seven.

So looking at all 36 possibilities, we say the probability of X, X meaning the sum of the two faces facing up, being two: there's only one of the 36 ways to get a two, so the probability of rolling a two is just 1/36, and the same goes for 12, only 1/36. But the probability of X being a seven is 6/36; that's much more common.

So let's compute these probabilities. I'm going to use list comprehension here, so you see these outside square brackets, and I say i divided by 36 for i in 1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1, the counts of ways we listed above. I run that and it gives me the probabilities. The highest probability is 16.7%, and that's for rolling a seven; then 13.9% for rolling a six or an eight; and it goes all the way down to about a 2.8% probability, or likelihood we should really say, of rolling a two or a 12. There's a nice hierarchy, a nice pattern, and the word we're going to use for it is a probability distribution.

Now, the nice thing about these probabilities is that they also sum to one. In other words, you can't roll a one and you can't roll a 13; that's impossible, and from 2 to 12 encompasses the whole lot. Mutually exclusive and collectively exhaustive is the term you'll come across: mutually exclusive, in that rolling a 12 and rolling a two are two separate things, and collectively exhaustive, in that from 2 to 12 is all there is, there's nothing other than that. So if I sum all those probabilities, I'd better get one. And because they're mutually exclusive, it means I can also ask: what is the probability of my outcome being 10 or more? And that's what we see here: I just add the probabilities for 10, 11 and 12, the 0.083, the 0.056 and the 0.028.
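That cell, roughly as described (the variable name is mine):

```python
# theoretical probabilities for the totals 2 through 12
probabilities = [i / 36 for i in [1, 2, 3, 4, 5, 6, 5, 4, 3, 2, 1]]

print(sum(probabilities))      # mutually exclusive, collectively exhaustive: sums to 1.0
print(sum(probabilities[8:]))  # P(X >= 10): the totals 10, 11 and 12, about 0.167
```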
Well, I should just mention that a couple of little typos had slipped into the notebook there, and this is an ideal opportunity to correct them: the last probability should read 0.028 rather than 0.031, and adding the three probabilities up gives 0.167, so the probability of rolling a 10 or more is 16.7%.

You also see how we get this nice mathematical typesetting here. If I enclose something in dollar signs, that tells the notebook that I want mathematical typesetting using what we call LaTeX, L-A-T-E-X; you can also pronounce it latex, I suppose, but LaTeX is perhaps the proper way to say it. And I can add all these symbols: a left parenthesis is \left( , greater-than-or-equal-to is \ge, and the right parenthesis is \right). I enclose all of that in dollar notation, so that $P\left(X \ge 10\right)$ renders as nice mathematical typesetting when I run the cell.

Now, let's simulate our experiment, because what we have up here is a theoretical distribution, and we'll now see there's a difference between a theoretical distribution and an empirical distribution: the empirical distribution is the one that you actually found when you did your study. So this is our study: we're going to roll this pair of dice a few thousand times, 10,000 if you scroll up. So I say np.random.seed(3); if you use the same integer there, you're going to get the same random values back. I'm going to lock this all up in a data frame with this one column, roll_total, and as you can see here, I'm using a dictionary. Don't worry, just look at this code, copy and paste it, and play around with it yourself; this is not the type of code we're going to write when we do the actual statistical analysis, which I've been promising for so long, but we'll get there. I'm using list comprehension for my 10,000 rows of data. np.random.randint(1, 7) gives me random integers between one and six, remember the highest value, the 7, is excluded, and I'm rolling two of them at a time: give me two random values between one and six, exactly what we're doing with the dice. That's all passed as an argument to the np.sum function: sum those two, roll two, sum them, roll two, sum them. I convert the result to a list so that I just have this list of items for my data frame, and I do that for i in range(10000), so from zero up to but excluding 10,000, which gives me my 10,000 values.

So let's just look at the head of this; it's really more of a series than a data frame, because I only have this one column. My first row was a four, my second a six, then a two, then a seven, then a 10. If you use that same seed, you're going to get exactly the same values; if you don't put the seed there, or use a different number, you'll get different values. Then let's look at the value_counts method: rolls_totals, that is my data frame name, dot roll_total, that's my single column, so I'm getting back a series, and I just created it as a series anyway, and then I use the value_counts method with normalize=True, because I want back the probabilities. And there we go.
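Reconstructed from that description, the simulation cells look something like this (the exact construction in the notebook may differ slightly):

```python
np.random.seed(3)  # same seed, same random values back

# 10,000 rolls of a pair of dice: two random integers from 1 to 6, summed
rolls_totals = DataFrame({'roll_total': list(
    np.sum(np.random.randint(1, 7, 2)) for i in range(10_000))})

rolls_totals.head()                                   # 4, 6, 2, 7, 10 with this seed
rolls_totals.roll_total.value_counts(normalize=True)  # the empirical probabilities
```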
This is our empirical distribution, the distribution of our actual 10,000 values, versus the theoretical distribution. Now, I showed you before the quickest way to sort this: because I created the totals as a series, the index holds the total values themselves, so if I just use the .sort_index() method I get them back in sorted order, 2, 3, 4, 5 and so on up to 12, and there you can see how many times each total occurred out of that 10,000. You can see the 1,699 sevens, which is about 17%, very close to the 16.7% of our theoretical distribution. So our empirical distribution is close to our theoretical distribution, and the larger we make that 10,000, the closer it's going to get to the theoretical one.

Let's just create a quick bar chart of this. I'm going to use Plotly Express, which we imported as px, and its bar function. My x range is from 2 to 13, remember the 13 is excluded, so that'll just be 2 to 12, and for y I put how many of each total there were. And there we go: a beautiful distribution. You see the pattern? The 7 occurred most commonly, and the further away you get from 7 on both sides, the less commonly the totals occur. There were 282 twos and, on the other side, 299 twelves, but 1,699 sevens. That is a distribution: the random variables came in a certain pattern, some more likely than others, and that's why you get games of chance with dice based on the number 7; there are games like that.

So let's look a bit closer at more examples of random variables and their distributions. I'm going to create a computer variable called height, and I'm going to use stats.norm.rvs. Now, this rvs function is beautiful, inasmuch as it allows us to create random variables as if we were doing a study. It's not always possible for you to go out and get data on some subjects or participants; sometimes you just want to simulate some so that you can practise, and this is a great way, do try and read up on rvs. stats has many, many named distributions, and one of them is the normal distribution. So I say stats.norm, and then rvs means: give me back some random variables from this distribution. I can set a few arguments there: loc, that's the mean, so I'm saying give me a mean of 160; scale, that means standard deviation, so a standard deviation of 10; I want 200 of them; and I'm setting random_state=1, so if you use the same random state you're going to get the same 200 heights back. So I'm simulating here the heights of 200 people, such that height follows a normal distribution, that's the bell-shaped curve, we'll look at that, with a mean of 160 and a standard deviation of 10. Okay, boom, it's as easy as that: I have my 200 values.

Let's create a plot of this. I'm going to call my plot height_hist, for height histogram, and we're going to use that figure factory; one of its plots is create_distplot. Very nice. So I'm going to pass height, but I've got to pass a list of lists, so there's the square bracket: the first thing I pass is height, which is actually a NumPy array rather than a list, passed as one element inside of a list. I'm going to label it height, and I set the bin size to 5. Let's create this distribution plot.
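Sketches of those two plotting cells, as described (the exact figure options in the notebook may differ):

```python
# bar chart of the 10,000 roll totals
counts = rolls_totals.roll_total.value_counts().sort_index()
px.bar(x=list(range(2, 13)), y=counts.values).show()  # 13 is excluded, so x runs 2..12

# 200 simulated heights and their distribution plot
height = stats.norm.rvs(loc=160, scale=10, size=200, random_state=1)
height_hist = ff.create_distplot([height], ['height'], bin_size=5)
height_hist.show()
```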
So what you can see is the normal histogram there, and it seems most people were around the 160 mark, as we specified for the mean. Because we've set the bin size to 5, it's grouping quite a few people into each little bin; you can see each bin is actually 5 centimeters wide. But what you can also see is this nice smooth curve, and this is based on the data itself, so it's not a theoretical normal distribution; this is an empirical one. It's called a kernel density estimate, and it just uses some mathematics to smooth out this curve so that we get a better idea of the spread in the data than we might get from the histogram alone. And what you see at the bottom is a rug plot, RUG, which gets added as well: all 200 individual values. We can see they cluster densely around the 160 mark and thin out further away, and because we've used the norm function here, in other words drawn from a normal distribution, that's exactly what we would expect. In the notebook I also talk a little about the empirical distribution versus the theoretical distribution, so this is our empirical distribution, the actual data, versus some theoretical distribution. There are also the moments of these distributions; I've written something about them, and a little bit about skewness, and you can certainly read up on that. We're not going to use it when we do our actual data analysis, I don't think it's quite essential, but I've put it in there for you if you want to read about it.

So let's get to some actual work. We're going to look at the theoretical normal distribution, and the one we're going to start with is not just the normal distribution but the standard normal distribution: a very special distribution in statistics, because a lot of what we deal with is based on it. The standard normal distribution has a mean of 0 and a standard deviation of 1; it's the prototype. I'm going to create two mystery computer variables here, lower and higher: lower is stats.norm.ppf(0.01) and higher is stats.norm.ppf(0.99). I'll keep what they're for as a bit of a mystery, but you can see the results are symmetrical about 0: negative 2.326 and positive 2.326 on the other side. You can start thinking to yourself about what this is going to be about.
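Those two cells, roughly as described:

```python
# percent point function: the values below which 1% and 99% of the area lies
lower = stats.norm.ppf(0.01)    # about -2.326
higher = stats.norm.ppf(0.99)   # about +2.326
```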
Remember I used the terms mutually exclusive and collectively exhaustive, such that if I combine all my probabilities they sum to 1. That was fine for discrete variables like the rolling of a die, but what about continuous variables? Remember, with a continuous numerical variable a single value like 0.4 means very little on its own, because a value can be 0.3999476, very close to 0.4; I can infinitely divide my numbers, so an actual single number makes very little sense. It is more about the probability between two values, and you'll see what I mean a little bit later. But look at those numbers, 0.01 and 0.99. What I want you to start thinking about is this: picture the bell-shaped curve in your head and imagine we can calculate the area under the curve. You might remember that from calculus, that's integration, but we're not going to do integration. Think instead about the area of a circle, pi r squared, or the area of a square, one side squared, or a rectangle, base times height: every geometrical shape has an area. This nice little curve we drew with the distribution plot also has an area under it, and we're going to do the same here with the very nice curve of the standard normal distribution. The area under that curve is going to equal 1, 1.00, and remember that's just an area, like the area of a triangle or a circle; it's a physical area. This nice bell-shaped curve, which we're going to draw from what we call a probability density function, has a total area under it of 1.00, and if you think about it, the curve drops towards the x-axis on both sides, getting flatter and flatter the further away we get from the bulky middle. That 0.01 and 0.99: those are going to be markers on the bell-shaped curve, such that to the left of the one marker I have 1% of the area under the curve, and from the other marker out towards positive infinity I have another 1% of the area. Okay, just hold that in your mind.

Now, this next part has nothing to do with a curve that you're eventually going to compute yourself. I'm just going to create this values computer variable, going from lower to higher, so from negative 2.326 to positive 2.326, with 100 values in between, and I'm going to flatten it; we need to flatten it, and that's why I use itertools, chain.from_iterable just flattens it into a single list of values. You needn't worry about that; we could also just have said values.flatten(), that would be another way to do it. I'm just going to show you the first five values: we start at negative 2.326, then negative 2.279, and so on, so I've just divided the stretch from negative 2.326 to positive 2.326 into 100 values, and all this flattening is doing is giving them to me as a single flat list; I could also just have said list(values.flatten()). Now that I have these 100 values, I use them to create the probability density function values: pdf_values = stats.norm.pdf(values), flattened again using itertools, though as I said you can just use .flatten(). So for every value here I now have the corresponding value of this probability density function, and while the words might not make much sense yet, this is what it's all about when we plot it. If you look very closely, you see it's just a bunch of little straight line segments, but if you move back it looks like a smooth curve; the reason is I've only computed 100 points. And there is the standard normal distribution: most values sit near 0, and the further away we go, the less likely a value would be if I were to draw randomly from this distribution.
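A sketch of those cells. With a plain np.linspace call the flattening step isn't strictly needed, so take the comment about chain as a pointer to the notebook's approach rather than a requirement, and the plot call as a stand-in for the notebook's fancier figure:

```python
values = np.linspace(lower, higher, 100)  # 100 points from -2.326 to +2.326
# the notebook flattens with chain.from_iterable (or values.flatten());
# a 1-D linspace is already flat, so this sketch skips that step
pdf_values = stats.norm.pdf(values)       # height of the bell curve at each point
px.line(x=values, y=pdf_values).show()    # hypothetical plot call, for illustration
```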
The negative 2.326 and positive 2.326 were just the cutoffs I asked it to plot between; think of the curve really extending to the left and to the right, such that the whole area under the curve from negative infinity to positive infinity is 1, 1.00, 100% of the area. But if I go to that little point on the left and that little point on the right: to the right of the right-hand point lies 1% of the area under the curve, to the left of the left-hand point lies 1% of the area under the curve, and the area under the curve from the one point to the other is the remaining 98%. So that's what that .ppf was doing: it just gave me little markers on the left- and right-hand sides to choose between.

Now let's have a look at this figure; it's a long figure that I build here for the standard normal distribution, again running from negative 2.326 to positive 2.326. What I want to show you, though, is this: imagine I do a research study with two groups of participants, one group getting an intervention and the other a placebo, and for some variable I calculate the mean of each group. The difference between the two means is going to fall somewhere on this distribution. Here I've simulated that the difference between the means of the two groups is negative 1. I've got to be slightly careful, though, because if I subtract the lower mean from the higher one I get a positive value, but who said the placebo group has to go before the active group? I can subtract either one from the other, and I'll end up with a positive or a negative number purely depending on which I decide to subtract from which. That's why we've actually got to reflect this red line on the right-hand side as well: imagine a line going up at plus 1, just like the one at negative 1. And what we now say is: look from the negative 1 towards the left, and from the positive 1 towards the right, and take all the area under the curve to the left of the red line plus all the area to the right of the imaginary red line on the other side. That is my p-value. That's actually what a p-value is: it's an area under the curve. When we do Student's t-test we're going to use the t-distribution as our theoretical distribution, I'll show it to you shortly; here we just have the standard normal. We see that our difference in means was minus 1, or minus whatever; we plot it on our chart, we reflect it on the other side, and we work out the area under the curve towards the outsides, and that's a p-value. That's exactly what a p-value is, so let that sink in a little; it's actually as simple as all of that.

Now, to calculate this area under the curve we just write one line of code; you're never going to do all of this by hand. What we use is the cumulative distribution function, the CDF, and let me show you what the CDF looks like when we plot it. It just starts accumulating from the left-hand side: we start with nothing of the area under the curve, and as we go along we pick up more and more and more, until we get to the end, which is then 100% of all the cases. That's a cumulative distribution function. From a value like negative one we can just read off where we are, and you see the 0.1587: that's the area under the curve to the left of this negative one. And what we do here is multiply it by 2, because we have symmetry on both sides, and that's our p-value; that's exactly how a p-value is calculated. So I say stats.norm.cdf(-1), that's about 0.1587, and I multiply it by 2 because of the symmetry on the other side. If my difference between the means followed a standard normal distribution and I found a difference between the two groups of negative one, I would get a p-value of about 0.32.
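That calculation in one line, as described:

```python
# p-value: the area in the left tail below -1, doubled for the symmetric right tail
p_value = stats.norm.cdf(-1) * 2   # 2 * 0.1587, about 0.317
```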
And later we'll see that that is not statistically significant, because we want one of those rare outcomes. So you've learned a little bit more now; we're going to reiterate all of this when we do the actual statistical analysis, and it will come back to you.

There's one discrete distribution I do want to show you, and that's the binomial distribution, because it's quite a bit of fun. I'm not going to go through the functions that I've created, but what we're going to talk about with a binomial distribution is a theoretical outcome that is binary: only two possible outcomes, a sample space with only two elements. So imagine a participant takes a drug and all we record is: did the participant get a side effect, yes or no? To do this, you see I've created two little functions that do exactly the same thing, and they're based on the equation for the probability. How it works is the sentence I want to show you: let's calculate the probability of two side effects among ten people, given a probability, look at that horrible spelling in the notebook, of a side effect in any individual of 0.05. So just think about it: we've got this drug, and we know that anyone who takes it has a 5% probability of getting a side effect; efficacy and safety studies were done beforehand, the drug was given to a few hundred people, and we noticed that 5% get this side effect. So my research question is: if I take ten participants, each of whom individually has a 5% probability of getting a side effect, what is the probability that exactly two of those ten have a side effect? That's a valid question, and with the functions I've created I put in my ten people, my interest in two of them, and the 5% likelihood for any one individual, and I can work out the probability of two of them getting it. And that probability is very low: it's 0.075, a 7.5% probability that two among the ten have a side effect. It would be different for 3 or 4, and it gets smaller and smaller, because each individual only has a 5% chance. But notice that I don't just multiply 5% by 2 and say there's a 10% probability; that's not the way a binomial distribution works.

By the way, you'll come across the terms failure and success: success is the outcome we're after, and it's the probability of success that goes into the equation. But success and failure have nothing to do with the actual English meanings of those words; success is simply the thing that you are investigating. It might very well be that I'm looking at survival versus death, and the outcome I want to investigate is the probability of death; success would then be death. So don't read anything into the terms success and failure; success is just the outcome whose probability I'm concerned with.

So look at this below: I'm creating a list object using list comprehension, looking at the probability of 0, 1, 2, 3 and 4 side effects among the 10 individuals. What is the probability that none of those 10 people gets the side effect, given that each individual's probability is 5%? What is the probability that only one person gets it? We've already seen two people; and then three people, and four people. I can use the binomial distribution for all of that.
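I won't reproduce the notebook's two helper functions here, but a minimal sketch of the binomial formula they implement, P(X = k) = n! / (k!(n-k)!) * p^k * (1-p)^(n-k), might look like this (the function name is my own):

```python
# a sketch of the binomial probability the notebook's helper functions compute:
# P(X = k) = n! / (k! * (n - k)!) * p**k * (1 - p)**(n - k)
def binomial_probability(n, k, p):  # hypothetical name; the notebook's functions differ
    return factorial(n) / (factorial(k) * factorial(n - k)) * p**k * (1 - p)**(n - k)

print(binomial_probability(10, 2, 0.05))                      # about 0.075, the 7.5% above
print([binomial_probability(10, k, 0.05) for k in range(5)])  # k = 0, 1, 2, 3, 4 side effects
```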
So let's just have a look, just for interest's sake. You see that the highest probability is for none of those 10 people to get the side effect, and that's at about 60%. There's a 31.5% probability that exactly one person gets the side effect, and then, as we've seen, 7.5% for two people, and it really drops away from there: the chance that, say, 5 people get the side effect when each individual only has a 5% chance is very, very low. I've plotted some of these, and you can see the k along the axis; that is the binomial distribution. It's actually quite a bit of fun, and you can see some of the moments there for the binomial distribution too.

We now come to the most important part of this lecture, and that's sampling distributions; this is going to tie everything together as far as how p-values really work. What I'm going to do is create a population. Allow me some delusions of grandeur: imagine that I have my own brand-new planet and I am the ruler of that planet, so much so that I'm a deity and I create everything on it. On my planet there are only 10,000 people, beings that I'm creating, and each of them has a certain height. So I create a computer variable, population_heights, and I simulate their heights as being anywhere from 140 centimeters to 200 centimeters, and remember, this is a uniform distribution: 140 is just as likely as 146 is just as likely as 170. I don't have a mean and a standard deviation of a normal distribution here, with everyone clumped up in the middle; it's an even spread across the 10,000. As an empirical distribution it's not going to be perfectly uniform, but very close. So there are my 10,000 beings on my planet, all of different heights, and you see it's not a bell-shaped curve: some heights are just as likely as other heights, very close to a uniform distribution.

I am now going to run some studies on my 10,000 beings. I select 100 of them at random, measure their heights, and save only the mean of those 100 heights; then I chase them all back to their villages. Tomorrow I gather another 100 of them at random, some might be the same as before, it might be a brand-new 100, I measure their heights, capture only the mean, and chase them back to their villages. What a colourful example I'm coming up with. Anyway, I do this a thousand times over; I'm in charge of this planet, I can do what I want, and so I have a thousand means. That's what I've done here with list comprehension: mean_heights, using np.random.choice to take 100 from the population with replace=False, in other words those 100 are 100 distinct individuals, I don't take one individual and throw them back in the pool so they can be chosen again within the same sample; but at the end of the day, after my 100, they all go back in the pool, because I do this a thousand times over. So there I'm simulating my thousand means. And remember, this comes from a uniform distribution; it wasn't as if most of them were 170 in height. So now I have these thousand means, in other words a distribution of a test statistic: the mean is a test statistic, something that I've calculated from a sample, and I have this distribution of means. I wonder if you can guess what's going to happen here.
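A sketch of that simulation. The exact call that generates the uniform heights is my assumption, the notebook may build it differently, but the sampling step is as described:

```python
# 10,000 beings with heights spread uniformly between 140 and 200 cm
population_heights = np.random.uniform(140, 200, 10_000)  # assumed generation call

# 1,000 studies: sample 100 distinct beings each time, keep only the mean height
mean_heights = [np.random.choice(population_heights, 100, replace=False).mean()
                for i in range(1000)]

px.histogram(x=mean_heights).show()  # hypothetical plotting call for the histogram below
```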
So you can see that the actual beings out there on my planet are just as likely to be 2 meters tall as they are to be 1.4 meters tall; they're not bunched in the middle. Yet I take a hundred of them and calculate the mean, another hundred and calculate the mean, and now I have a list of a thousand means. That's a sampling distribution: a distribution of a test statistic, not of the actual values themselves. You might have guessed what this is going to look like, but let me draw it as a histogram for you, and here we go, look at that. There's a different pattern to this; this is no longer a uniform distribution. My sampling distribution of test statistics, in this case means, is almost normally distributed, arising from something that was never normally distributed to begin with. That is amazing, and it's part of what we call the central limit theorem, and it allows us to do a lot of inferential statistics and use parametric tests.

And here's the crux of this whole video: you only get to do one study. Imagine that study calculates the difference between two means. You could never do the study ten thousand times over, or a million times over; it's financially impossible, impossible for time constraints; you get to do your study once. But your finding is one of many, many, many possible ones, some of which are much more likely than others. Your test statistic, your difference in means, is going to be one of many, and its sampling distribution is usually going to be based on this normal distribution. What you are hoping for is that the one you find is one of the rare ones, because all of this area under the curve sums to one, and if yours is one of the unlikely ones, then you have a small p-value, and you say: well, there's a statistically significant difference between those two groups. As simple as that.

So, the distribution that we've seen before, the standard normal distribution: when we refer to it as the distribution of a test statistic, a sampling distribution of sample statistics, we change the name from standard normal to the z-distribution. My difference in means falls somewhere on it, I mirror that value on the symmetric other side, and I calculate the area under the curve cut off towards the outsides. Now, the z-distribution does require that I know the standard deviation of that variable in the population, and with some 7 billion people on the planet, for most variables we don't know all 7 billion values. In that case we make use of the t-distribution, from William Gosset at the beginning of the 1900s. He worked for the Guinness brewing company and worked out statistical tests for small sample sizes. Guinness didn't want him to let the secrets out of the bag, because they were using them in their own business, but he persisted, he wanted to publish academically, and they said to him: well, you can publish, but it has to be under a pseudonym. He chose Student, or they might have given him student and teacher as the two possible ones to choose from, whatever the situation might have been, and hence we know Student's t-distribution and Student's t-test. His real name, then, was William Gosset, from the Guinness brewing company.
So here I'm going to use stats.t.ppf, and not stats.norm. The beauty of Gosset's mind and what he created is this idea that we don't need the standard deviation out in the population; we only care about how many participants are in our study, and based only on that do we draw this nice bell-shaped curve. So I've got to pass two arguments: the 0.01 and the 0.99, just to get those two cutoff points at 1% and 99%, and then a value reflecting there being 30 participants in my study, so I've got to say 30. I create that same run of 100 values, I create the PDF values, and I plot them, and there you go: the t-distribution looks very much like the normal distribution, but it's something we can use when we only have 30 participants in our study, or fewer, and that's the beauty of it all. Once you get to more than 30, these two distributions lie very close to each other, and when you have really large numbers you might as well use the z-distribution instead of the t-distribution; but by and large we stick with the t-distribution. Let's plot this one here, comparing the z's and the t's, and there you go: you can see they're very close to each other except out in the tails. It looks small here on the screen, but out there the difference starts to get quite significant, and if we're talking about a p-value of 0.05 it becomes crucial, so we usually stick with the t-distribution.
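That cell, roughly as described (strictly speaking the second argument of stats.t.ppf is the degrees of freedom; the video simply passes 30 for its 30-participant example):

```python
# 1% and 99% cutoffs of a t-distribution
lower_t = stats.t.ppf(0.01, 30)
higher_t = stats.t.ppf(0.99, 30)

t_values = np.linspace(lower_t, higher_t, 100)
t_pdf_values = stats.t.pdf(t_values, 30)  # the t curve, very close to the normal one
```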
Which brings us to the idea of confidence intervals. Say we do a study, capture the cholesterol of 100 people, and express a mean for that group; let's say the mean cholesterol, in international units, was 7.4. But I said to you we're talking about inferential statistics: you want to infer your results onto a larger population, or someone else's population, and you want to know, if you were to look at all the people, what the real mean might be compared to the mean of the participants in your study. So we have this plus-minus idea, the 7.4 plus or minus 2 for instance, and that is what we call a confidence interval. If we say a 95% confidence interval, and you've seen that many times in the literature, we create these bounds on either side of 7.4 and we suggest that the population mean lies between the lower and upper bound. But note: we are not 95% confident that the real population mean lies between those bounds. What it really means is this: if I were to do my study 100 times over, every time I'd get a slightly different mean and slightly different upper and lower confidence bounds, and in 95 of those studies the real population mean would lie within the limits I stated, and in 5 of them it would not. You don't know whether the one study you have is one of the 95 or one of the 5, so you can't say "I'm 95% confident that the real population mean is between these two values". No, no, no: it's just that, if you could do it 100 times over, 95 of those slightly different intervals would actually contain the real population mean and 5 would not.

So how does this confidence interval work? We take our mean, plus or minus some value, and that value is the z (or t) critical value at alpha divided by 2, multiplied by the standard deviation of our sample divided by the square root of how many participants are in our study. Think about the nice bell-shaped curve: I want a 95% confidence interval, which means 100 minus 95 leaves me 5%, and if I divide that 5% by 2 I get 2.5% on the one side and 2.5% on the other. Those are the red lines I can draw, negative on the one side and positive on the other, such that 2.5% of the area under the curve lies to the left towards negative infinity, 2.5% lies to the right towards positive infinity, and that is how I get 95% of the area under the curve in the middle. So let's get stats.norm.ppf(0.975), and I say 0.975 because 97.5% plus 2.5% gets me to 100%, so that's the right-hand cutoff. You see that it's 1.96; you'll come across that value many times. 1.96 standard deviations above the mean of a standard normal, or z-distribution I should say, gives me 2.5% of the area under the curve towards the right, and negative 1.96 on the left towards negative infinity gives me the other 2.5%, such that combined the tails hold 5% of the area, which means in the middle I've got 95% of the area under the curve. So let's draw that, and you can see that's exactly what I've done, here for the z-distribution, in other words a mean of 0 and a standard deviation of 1. You can see where the two cutoffs are, so that to the left of the one red line the area under the curve is 2.5% of the total, to the right of the other it's 2.5% of the total, and in the middle I have the remaining 95% of the area under the curve. That's where we get these values for 95% under the curve.

So here we've done it: I'm multiplying by the standard deviation of 2 and dividing by the square root of 60, imagining that I have 60 participants in the study, and that gives me the 95% margin: 0.50606 and so on. Now I subtract that from the 7.4, remember our study had a 7.4 mean for cholesterol, and I add it to the 7.4, and that gives me the bounds of my 95% confidence interval. Let's run this one as well, I should print it to the screen, and there I see 6.89 to 7.91. So in my study I can write: 7.4, 95% confidence interval 6.9 to 7.9. And that's exactly the same as what we drew above; we've just shifted it so that we have the 7.4 in the middle instead of the 0. Again, we won't have to do any of this by hand; this is just to explain what we're doing. We're just going to use a simple function inside of scipy.stats that does the confidence intervals for us, and I'll show you a couple of ways to do that. Just one more thing: for the t-distribution I say stats.t.ppf, and because I have 60 participants, that leaves me with something called 59 degrees of freedom. I won't go into what degrees of freedom are; just take it for now that that is true. But you can see that with 60 participants and a 7.4 mean, I get almost exactly, or very nearly exactly, the same confidence interval whether I use the z-distribution or the t-distribution, because 60 is more than 30, so these two distributions lie very close to each other. The same thinking applies either way.
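The whole calculation, roughly as described:

```python
# 95% confidence interval for the cholesterol example: mean 7.4, s = 2, n = 60
z_critical = stats.norm.ppf(0.975)     # about 1.96
margin = z_critical * 2 / np.sqrt(60)  # about 0.50606
print(7.4 - margin, 7.4 + margin)      # about 6.89 to 7.91

# the t version, with 60 - 1 = 59 degrees of freedom, is almost identical
t_critical = stats.t.ppf(0.975, 59)    # about 2.001
```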
So there you go. I hope that explanation meant something to you, and even if it didn't completely, once we start doing the actual statistical tests, which is getting closer and closer, by the way, these things will come back to you, and you'll have this concept in your head very clearly: a p-value is nothing more than an area under the curve. The difference between two groups that we find, or whatever measurement we make, is going to be one of many possible ones, and what we're hoping for is one of the rare ones. It's all based on the mathematics of the theoretical distributions, the z-distribution, the t-distribution, and later the chi-squared and all sorts of other distributions, and we can calculate all of that purely from the small sample that you have. Your result will fall somewhere on that distribution, and we can calculate an area under the curve for it, which is going to be our p-value.