The Z and the T distributions. We've just seen the central limit theorem, which tells us that when I do some analysis and calculate a mean value, the mean of my 30 patients is just one of many possible means, and all of those possible means fall on a distribution. The central limit theorem says that distribution will be normally distributed: a bell-shaped curve. Now, the two main forms of this distribution, and I don't want to confuse you, we're not talking about the normal, Poisson, or hypergeometric distributions here; I'm talking about that little curve that the computer can draw that pertains to your specific research, and where your value falls on it. Let me run through that again just to make it very, very clear. Say you are comparing the means of two groups: you've got two groups of patients and you're comparing their means. What you're actually doing is looking at the difference between the two means. You would have 30 patients in one group and 30 patients in the other, each group has a mean, and you take the difference between those means. If I could now do this study over and over again, making two new groups each time, and I calculated the difference of means between the two groups and started plotting those differences, plotting, plotting, plotting, that distribution would be normally distributed. And that distribution comes in two forms, or two equations, Z and T, and I'll explain which is which. You tell the computer which one to use, and it really is about the situation you are in; it'll draw that little graph for you, see where yours falls, and work out the area under the curve. Let's go.
We're going to run the style sheet as always, and I'm going to show you the imports: import numpy as np, numerical Python with the abbreviation np; pandas as pd; from scipy I want to import stats, so I'm importing the whole stats sub-module from the scipy module; matplotlib we know well; seaborn we know well; filterwarnings we know well. I want to render my plots on this web page, so I'm running that magic command, and I want to ignore those ugly pink warnings printed to the screen. Introduction, there we go. Let's start by talking about probability with a thought experiment. Imagine we take a million healthy individuals and we do their white cell counts. Certain values are going to come up more often than others: in healthy individuals it's going to be much more common to see a white cell count of, say, 5 than a white cell count of 1. There's going to be a distribution of values. Now, I can get a very powerful machine, or an underpowered one. One machine will only tell me 5.5, the next 5.56, the next 5.564, the next 5.5648, et cetera; it can go deeper and deeper and deeper, and that is when we start talking about continuous variables. A white cell count is a form of continuous variable because I can go deeper and deeper. Now, for sure there is a single white cell that can be no further divided, but that becomes ridiculous, because we're talking something on the order of 10 to the power 9 individual cells per liter. So we do treat this as a continuous variable, and we have to lump some values together. We can say, well, anything from 5.5 to 5.55 we're going to group together; we call it a bin, we're going to put it in a bin. So we've got to get away from seeing values as single, isolated numbers. There's a seamless range of values which we now start to look at.
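The imports described above can be sketched as a notebook setup cell like this; the exact style sheet command isn't shown in the lecture, so only the named imports appear here:

```python
# Standard setup for this notebook: numerical and statistical
# libraries plus plotting.
import warnings

import numpy as np                # numerical Python
import pandas as pd               # data tables and series
from scipy import stats           # the whole stats sub-module of scipy
import matplotlib.pyplot as plt   # base plotting
import seaborn as sns             # statistical plots

warnings.filterwarnings('ignore')  # hide the ugly pink warning boxes
# In a Jupyter notebook you would also run the magic command:
# %matplotlib inline
```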
Now, random values come in patterns, which we call distributions, as I've said. Most variables pertaining to healthcare come in the normal form; this white cell count will be normally distributed. Let's play with some data; I want to make all the theory I've spoken about clear visually. I'm just going to warn you: this is not code that you are going to write when you do your own data analysis, it is just for us to play around with, so you can get a feeling for what variables are and what distributions are. I'm going to create a computer variable called rv1 and put inside of it stats.norm, with no arguments whatsoever, so I'm just attaching the normal distribution to rv1. There are no samples attached to it, no values attached to it; it's just a normal distribution. But I can call the method rv1.mean(). If I were to type rv1. and hit the tab key, you'd see there are some things attached to it, such as the probability density function; the one I want is the mean, with no arguments. This normal distribution is very special: it is the standard normal distribution. It has a mean of zero, a median of zero, and a standard deviation of one. If you don't know what these things are, we will run through them. From this distribution, though, let's take a few samples, so the computer understands that when it gives me these values, some of them should occur more often and some should not occur too often. I'm going to ask it to give me 2000. Look at the code here: the computer variable rv1_values equals stats.norm.rvs; we've seen norm.rvs before, random variates. The argument it takes is a size, and I want 2000, so I'm going to ask the computer to give me 2000 values, but make them come from a normal distribution please, and in this instance it's going to be the standard normal distribution, so the mean is going to be zero.
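The two steps described above, attaching the distribution to a variable and then drawing samples from it, look something like this:

```python
from scipy import stats

# Attach the normal distribution itself to rv1: no data, no samples,
# just the distribution, which with no arguments is the *standard*
# normal distribution (mean 0, standard deviation 1).
rv1 = stats.norm

print(rv1.mean())    # 0.0
print(rv1.median())  # 0.0
print(rv1.std())     # 1.0

# Now draw 2000 random samples from that standard normal distribution.
rv1_values = stats.norm.rvs(size=2000)
print(rv1_values.shape)  # (2000,)
```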
I'm going to put all those 2000 values inside of a pandas Series, again under a new computer variable name, pandas.Series, and remember we can then take a pandas Series and just describe it, to give us the normal descriptive statistics. Let's run that: indeed I get 2000 samples, a mean very close to zero (every time you run this code you will get new values), a standard deviation very close to one, a minimum of about minus 3.7, a maximum of 3.2, and you see the median there. Let's plot this in a graph just to show you. plt.figure tells the computer to get ready to draw me a figure of width 8 and height 6. What type of figure do I want? A seaborn distribution plot, sns.distplot. What do I want to plot? In the arguments I'm going to say all those 2000 values. I don't want to set the bins for now; let Python just do its own calculation for the bins, and let's show that plot. There we go. So you see it was much more common to find values around zero, and there's my kernel density estimate. I want to come back to this: this is a continuous random variable, in other words what we become interested in is no longer a single value but the range between this value and that value. I can increase the size of those bins by making the bins fewer; let's run that, so there are more values inside of each bin. But this is what I'm interested in: this normally distributed curve. I can now ask the computer the following. Remember what rv1 is: rv1 is just stats.norm, a normal distribution; it doesn't contain any values. rv1.cdf is the cumulative distribution function. What does that mean? Think back: it might very well be that we did this analysis and our value, the difference between our two means, fell here at negative two. What was the probability of getting a value as extreme as negative two?
The way the computer does that, it always counts from the left-hand side: it starts counting the area under the curve from negative infinity up until this point, negative two, and it tells me the area under the curve, how likely it was. You can see the likelihood is very small, because the largest area under the curve falls here. So that's the cumulative distribution function: rv1.cdf(-2), control-enter, control-return, and it says 0.0228, so about a 2.3% chance of getting a value of negative two or less. Isn't that a thing of beauty? Now I can ask the computer what the probability was of getting a value as high as 0.7. Where's 0.7? About there. But remember what the computer's going to do: it counts from the left, so it's going to give me the probability of getting a value of 0.7 and less, but that's not what I want, because I'm falling on the right-hand side of the middle; I want the area under the curve towards this side. But I know that the total area under the curve is one, so if I take one and subtract from it the area up until this 0.7 spot, what I'm left with is what I'm looking for. So I can say 1 - rv1.cdf(0.7), and I get a 24% probability, a p-value of 0.24, of hitting a value of 0.7 or more. Just to show you that it's symmetric about zero, let's say two: negative two, two, and you'll see the probability is exactly the same. To get negative two or less would have been this area under the curve here, and to get a value of two or more is this area here. Just remember that if it's on the right-hand side it's one minus, because the computer always starts counting from the left. So there we go: the value that you find is going to fall somewhere on this curve. I want you to remember the white cell count example, that you can go deeper and deeper and deeper.
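The three tail-area calculations walked through above look like this in code:

```python
from scipy import stats

rv1 = stats.norm  # the standard normal distribution

# P(X <= -2): the area under the curve from negative infinity to -2.
print(rv1.cdf(-2))        # about 0.0228, a 2.3% chance

# P(X >= 0.7): the cdf counts from the left, so subtract from 1
# to get the right-hand tail.
print(1 - rv1.cdf(0.7))   # about 0.242, a 24% chance

# Symmetry about zero: the right tail beyond +2 equals
# the left tail below -2.
print(1 - rv1.cdf(2))     # about 0.0228 again
```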
It is a continuous variable, so you cannot ask what the probability was of getting exactly two; we're not rolling dice here, it's not discrete. I can only ask what the probability was of getting a value as high as two and more, or negative two and less, or 0.7 and more. And it's not about it being a negative or a positive number; you just divide the graph in two by the median that runs right down the middle, and you can calculate what the probability was of getting a value as extreme as yours. As I say, it might very well have been that you did the analysis on two groups and the difference between their means was 2.5. Let's plug in 2.5: there, and lo and behold, you found a statistically significant difference, because your p-value is 0.006, if 0.05 was our cutoff. Okay, I'm going to set it back to 0.7 where it was. Now, there are other forms than the normal distribution, or the standard normal distribution as I should say. We can also create our own to play with, and we can use these arguments. I've combined the whole thing in one go: I'm going to say rv2_set equals pandas.Series, and what do I want inside? stats.norm.rvs, so I'm asking it to give me a random variable set out of the normal distribution. loc means the mean, scale means the standard deviation, and I want only 200 samples. Let's run that, and let's describe it, because we put it inside of a pandas Series, so we can describe it: 200 values, a mean of 9.97, a standard deviation of about 4; we see the values there, it's plotted, and there's our plot.
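Both calculations from this passage can be sketched as below; the lecture doesn't state the loc and scale values explicitly, so loc=10 and scale=4 are assumptions chosen to match the described output (mean of 9.97, standard deviation of about 4):

```python
import pandas as pd
from scipy import stats

# A difference of means landing at 2.5 on the standard normal curve:
# the right-hand tail area is the p-value.
p_value = 1 - stats.norm.cdf(2.5)
print(p_value)  # about 0.006, significant at a 0.05 cutoff

# A custom normal distribution: loc is the mean, scale is the
# standard deviation, and we draw only 200 samples.
rv2_set = pd.Series(stats.norm.rvs(loc=10, scale=4, size=200))
print(rv2_set.describe())  # count 200, mean near 10, std near 4
```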
You can see it's only 200 values now, so it's not as beautifully distributed, but once again you can see there is now a mean of 10; it was more common to get values around 10, so I've just shifted the distribution up. I needn't go just for the standard normal: if you want to play with these things, you can plug in your own mean, your own standard deviation, and however many data points you want. What I want you to take away from this is that this is where the p-value comes from, the area from your value and more, or from your value and less, and that gives us significance. We're going to stop this lecture here and continue in part two, where we'll look at completely random sets and then explain exactly what a Z and a T distribution are.