So in this next notebook we're going to build on what we had before, and I'm going to show you this idea of a sampling distribution, but based only on the data that we have for our sample. We no longer know what the data looks like for the whole population, but somehow we still have to express our findings in terms of a bigger whole: how likely was it for us to find the statistic that we did? How does it fit into the bigger picture, and can we express this as a probability, the likelihood of this result having been found? So, notebook 09, comparing means, and let's have a look at the packages we're going to use. Those are the usual ones we've seen before. We've got the google.colab data_table extension there, loaded with the %load_ext magic, so that we get nice tables printed to the screen; that's specific to Google Colab. In case you're running this on a Retina display, you would run the cell with the %config magic command. We're definitely going to import a spreadsheet file from our Google Drive, and then the usual suspects: pandas, NumPy, and the stats module from the SciPy package, and then all our usual Plotly modules. So what I'm going to do, and you know the drill by now: I'm going to connect to my Google Drive, and then I'm going to change directory with a %cd magic, passing as a string the whole folder structure on my Google Drive, so that I can get to the data subfolder, because that is where my spreadsheet files live.
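For reference, the portable part of that setup cell boils down to a few imports. This is a minimal sketch: the Colab-only steps (the data_table extension, mounting Google Drive, the %cd magic) are noted in comments rather than run, so the snippet works anywhere.

```python
# Minimal, portable version of the notebook's setup cell.
# Colab-only steps are omitted:
#   %load_ext google.colab.data_table   (nice table display)
#   from google.colab import drive; drive.mount(...)
#   %cd <your Drive data folder>
import pandas as pd
import numpy as np
from scipy import stats
# import plotly.express as px           # plotting, as used in the notebook
# import plotly.graph_objects as go
```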
So I'm going to click the button, run the cell, log back into my account, and copy and paste that security key, and there we are, we're back in business. The data file we're going to work with is a comma-separated values file, data.csv, and we're going to use the pandas read_csv function and assign the result to the variable df. So let's do that; it imports the data file for us and now we have a DataFrame. Let's have a look at it, printing it to the screen with the nice table display we set up before; this is a data set we've seen before. Let's also look at the shape attribute so we know what we're dealing with: 200 rows of data and 13 columns, so 13 variables.

Now, comparing the distribution of a numerical variable between two independent groups. Let's be clear about our research question: is there a difference in heart rate values, so the beats per minute, the HR variable, between the active group and the control group? We know we have a group variable; there's a column called group, and remember, if we call df.columns we get a list of all the columns. If we scroll up, there's our group column: a categorical variable, active, active, active, control, active, control, and so on. The participants were in two groups, and we're looking at the heart rate column, the heart rate of those participants, and we want to know if there's a difference in heart rate between the two groups. To be very clear: we are forming the groups from the sample space elements of a categorical variable. Group is a categorical variable with a sample space of two elements, so it generates two groups of participants; heart rate is a numerical variable, and we're going to compare that very same variable between the two groups. That is what we mean by comparing means between two groups: the groups are formed by the sample space elements of a categorical variable in our data set, we take a numerical variable, and we compute a difference of means. Then we somehow have to construct those histograms we saw before and decide whether the difference we found is a rare finding, or a difference that is not rare at all.

So let's do that. Remember, we're going to use conditionals here, and I'm not using .loc or .iloc, which I actually should; this is shorthand notation, but it works. I'm going to create two NumPy arrays, hr_control and hr_active, because those names are nice and descriptive: the numerical variable HR, heart rate, for the control group and the active group. We say df, then go down the group column and keep all the rows where it says control, so a double equals sign, == 'control'; all the rows where that conditional is True are included. From those rows, the ones with the value control in the group variable, we take the heart rate column and convert it to a NumPy array with the .to_numpy() method, and then the same for the active group, with the same kind of conditional. Now we have two NumPy arrays to work with.

First, let's get back to basics. In data science, when we look at data, the first thing we do is summary statistics; we've got to start teasing the story out of the data, getting the knowledge that's hidden in it out, and the first way to do that is to compute numbers that are representative of the whole. So, descriptive statistics. I'm using the groupby method there: df.groupby, grouping by the group variable, that categorical variable with its two sample space elements, then selecting heart rate, and I use the .describe method because it gives me a count. I see there are 100 participants in the active group and 100 in the control group, and as far as heart rate is concerned, a mean of 76.88 in the active group and 72.43 in the control group. We want to know: is there a difference between those two means? I can subtract 76.88 minus 72.43 or 72.43 minus 76.88 and get a positive or a negative number, but it starts giving me the idea that there really does seem to be a difference between these two groups as far as heart rate is concerned; we just have to express somehow whether that difference is big enough. We also see the spread of the data, which is fairly even: the minimum, the first quartile, the second quartile or median, the third quartile, and the maximum, so we have some idea of the spread.

Once we have that bit of statistics, let's visualize the data, because that is actually a bit more powerful. There we see a box-and-whisker plot: it's px.box, I pass my DataFrame, on the y-axis I want heart rate, and I want it grouped by whatever we find in the group column, active and control; the argument we use for that is the color argument, remember, and then I set a nice title. There's a little outlier there, and by now you know what the lower fence means: take the interquartile range, the third quartile value minus the first quartile value, multiply it by 1.5, and subtract that from the first quartile value; that's the lower fence. And this minimum value here seems to be a statistical outlier: look at that, a heart rate of only 24, so obviously something odd is going on, and the plot clearly flags it as a possible statistical outlier.

So we have some idea that there seems to be a difference between these two distributions. Let's employ the scientific method called hypothesis testing. Our null hypothesis, and note that we haven't gathered any data at this point, this should actually be stated before we do our study, in the protocol of our research, is that there is no difference between the two heart rates: the mean heart rate in the active group equals the mean heart rate in the control group. The mean is our test statistic, and it means that if we subtract one from the other we should get zero; another way to state the null hypothesis is that the difference between the means of the two heart rates is zero. Our alternative hypothesis is two-tailed, so it doesn't matter which one we subtract from which; we simply say the two means are not equal. So let's subtract one from the other, in the order I chose there, and I get 4.45, but remember, this could also be minus 4.45 if I flip the subtraction around. Now I somehow have to decide: is that a meaningful difference, a real difference, or not? And we've got to build on what we had in the previous notebooks, because this time we don't have access to the whole population; this sample is all we have. But listen very carefully to our null hypothesis again: it states that there is no difference between these two groups. If that were really true, it wouldn't matter which group a participant was in, because the null hypothesis says the two are equal, so we can randomly reassign a participant to another group.
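Just to pin down the test statistic, the difference of means, before we go on, here is a tiny deterministic sketch. The heart-rate values below are invented purely to show the arithmetic; only the calculation matters.

```python
import numpy as np

# Hypothetical heart rates, three participants per group; the values
# are made up purely to illustrate the test statistic
hr_active = np.array([75.0, 77.0, 79.0])   # mean 77.0
hr_control = np.array([70.0, 72.0, 74.0])  # mean 72.0

# The difference of means; flipping the order of subtraction
# only flips the sign, which is why the test is two-tailed
diff = hr_active.mean() - hr_control.mean()
print(diff)  # 5.0
```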
We can throw them all into one basket and generate two new groups from that basket, because our null hypothesis states there is no difference between the two. That is crucial for this kind of analysis: the null hypothesis is the assumption that there is no difference, and if we then find a very common difference, we cannot reject that null hypothesis. The distribution of our statistic is going to be based on the null hypothesis being true. In actual fact we don't know that; we don't know what the whole population looks like, but we have to start somewhere, and our base assumption is that there is no difference. We say: given that it's true, and we don't know that it is, we just say given that it is true, we can randomly reassign these participants, it doesn't matter which group they're in, sample from the whole bunch, and build a distribution of possible differences. This is a very deep and crucial point in data analysis, and I really want it to sink in: we assume the null hypothesis to be true, we have no idea whether it is, but we assume it is, and hence we can build a distribution based on that assumption. It is just an assumption, and I think you're starting to see what statistical analysis really means. It is a human construct: we're not really proving a difference. We build a framework, we draw a line in the sand, and we say if the result is beyond that line, then we call it a difference. We never prove it; that's just not the way it works. So let's do this. Remember, we have 200 samples in our data set, and it just so happens that the groups were equal, 100 participants in each. What we now say is: let's randomly reassign these people, so we chuck all 200 into a bag; they're all in there together.
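That reassignment idea can be sketched in a few lines, assuming the 200 heart-rate values sit in a single NumPy array. The data here is simulated and the variable names are my own, so the numbers will not match the notebook's, but the mechanics are the same.

```python
import numpy as np

rng = np.random.default_rng(42)  # seeded so the sketch is reproducible

# Simulated stand-in for the notebook's 200 heart-rate values:
# first 100 play the role of the active group, last 100 the control group
hr_all = np.concatenate([rng.normal(77, 10, 100),
                         rng.normal(72, 10, 100)])

# Observed test statistic: difference between the two group means
observed = hr_all[:100].mean() - hr_all[100:].mean()

# Redo the "study" 10,000 times under the null hypothesis:
# each iteration draws all 200 values without replacement into
# two columns of 100, i.e. a complete random reassignment
mean_stat = []
for _ in range(10_000):
    grouping = rng.choice(hr_all, size=(100, 2), replace=False)
    mean_stat.append(grouping[:, 0].mean() - grouping[:, 1].mean())
mean_stat = np.array(mean_stat)

# Two-tailed p-value: the fraction of reshuffled differences at least
# as extreme as the observed difference, in either direction
p = ((mean_stat >= abs(observed)).sum() +
     (mean_stat <= -abs(observed)).sum()) / mean_stat.size
```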
Now, because our assumption is that they are all the same, we're going to do the following: we imagine that we could do our study 10,000 times over; you can pick any large number. Each time we draw 100 participants at random for the one group, and the remaining 100 form the other group, the same group sizes, we have to end at 200, but randomly reassigned, all from one mixed bag, and we just draw the names, no problem whatsoever. So I create an empty Python list and assign it to the variable mean_stat, and then I run a for loop 10,000 times. Inside the loop I create a variable, grouping, which is a random choice from df.HR, that column of all 200 values. And here's something new: I pass a size argument, size=(100, 2), which builds two columns of 100 values each, and I say replace=False because I don't want the same individual to be selected twice. So that is a complete random reassignment of individuals into two new sets of 100, all mixed and matched. Then I divide them into two groups, one I call group I and the other group II, Roman numerals. For group I, I take this grouping of two columns of 100 each and I want the first 100 rows, so I use the slice from 0 to 100, which is actually all of the rows, remember it's two columns of 100 that I created with the size argument, and I want the first column, which is index 0. For group II we use the same grouping, again slicing all the rows, and remember, all I actually needed was a colon, because a colon on its own means all of them, but let's say 0 to 100, which is all of them, and I want the second column, index 1. I just want you to get this, because it's important Python code: with the size argument set to the tuple (100, 2), the random choice function takes the 200 values, mixes them up randomly, and creates two columns of 100 each. I take the first column, all 100 values, and assign it to group I, and all 100 values in the second column to group II, and then I take the mean of each, so I have two new random means. I append group I minus group II to my empty list, the mean of one reassigned group minus the mean of the other, so I start appending my test statistic, the difference in means, building them up until eventually I have 10,000 different mean differences. And see how different this is from sampling from the actual population: we don't have access to the population, only to our little sample of 200 individuals, so we have to resample from them, and what we do is randomly reassign them to new groups, 10,000 times over. Great stuff. So let's look at the distribution of possible differences, and look at that, it approximates a nice normal distribution. I must point out that this is not actually a normal distribution; it's a t distribution with 198 degrees of freedom, but it approximates a normal distribution really nicely. What I've also drawn here is the difference that we found, way out here. So here are all the possible differences, given random reassignment of our participants under the null hypothesis that there was no difference between the two; I can just mix and match, take someone out of one group and randomly put them in the other, so that I still end up with the same group sizes as my
original 100 and 100. So this is the last step we had to take: we went from sampling from the population to sampling only from our own data, and it gives us the distribution of all possible differences, and we see where our difference falls. I want to remind you, though, that our difference could also be on the other side; but let's look at it just as it is for the moment, on the one side. So I can ask a question. Remember, heart rate is a continuous numerical variable, so I can't ask for the probability of a single value anymore; the values are just rounded off, and if I had better instruments I would have more decimal places. It's not like rolling two dice and asking for the probability of a 10 or more, where we simply add up the individual probabilities; we can't do that anymore. What we're really interested in is how many of all these differences were greater than our difference, and our difference was 4.45 in the order that I did the subtraction. So I count, of all my 10,000, how many were out here, more than that, and dividing by 10,000 gives me the proportion on the right-hand side greater than 4.45. That's what I do here: I take mean_stat, remember, 10,000 values in a Python list, and pass it to the numpy array function, because I need an array to use a conditional, greater than 4.45. That gives me a bunch of Trues and Falses, and because a True counts as one, I can sum over all of them, which gives me the number greater than 4.45; then I divide by 10,000 to get a proportion, what fraction it was. It says 0.005, so half a percent of all possible differences, having redone my study 10,000 times under the null hypothesis, reassignment only, were larger than that. So the difference we found initially in our data was a very unlikely difference to find. But remember, we could also have done the subtraction in reverse, a two-tailed alternative hypothesis, so we've got to do the negative 4.45 as well, and that's what the next bit of code does: how many were less than minus 4.45. Exactly the same procedure: I pass mean_stat, my list, to the numpy array function, ask with a conditional how many were less than minus 4.45, sum over all the True values, and divide by the total, the 10,000 values in the list, and I see 0.0038. Now I add these two proportions, these two fractions, together, and I get about 0.009, roughly 0.9 of a percent. That is a very unlikely finding; it's very rare. And once again, as human beings we've decided to draw a line in the sand somewhere, and we draw it at 5 percent: if the values we found make up less than 5 percent of the total, 0.05 of the 10,000, we call it out and say we reject our null hypothesis, we accept our alternative hypothesis, and we say there is a difference in heart rate between these two groups. But once again: have we really proved this? No. We built a framework, and we said: based on the null hypothesis, let's create a distribution of 10,000 or 20,000 or 30,000 differences. Remember, before computers came along this wasn't so easy, was it? But now it's trivial; I can say do it 20,000 times, 100,000 times, who cares. I can build a distribution of all the possible differences by continuously resampling from my actual values, knowing nothing about the population, and I can place the difference from my one study somewhere on it and say: given the null hypothesis, this is the distribution of possible differences I could get, here is mine, and it's quite rare. I draw the line in the sand at 0.05, and because mine is beyond that, I reject my null hypothesis and accept my alternative hypothesis. It's a framework, decisions that we made, this is how we're going to do it, and now you can see how it all fits together. Just to remind you: through mathematics we have this thing called the t distribution, a very nice equation, you can look it up online, and I can use stats.ttest_ind, where ind means independent groups, these two groups being independent of each other. I pass the two NumPy arrays of heart rate values, and it gives me back two things, a t statistic and a p-value, and if I look at the p-value I see 0.009, which was our approximation too, 0.009. So we can approximate this p-value by continuous resampling, and it's this continuous resampling, I think, that gives us a good and true understanding of what the probability is of finding the values that we do find, putting them in the picture of something bigger. Don't worry about this next bit of code; I'm just going to show you, using those mathematical equations, what this actually looks like. There's my t distribution, and I see my critical values there, the orange lines, representing two and a half percent on this side and two and a half percent on that side, and there are my actual differences converted to t values. We're looking at this little area under the curve on each side, and it's clearly beyond my critical t values; two and a half percent on the one side and two and a half percent on the other combine to five percent of the area under the curve, and we have a statistically significant difference: we reject the null hypothesis and accept the alternative hypothesis. And I just want to say this a million times over: I built this
distribution of all possible differences under the assumption that there was no difference between the two groups. As simple as that. So let's do another example: we're going to compare the means of the systolic blood pressure between age groups, and the reason I do this is to rehash a little bit of Python code, because age, remember, is a continuous numerical variable. I can't divide the participants up by a continuous numerical variable; I've got to do some binning, creating a categorical variable from a numerical variable. Let's take an arbitrary cutoff and generate a new column in our DataFrame, df and then age group, and I'm going to assign to it just two groups, with my cutoff at 65: everyone younger than 65 goes into group 1, and everyone else into group 2. I'm using the where function in NumPy, numpy.where, and I pass it a pandas series, the age column, with the conditional less than 65. The where function says: if the first part is True, give it this value, and if it's False, give it that value; that's all the numpy where function does, and I think you know it, we've looked at it before, so this is just a rehash. Let's look at our age group value counts now: 152 participants in group 1 and 48 in group 2, so no longer 100 and 100; that's something new. First, let's do some summary statistics on all of this. There we see group 1 has 152 participants and group 2 only 48, and we see a difference as far as the SBP is concerned, the one we're interested in here: SBP, systolic blood pressure; if you look at your blood pressure, 120 over 80, that's the top number. In the younger age group the mean systolic blood pressure was 153, and in the older group it was much higher, at 168, and we see the standard deviations and the quartiles there. Let's visualize it, now that we have some understanding that there seems to be a difference here, and look at that: again we have an outlier, someone with a systolic blood pressure of 52, and that participant is in real trouble, that's a very low blood pressure; a suspected outlier, and we might have to do something about that observation at some point. Now we can start asking whether there is a difference between these two groups as far as systolic blood pressure is concerned. So let's do the following, and I want to show you a slightly different way to do the reassignment. Before, we used the size argument and passed the tuple (100, 2); you could try that again with 152 and 48 in the two groups, but unequal group sizes don't fit neatly into that approach, so I'll show you something a little different. The first thing I do in this method of reshuffling the participants is make a copy of my DataFrame: df.copy with deep=True, which means completely rewrite it in the computer's memory, because if I start changing something it might influence the original, and I don't want to change my original; this method is going to interfere with it. So I'm making what we call a deep copy, a full carbon copy that lives in a completely different space in memory, and I'll work only with that copy. So let's see this way of repeating my study 10,000 times, based only on the data I have. Again I start with an empty Python list, and I have a for loop that runs 10,000 times. We take the SBP column of the deep copy, so df_copy and then sbp, convert that series to a NumPy array, and assign it to a variable called sbp
internal to my for loop. Then I call numpy.random.shuffle on this NumPy array, and shuffling it can interfere with the DataFrame it came from, so I'd rather mess up my df_copy and keep the original data without any reshuffling; that's why I'm working with a copy, because the shuffle will do that to you. So I call numpy.random.shuffle(sbp), and sbp, remember, is just a NumPy array of all the SBP values, all of them in one hat. Now, there are 200 SBP values, but I don't want to divvy them up 100 and 100; I have a different ratio of participants now. So group 1 is going to be the mean of the first 152; it's a complete reshuffle, they're not in the same order anymore, so I use the slice from 0 to 152, which gives me the first 152, and then the slice from 152 to 200, and remember the stop index is not included, so that gives me the other 48. My resample is therefore in the same proportions as my participants, 152 and 48, and I'm reassigning them by reshuffling; that's a different method of doing this. I calculate the mean of each and append the difference between the two to my empty list. So this is a different way for you to go about it, specifically if you have unequal numbers of participants in the two groups. Once again we view the sampling distribution, and here we just create the two arrays and look at the difference. So let's calculate it: that was our difference between the two groups initially, let's just scroll up, there we go, we had 153 and we had 168, so there's the difference between those two. I'm creating a younger SBP and an older SBP, where the age group was 1 and where the age group was 2, converting those to NumPy arrays as far as the SBP is concerned, and I'm just taking the difference; but remember, I could also have done the subtraction the other way around, so there would have to be a positive 14.8 difference as well: that's the two-tailed alternative hypothesis. Let's use graph objects to plot this, and we see our 10,000 possible differences, and once again the actual difference, negative 14 or positive 14, that we found; and I think you can well guess by now that this is going to be beyond the critical value if our alpha value is 0.05. So let's do that: we sum up the total number that were less than this, divided by 10,000 so that it's a fraction, and all the ones more than 14.8, again divided by 10,000 for a fraction. We do those two, and I see it's very, very few; if I sum them, that is the probability of this difference given the null hypothesis that there was no difference. If we build a distribution from that assumption, this is what we get; we put in the difference we actually found, sum the two fractions of the whole, and we get a very small value. Let's compare that with the statistical way of doing it: we use the t-test again, and the proportion works out to 0.0004, where we got 0.0007 from our 10,000 resamples. It's so small, that's it: we reject our null hypothesis and accept the alternative hypothesis. There was a difference in systolic blood pressure, and that difference of 14 was significant. And that's how we go about it. I think you have a very good understanding now: we deal only with the data that we have, we build a distribution from it under a null hypothesis, and we place our difference somewhere on that distribution. It really doesn't matter what statistic we're working with; we can build a distribution of the possible statistic and see where our study statistic falls on it. It really is a beautiful thing. Okay, just for completeness, or maybe I should
say just for interest's sake: these statistical tests are also built on stringent assumptions, and one of them is that the variances of my two groups have to be equal. What I'm going to show you here has nothing to do with the data we had before; we're simulating new data now. I seed my pseudo-random number generator with the integer 12, and I create two groups, both drawn from a normal distribution with a mean of 100, one with a standard deviation of 10 and the other of 12.1, with 100 participants in each group. So this is completely new, random data, and we see sample standard deviations of about 10 and 12.2 there; if we square those, 10 squared and 12.1 squared, the variances are quite different. We have to know, before we use a t-test, whether those variances are equal, and once again, what difference is big enough, where is the cutoff? For that there's something called Levene's test, so I say stats.levene, pass my two NumPy arrays, and it throws out a p-value for me: 0.04. If my line in the sand is 0.05, I have to reject the null hypothesis that these variances are equal, and I accept the alternative hypothesis that they are not. Given that, I can't use what we call Student's t-test; I have to use an unequal-variance t-test, and of course, no problem for Python, there is such a test: stats.ttest_ind, still the same t-test function, I pass my two arrays, but there's a new argument, equal_var, and I set it to False. That calculates a slightly different p-value for me; there's still a p-value there, but it was computed in a slightly different way, because the mathematical equation for it is slightly different. So that was just a little extra, for interest's sake. What I want you to take away from all of this is really the following: we are given data, we choose a test statistic, we continuously resample under the null hypothesis, and then we place our statistic somewhere on that distribution of possible statistics. As simple as that: we can now express the probability of having found the result that we did.
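As a recap of that last check, a sketch of the variance test and the fallback to the unequal-variance (Welch's) t-test might look like this. The data is simulated, and because I use NumPy's newer default_rng rather than the legacy seeding, the exact p-values will differ from the ones in the walkthrough.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(12)  # seeded with 12, echoing the walkthrough

# Two simulated groups: same mean (100), different spread (sd 10 vs 12.1)
group_a = rng.normal(loc=100, scale=10, size=100)
group_b = rng.normal(loc=100, scale=12.1, size=100)

# Levene's test: the null hypothesis is that the variances are equal
_, p_levene = stats.levene(group_a, group_b)

# If equal variances cannot be assumed, switch to Welch's test by
# setting equal_var=False in the same ttest_ind function
t_stat, p_val = stats.ttest_ind(group_a, group_b, equal_var=False)
```

If p_levene comes out above the 0.05 line, Student's t-test (the default, equal_var=True) would be the conventional choice instead.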