In a previous video, which I'll link up here and in the description, we looked at simulating the difference in means for the same numerical variable between two groups. In this video I'm going to show you how we can do something similar with the F distribution, by calculating an F ratio. One place we can use that is when we want to compare more than two means; in this example we'll have three means, and I'll show you how to build up a sampling distribution of F statistics. We can then see where our F statistic falls on that sampling distribution, and hence get some sense of how likely it was to have found our F statistic, or one more extreme. As in the previous video, we're going to use Julia, and I'm going to code in Visual Studio Code. Once again, there'll be a link in the description down below to a video where I set up Julia and Visual Studio Code. As always, at the start of my scripting file I have all the packages that we're going to use: DataFrames, Random, Distributions, StatsPlots, and StatsBase. I'm just going to hold down Shift and hit Return after each of these. First you'll see Julia start up and the DataFrames package being imported. There we go. I'll import all the other packages into this active session of Julia as well. So, let's have a look at the F distribution. The F distribution takes two parameters, and they have to do with the numerator and the denominator of the F statistic; the F statistic is a ratio, and I'll show you how all of that comes together. I'm using the plot function, and from the Distributions package you can see I'm using the FDist function, passing it those two parameters. I'll also do a little bit of coloring and put in a label, and there we can see the F distribution for those two parameters.
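As a minimal sketch (the degrees of freedom 2 and 297 are the values this example arrives at later, from p_fit minus p_mean and N minus p_fit; plotting the density values by hand keeps the example dependent only on Plots itself):

```julia
using Distributions, Plots

# FDist(df1, df2) from Distributions.jl; df1 = 2, df2 = 297 are illustrative here
f_dist = FDist(2, 297)

# Evaluate the density over a grid and plot it
x = range(0, 6, length = 200)
plot(x, pdf.(f_dist, x), color = :blue, label = "F(2, 297)")
```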
You can certainly see this is not a normal distribution: it's not symmetric, and it's certainly not centered around a mean. What we're going to calculate is an F statistic, or F ratio, which will fall somewhere on the x-axis. The way we'll think about how likely it was to have found our test statistic, or one more extreme, is this: if it fell at, say, two, we draw a line straight up from two, and everything from there towards positive infinity, that area under the curve, is the likelihood of having found our F ratio or one more extreme. In that way, with randomization, we're going to be able to simulate a p-value. So let's set up our data. In this instance we're going to have 300 samples in our data set, with equal group sizes; that's obviously not necessary, but here we'll just have 100, 100, and 100. If I divide 300 by 3, I get back a floating-point value in Julia, so I have to stipulate that I want an integer, and that becomes important when we use the repeat function, which needs an integer. So there I have N and N_group, with the values 300 and 100. I'm also going to set up an ID column, because we're going to build a data frame, and I'll just use a range object, 1 to 300, since 300 is stored in N. Then I create a variable called group, for which I vertically concatenate three vectors using the vcat function. I create each of my vectors with the repeat function; I'm being very verbose here just so that you can clearly understand how I put this together. To the repeat function I pass a vector that contains one element, a string holding the Roman numeral I, and then: how many of those repeats do I want?
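A minimal sketch of that setup (variable names follow the narration; the Int conversion is the detail the narration flags):

```julia
N = 300
N_group = Int(N / 3)   # 300 / 3 gives 100.0, a Float64; Int() converts it for repeat

id = 1:N               # a range object standing in for the ID column

# vcat vertically concatenates the three repeated one-element vectors
group = vcat(repeat(["I"], N_group),
             repeat(["II"], N_group),
             repeat(["III"], N_group))
```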
Well, I want 100, and this is where it becomes important to place an integer there: if I didn't use Int, I would have 100.0, which is a floating-point value, and you can't use that with the repeat function. Then I repeat the Roman numeral II 100 times and the Roman numeral III 100 times, and I vertically concatenate all of those, so I have one long vector of 300 elements. Let's execute that, and there's the vector; as you can see, if I hover over it, it starts with 100 ones and goes on from there. My next variable is going to be a continuous numerical variable that I'll call mass, just so it can fit in with anything you want, whether it's a product on a shelf or from a factory floor, or the mass of some animal or organism; it doesn't matter. I'm going to take all of these values from normal distributions, and because I want three groups, I stipulate three means and three standard deviations: means of 98, 100, and 105, and standard deviations of 20, 15, and 10. So I set all six of those values, because now I'm going to use vcat again and assign a long vector of 300 elements to the mass variable, where each piece that I'm vertically concatenating comes from a normal distribution. I use the rand function; its first argument is what we want to draw the random values from, so from the Distributions package I use the Normal function, stipulating mu1 and sigma1, that is, a mean of 98 and a standard deviation of 20, and then how many I want, which is 100. The second group comes from a mean of 100 and a standard deviation of 15, and the third from a mean of 105 and a standard deviation of 10. So I do all of those, and you can see they'll fit neatly with group one, group two, and group three. I'm also going to seed the pseudo-random number generator; there's nothing special about using 12, but if you use 12 you'll get exactly the same pseudo-random numbers.
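That generation step can be sketched as follows (a self-contained sketch; the standard deviations 20, 15, and 10 are as read off the narration):

```julia
using Random, Distributions

Random.seed!(12)   # seeding makes the pseudo-random draws reproducible

mu1, mu2, mu3 = 98, 100, 105          # the three group means
sigma1, sigma2, sigma3 = 20, 15, 10   # the three standard deviations

# 100 draws from each normal distribution, vertically concatenated
mass = vcat(rand(Normal(mu1, sigma1), 100),
            rand(Normal(mu2, sigma2), 100),
            rand(Normal(mu3, sigma3), 100))
```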
So I just highlight all of them, hold down Shift and hit Return (or Enter), and I have my 300-element vector. Then, from these three variables, id, group, and mass, I'm going to create three columns in my data frame. I assign my data frame to the variable df, and from the DataFrames package we use the DataFrame function; I always show you where these functions come from by using the package name as well. Remember, we don't have to use symbol notation or quotation marks when it comes to the column headers, the names of the columns, which will then be our statistical variables; we can just write out the name. So to id I assign my 300-element vector id, to group I assign my 300-element vector group, and to mass I assign that 300-element vector mass. If we execute that, I get a data frame that's 300 by 3. You can see the three columns there: id, group, and mass. id is a 64-bit integer, group is a string, and mass is a 64-bit floating-point number. That means that from group, which is a categorical variable, I can create my groups, and the variable that we're going to compare between these groups is a continuous numerical variable, so no problem there. As always, I like to create sub-data frames; if I'm working on different parts of a data project, it's just easier to work with data frames that only contain the samples or subjects you want. So I'm going to use the filter function. The filter function takes a little anonymous function as its first argument, and that r in r.group is just a placeholder (I think last time we used the word row; you can use x, or whatever you want). It's going to go down each of the rows in the group column, looking for those that contain the Roman numeral I; if that conditional returns true, the row is included, otherwise not. The second argument is just the data frame that the filter function refers to.
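Putting the data frame, the filtered sub-data frames, and the extracted vectors together might look like this (a self-contained sketch, so the data is regenerated at the top; the variable names are as described in the narration):

```julia
using DataFrames, Random, Distributions

Random.seed!(12)
group = repeat(["I", "II", "III"], inner = 100)
mass  = vcat(rand(Normal(98, 20), 100),
             rand(Normal(100, 15), 100),
             rand(Normal(105, 10), 100))

df = DataFrame(id = 1:300, group = group, mass = mass)   # 300 by 3

# One sub-data frame per group; r is just a placeholder for each row
df_1 = filter(r -> r.group == "I", df)
df_2 = filter(r -> r.group == "II", df)
df_3 = filter(r -> r.group == "III", df)

# Plain vectors extracted from the columns
mass_all = collect(df.mass)
mass_1, mass_2, mass_3 = collect(df_1.mass), collect(df_2.mass), collect(df_3.mass)
```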
So I'm going to create three 100-by-3 data frames, and each will now only contain the values for its specific group. I also like to extract vectors of the values that I'm going to work with, and for that I use the collect function. I extract df.mass, so that's all 300 masses (the dot notation just allows me to refer to that column), and I assign it to mass_all, which will be a vector of all 300 masses. Then I do 100-element vectors for each of the groups. Now, we already had these values when we created them with the repeat function, but I'm assuming you might be importing a CSV file, have a data frame, and just want to extract those vectors from the columns. Very quickly then, let's start off with some descriptive statistics. We chose the parameters of the normal distributions from which we took these values, so we know roughly what they're going to look like, but let's use the describe function, which comes from the StatsBase package. With the default settings here in Visual Studio Code, that prints out in the REPL down below. For the 100 values in the mass_1 vector there are no missing values, we see a mean of about 99.71, and we see the quartiles. One thing you don't see is the standard deviation, but as I've mentioned many times before, you can just use the std function from the Statistics package. Let's have a look at the other groups as well, and we see 100.52 and 104.95. So what we're going to look at is: is there a difference between 99.71, 100.52, and 104.95? We've got three means that we're going to compare, and of course we can't just subtract them from each other as we did when we only had two samples, so we're going to simulate many possible F ratios.
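A short sketch of those descriptive statistics (the vector here is a hypothetical stand-in for the extracted mass_1):

```julia
using Statistics, StatsBase, Random, Distributions

Random.seed!(12)
mass_1 = rand(Normal(98, 20), 100)   # stand-in for the group-one masses

describe(mass_1)   # StatsBase summary: mean, quartiles, min and max, but no sd
std(mass_1)        # the standard deviation comes from the Statistics package
```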
Let's do a little bit of data visualization as well. I'm using the @df macro, referring to my df data frame, and I'm going to do a box plot of all three groups. We create this box plot just to show the masses of all three, and the idea is: is there a difference in the means between these three groups? Here we see the medians, the first and third quartiles, and suspected outliers beyond the whiskers. So, let's generate a test statistic. Our test statistic is going to be an F ratio, and I'll show you the equation for it in a little while. What we need for that equation, though, is the mean of all the values combined, so that's the mean of all 300 values, and I assign that to mean_mass. Then we need the means of the three individual groups, so I just assign those as well; we saw them before. Now let me show you the equation. This is a PDF of the Pluto notebook that I used to do this exact same analysis; Pluto, much like a Jupyter notebook, lets us make these nice little notebooks, and I'll certainly make it available on GitHub together with the .jl file that we're working with in Visual Studio Code. If we scroll down, you can see everything we've done before is all there, and what I want to show you is further down: there's a little bit of information for you about the F ratio and the F statistic that you can certainly read, but this is what we're after. It's a very simple idea. First of all, you can see that I have all the values there; I'm representing them in very schematic form, and we only have two groups in this little schematic. You put all the values together and you calculate a mean, and that's going to be our initial model; we call it the mean model (that's one thing you can call it), and it says that whatever this mean is will be the predicted value.
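The box plot and the overall mean from the steps above can be sketched as follows (self-contained, with the data regenerated; the combine/groupby line is my own addition for getting the three group means in one call):

```julia
using DataFrames, StatsPlots, Statistics, Random, Distributions

Random.seed!(12)
df = DataFrame(group = repeat(["I", "II", "III"], inner = 100),
               mass  = vcat(rand(Normal(98, 20), 100),
                            rand(Normal(100, 15), 100),
                            rand(Normal(105, 10), 100)))

# @df lets column names be used directly inside the plotting call
@df df boxplot(:group, :mass, legend = false, ylabel = "mass")

mean_mass = mean(df.mass)                                   # mean of all 300 values
group_means = combine(groupby(df, :group), :mass => mean)   # the three group means
```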
Since this variable is mass, given any random subject we're going to predict that its mass is the mean of our model, and that's a very poor model, of course. What we're interested in is the difference between each of the values and the mean, and because we subtract one from the other, depending on the order we might get negative and positive numbers, so we square all of those. If you think about it, we're well on our way to thinking about the variance, because the variance is the difference between each value and the mean, squared, with all of those squared terms added up and divided by how many there are minus one, for a sample variance. We're not going that far, though; here we're only considering the squared difference between each value and the mean. Just keep that in mind. Then there's the fitted model, and we use these terms because it's very similar to thinking about linear regression, so if you know a little bit about linear regression, keep that in mind too. We think about it in a slightly different way here, because we don't have a numerical variable on the x-axis, so we can't fit a least-squares line. We now separate the two groups, or in our instance three, and if you've got more than three you obviously separate those out as well, and for each of those we calculate a new mean. Every group now has its individual mean, and that mean will be the predictor for that group: given the group that the subject is in, we use that group's mean as the predictor for the numerical variable. Again, that's a fairly poor model, but it's at least going to be better, because given the group we have a better understanding, a better prediction, of the numerical variable from its specific mean versus the overall mean. That's what we call the fitted model. We just look at each of the groups individually, and once again, for each of those, you can see that in the schematic I'm only highlighting one of the differences in each group.
But again, it's the difference between the value and the mean, squared, and we're just going to sum over all of those; once more we're well on our way to looking at the variance, we're just not dividing by how many there are minus one. The F ratio, then, captures the idea of the difference between these two models: the mean model, which is very poor, and the fitted model, which is (let's put "obviously" in quotation marks) better. It's this ratio between the two. You'll also see the terms between and within, the between-group variance and the within-group variance, and that might be slightly confusing, so I think this is just a better way to think about it: somehow, the F ratio is going to demonstrate, or at least express, the difference between these two ideas, the very poor mean model and the better fitted model. For all of them, what we want to remember is that we sum over squared differences: we have a mean, and for each individual value we take the squared difference from it. There you can see the equation; we call it SSM, the sum of squares around the mean, and that's just the mean minus each of the values, squared, summed. And then there is the F ratio itself. Both the numerator and the denominator of the F ratio are themselves fractions, so in each of them you can think of it as, sort of, a variance over a variance, but there is a bit of a difference here, because the numerator starts with SSM minus SSF. Now, I haven't told you what SSF is; we've seen SSM, so all we're going to do is go back to this nice little schematic. What we did for SSM was sum all the squared differences: for each of the values, from the top to the bottom, you subtract the mean from it (or it from the mean, it doesn't matter, because we square all of those) and you just add them all together.
With the fitted model, we have these sums separately for two groups, three groups, four groups; and yes, indeed, you can do it for two groups, you don't have to use the t distribution. So there we go: for each group we have its sum of squares, and all we do with those individual sums of squares is add them all up. SS fit is just the sum of each group's individual sum of squares; that's all SS fit is. SSM is going to be larger than SS fit, because SS fit is better, so its sum of squared errors is going to be less. Once again, it's very simple: just compute the squared errors for each group individually (in the schematic you end up with two, in our example we'll end up with three) and add all of those; it's a simple summation. Then, in the numerator, we divide by this quantity p_fit minus p_mean, where p stands for parameter: how many parameters did each model take? If we go back, our mean model had only one parameter, there was only one mean in it, so p_mean is one. In the schematic's fitted model we had two means, so two parameters went into calculating that SSF, and p_fit would be two; in our example we'll have three means, three parameters, so p_fit will be three. And if we look at the denominator, it's SSF again: if you think of SSM minus SSF over SSF, that's just the proportion we're trying to express. The denominator's divisor, though, is the complete sample size minus how many parameters there were in the fitted model. Keep that in mind; that is what calculating this F statistic is all about, and we're going to do it in code. So, back to Visual Studio Code, and that's where we see p_mean and p_fit: there was only one mean in my mean model, and there will be three means in my fitted model.
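Written out, the quantities just described are (using k groups of sizes n_j, overall mean x-bar, and group means x-bar-sub-j; here p_mean = 1, p_fit = k = 3, and N = 300):

```latex
\mathrm{SS}_{\mathrm{mean}} = \sum_{i=1}^{N} \left( \bar{x} - x_i \right)^2
\qquad
\mathrm{SS}_{\mathrm{fit}} = \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left( \bar{x}_j - x_{ij} \right)^2

F = \frac{\left( \mathrm{SS}_{\mathrm{mean}} - \mathrm{SS}_{\mathrm{fit}} \right) / \left( p_{\mathrm{fit}} - p_{\mathrm{mean}} \right)}
         {\mathrm{SS}_{\mathrm{fit}} / \left( N - p_{\mathrm{fit}} \right)}
```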
So I'm just saving those. Let's calculate SSM; I'm going to call it SS_mean for descriptive purposes, and I'm going to sum over all the squared differences. There are my differences: I'm taking mean_mass (remember, that was the mean of all of them together) .- mass. That .- means do it per element: it's going to take the mean and, for each of the 300 values in mass, subtract that value from the mean. You have to put the dot there because the mean is only a single value while mass holds 300 values, so the dot indicates that the operation is done element-wise. Then .^2: each of those differences is squared, and all of that combined goes into the sum function. So I'm summing over all the squared differences, and I get 73,888. Now I'm going to do the same for each of the three individual groups; no difference there, other than that I've got a different mean. Now it's mean_mass_1, and I'm subtracting each element in mass_1 from it, again with .- and .^2. So I'm going to get three values, and all I do in the end, for SS_fit, is add all three of those. There we go. Now, eventually, we can compute the F ratio that we saw: SSM minus SSF divided by p_fit minus p_mean, over SSF divided by N minus p_fit, and we see an F ratio of 3.26975 and so on. That is our F statistic, and if you've looked at textbooks or tables before, you'll have seen an F statistic; ours is going to fall somewhere on that F distribution plot that we saw, specifically the one for those two parameters. And if you ever wondered where those two parameters come from, they're right there: p_fit minus p_mean is the first parameter of the F distribution, and N minus p_fit is the second. As simple as that. So now it's time to do our randomization. Under a simple null hypothesis, we're going to state that these three means are all equal to each other.
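The whole calculation can be sketched end to end (self-contained, so the exact F value depends on the local random-number stream rather than matching 3.26975):

```julia
using Statistics, Random, Distributions

Random.seed!(12)
mass_1 = rand(Normal(98, 20), 100)
mass_2 = rand(Normal(100, 15), 100)
mass_3 = rand(Normal(105, 10), 100)
mass_all = vcat(mass_1, mass_2, mass_3)

N, p_mean, p_fit = 300, 1, 3

# .- and .^ broadcast: subtract each value from the mean, then square, per element
SS_mean = sum((mean(mass_all) .- mass_all) .^ 2)

# Same idea per group, with each group's own mean, then summed
SS_fit = sum((mean(mass_1) .- mass_1) .^ 2) +
         sum((mean(mass_2) .- mass_2) .^ 2) +
         sum((mean(mass_3) .- mass_3) .^ 2)

# The two divisors are also the F distribution's two parameters: 2 and 297
F_ratio = ((SS_mean - SS_fit) / (p_fit - p_mean)) / (SS_fit / (N - p_fit))
```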
Under that null hypothesis, it does not matter which group a subject falls in; we can swap any two of them around, because the null hypothesis states that there's no difference between these groups, so I can just interchange them. And that's what we're going to do. We put all the values together, shuffle the whole 300, and then keep the same sample sizes as before, which in our case, conveniently, were 100, 100, and 100. If I take all of the shuffled values, the first new 100 are group one, the second new 100 are group two, and the last 100 are group three: I have three new groups, and I can calculate the F ratio all over again from scratch. Take my 300 values, shuffle them, make three new groups, calculate a new F ratio; and if I do that over and over again, I'm going to get this sampling distribution of F ratios. To hold these values, we start off by creating an empty vector, which I'm going to call F_ratio_sampling, and this time we're going to create 20,000 F ratios through randomization. I assign 20,000 to a variable, and I do that because if I want to change it later to 40,000, or to something less, I can just change that one value. Once again we can use a for loop; you can certainly write much better code than this, but I'm being very verbose, as before, just so that you can see exactly what happens. My for loop is going to run from 1 to 20,000. The first thing I do is shuffle the mass values (remember, the masses are all 300 values): the shuffle function from the Random package reorders them randomly, and I assign this new order to shuffled_mass. Then I calculate the mean of this newly shuffled mass. Now remember, that's going to be no different from before, because all 300 values are still there, so nothing's changed.
Next I compute a resampled SSM, which is just the mean of the shuffled mass minus each of the shuffled values, squared and summed, just as we had before for SSM. Again, this is going to be no different, because in shuffling them around they all stay together; I'm not taking a random selection with replacement, so it's just the same as before. What's more interesting is that I'm going to have three new, different groups. I create a new group one, which is the first 100 values in shuffled_mass; my second group is from 101 to 200, and the last one from 201 to 300. Just bear in mind, if your sample sizes were different, keep those same sample sizes. Because the values are all shuffled now, there's a reassignment of the group that each subject is in, but under the null hypothesis this should make no difference, because the null hypothesis states that the three means are equal to each other anyway. Then I calculate the mean for each of these three new groups, calculate the three new fitted sums of squared errors, and add all three of those so that I have a new SS_fit. Then I calculate a new F ratio, call it a reassigned F ratio, computed exactly as we did before, and I append it to my initially empty vector, the one we created above. And then I run through this 20,000 times; on a powerful computer with lots of RAM, no problem, it's very quick. Now let's have a look at the histogram of this distribution of F ratios. Remember, if we come to the left-hand side, we can see the plot we made before of the F distribution using the equation; that's what it looks like for those two parameters.
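The loop described above, as a self-contained sketch (the names reps, shuffled_mass, g1, g2, and g3 are my own; hypothetical data stands in for the observed masses):

```julia
using Random, Statistics, Distributions

Random.seed!(12)
mass_all = vcat(rand(Normal(98, 20), 100),
                rand(Normal(100, 15), 100),
                rand(Normal(105, 10), 100))
N, p_mean, p_fit = 300, 1, 3

F_ratio_sampling = Float64[]   # empty vector that will hold the simulated F ratios
reps = 20_000                  # change this once to resample more or fewer times

for _ in 1:reps
    shuffled_mass = shuffle(mass_all)   # random reordering of all 300 values

    # Unchanged by shuffling: all 300 values are still present
    SS_mean = sum((mean(shuffled_mass) .- shuffled_mass) .^ 2)

    # Three new groups of the original sizes
    g1, g2, g3 = shuffled_mass[1:100], shuffled_mass[101:200], shuffled_mass[201:300]

    SS_fit = sum((mean(g1) .- g1) .^ 2) +
             sum((mean(g2) .- g2) .^ 2) +
             sum((mean(g3) .- g3) .^ 2)

    reassigned_F_ratio = ((SS_mean - SS_fit) / (p_fit - p_mean)) /
                         (SS_fit / (N - p_fit))
    push!(F_ratio_sampling, reassigned_F_ratio)   # append to the sampling distribution
end
```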
Let's look at our 20,000 simulated values. I'm going to use the histogram function on those 20,000 values, but I only want 60 bins, a bit of transparency, and a label. Let's see what it looks like; and look at that, the same, or at least a similar, sort of shape as our theoretical distribution. Now, our F ratio falls somewhere here (remember, it was three point something), and we want to know how likely it was to have found this F ratio or one more extreme. Wherever our F ratio sits on the x-axis (this is a histogram, so every bin counts how many of the values fell into it), we just want to know what fraction of all 20,000 values were equal to our F ratio or more extreme, that is, towards the positive-infinity side. And that's all that remains. I'm going to ask F_ratio_sampling .> F_ratio, which goes through all 20,000 values and returns true or false depending on whether each one is bigger than the F ratio. Remember, true is stored as a one and false as a zero, so if I sum over all those zeros and ones, I get the sum of the true values: how many of the 20,000 values were more than our initial F ratio. I'm just dividing that by 20,000 so that I get a fraction, and if we do that, we see 0.039. That is very close to using the usual statistics, with the mathematical equations that come with them, to calculate a p-value from an F distribution given those two parameters. So that is, I think, a very intuitive, or at least different, way to think about how likely it is to find your value given the null hypothesis. Now, we don't have access to the whole population; the only thing we have is our sample, and we're assuming that the sample looks very much like the population, so we're using it as a proxy for the population.
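The histogram and the empirical p-value can be sketched like this (with a hypothetical stand-in for the vector of simulated F ratios; drawing from FDist(2, 297) simply gives it the right sort of shape):

```julia
using Plots, Distributions, Random

Random.seed!(12)
F_ratio_sampling = rand(FDist(2, 297), 20_000)   # stand-in for the simulated values
F_ratio = 3.26975                                # the observed F statistic

histogram(F_ratio_sampling, bins = 60, alpha = 0.5, label = "simulated F ratios")

# .> returns true/false per element; true sums as 1, false as 0
p_value = sum(F_ratio_sampling .> F_ratio) / length(F_ratio_sampling)
```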
Because we're doing this reassignment through randomization over and over again, we're building a distribution of possible F ratios, and we're assuming that it's going to look very similar to what we'd get from the population; if the samples were taken properly, it should be quite close. And here we've estimated, we have some idea of, how likely it was to have found our F ratio. We can now state that if our alpha value was 0.05, then 0.039 is less than 0.05, so we reject the null hypothesis and say, well, there is a difference between those three means, and now we can start thinking about post hoc analysis. Just to show you the countmap function here: it shows me how many trues and falses I had. If I look at it, zero is false and one is true, and 773 of those 20,000 were larger than our F ratio; that's how we got to that fraction of 0.039.
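And countmap with a tiny hypothetical vector, to show the tally of trues and falses:

```julia
using StatsBase

F_ratio_sampling = [2.1, 3.5, 0.8, 4.0, 1.2]   # hypothetical F ratios
F_ratio = 3.26975

# countmap tallies how many comparisons came out true and how many false
counts = countmap(F_ratio_sampling .> F_ratio)   # Dict(true => 2, false => 3)
```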