Welcome to this video lecture, where we're going to simulate a p-value. We're going to have two different groups and the same numerical variable, and we're going to compare the means between them. Now, under the null hypothesis, remember, there is no difference in the means. So we're going to use the simplest form of the null hypothesis and state that there's zero difference between the two means: a two-tailed hypothesis with a zero difference.

I'm going to use the Julia language to do this. Now, I do have a video lecture out on how to install Julia, but just in case you haven't done so already, the website is julialang.org, and you can just click on the download button there. You'll see that at the time of the recording, the current stable release was version 1.7. I'm going to do my coding in Visual Studio Code. You can just search for Visual Studio Code, or go to code.visualstudio.com, and download the appropriate version for your operating system.

Now, I've set up my system so that I can type julia in a command prompt, or a terminal window here on the Mac, and it will start Julia for me. There is a command that you can run; I'll show you the one for macOS on the screen. That adds Julia to your PATH environment variable, so you can just type julia and it'll start automatically. The reason I do this is that I have separate environments for all my Julia projects, and I install only the packages for that project in its environment. I don't install any packages in my base Julia installation. Once again, there's a link in the description down below that will take you to a video where I show you how to set this up. So what I've got to do in my terminal window here is just navigate to the folder that holds this environment for me. So there we are in the environment.
I'm just going to show you the files that are in that environment, and you'll see there's a Project.toml file and a Manifest.toml file. Those hold all the information about the Julia environment that I've set up for all my data science projects. And all I have to do now, because I've installed Visual Studio Code, is type code, a space and a period or full stop, and hit return or enter, and that's going to start up Visual Studio Code for me.

And there we have Visual Studio Code, and the file that I'm going to look at here is resampling under the null hypothesis. By the way, in Visual Studio Code, if I click on these little buttons on the left-hand side, you'll see the extensions pop up, and you can do a search, and there is Julia. So you can see I've installed that extension, and that is what allows me to use Julia inside of Visual Studio Code. When I click on settings there, click Extension Settings and scroll down, you'll see the path on my system to my Julia executable. So it's /Applications/Julia-1.7.app/Contents/Resources/julia/bin/julia, with 1.7 being the current version at the time of this recording, so that this extension knows where to find my Julia executable.

So, resampling under the null hypothesis. We're certainly going to look at that, but there'll also be a few other things in this tutorial. We'll look at seeding the pseudorandom number generator, which is new in version 1.7, and we do that so that we get reproducible results when we ask for random values.
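As a minimal sketch of what seeding gives us, here is a small check that uses only the Random standard library. The seed value 12 is the one used later in this lecture; the variable names first_draw and second_draw are made up for this illustration.

```julia
using Random

# Seed the global random number generator, then draw three values.
Random.seed!(12)
first_draw = rand(3)

# Re-seeding with the same integer restarts the same sequence.
Random.seed!(12)
second_draw = rand(3)

println(first_draw == second_draw)  # true: the two draws are identical
```

The actual numbers drawn can differ between Julia versions, so the comparison is made between two draws in the same session rather than against fixed values.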
I'm going to show you how to create a data frame again, how to sample from categorical values using StatsBase, how to sample from a normal distribution using the Distributions package, how to make sub data frames using the filter function, so that we can do some conditional selection, how to extract vectors from data frames, and how to do summary statistics of continuous numerical variables using StatsBase. I'm going to draw a histogram, and I'm going to use PlotlyJS as my plotting package. I'm going to show you how to shuffle a vector, so how to move all the elements in a vector around at random, and then how to calculate a p-value using the HypothesisTests package, just so that we can check our results.

So as always, we start with the packages that we're going to use. I'm going to use the DataFrames package. I'm just going to be in the line that contains that code, hold down shift and hit return or enter, and that's going to spawn the Julia REPL for me at the bottom here of Visual Studio Code and also activate the environment that I've set up. If we look at that Julia prompt at the bottom and I type a right square bracket, ], you'll see I'm not in the base installation of Julia; I'm in the data science environment. And by the way, if I just type st and hit enter, I see all the packages that are installed in this data science environment. So there are all the packages, and you can certainly look at those and install them for yourself, and that'll be a good set of packages for data science projects.

Let's carry on. We're going to use Random, which is a built-in package that ships with Julia, and then Distributions, StatsBase, PlotlyJS and HypothesisTests. So the first thing we have to do is set up our data, and we're not going to import any data; we'll just simulate some.
So I'm going to have a sample size of a thousand, and I'm going to store that in the computer variable n, and then I'm going to seed the pseudorandom number generator with the integer 12. If you use 12 as well, on the same version of Julia, we're all going to get the same results. I'm going to highlight both of those pieces of code. The second one sets up a data frame, and I've done that over multiple lines. I'm just going to execute and then explain how I set that data frame up. Holding down shift and hitting return or enter, we get a data frame, and we can see there it's a thousand rows and three columns.

So let's have a look at that. I use the computer variable df, which is quite common. I use the DataFrame function, which comes from the DataFrames package. Just so that you know where these come from, I could write DataFrames. in front of it, and that'll do exactly the same thing; it just references the fact that the DataFrame function comes from the DataFrames package. So I'm going to have three columns. The column names can be any words without illegal characters, and they don't have to be strings or symbols; you just type the letters themselves. So for ID, I'm using a range there, 1:n, and remember that n holds the value of a thousand. Then I'm going to have a group column, and that comes from the sample function in StatsBase. The first argument there is a vector that contains two strings, Roman numeral I and Roman numeral II, then a comma and n, so it says: sample with replacement from those two elements a thousand times. And then there's mass. I'm just using a generic variable, mass; that can be the mass of a product, of an organism, of a human being, it doesn't matter what it is.
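The data frame described here could be sketched roughly as follows. The exact spelling of the column names (ID, group, mass) is an assumption based on the narration, and the group sizes you get depend on the Julia and package versions, so no particular split is claimed.

```julia
using DataFrames, Distributions, Random, StatsBase

n = 1000            # sample size
Random.seed!(12)    # seed so that repeated runs give the same data

# Column spellings (ID, group, mass) are assumed from the narration.
df = DataFrame(
    ID = 1:n,
    group = StatsBase.sample(["I", "II"], n),        # samples with replacement by default
    mass = rand(Distributions.Normal(100, 10), n),   # mean 100, standard deviation 10
)

println(size(df))   # (1000, 3)
```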
I'm trying to be agnostic as far as our variables are concerned. So I'm going to use the rand function to generate random values for me, and the first argument says they come from a normal distribution. Now this Normal function, remember, comes from the Distributions package, so we can write Distributions. in front of it, and that means we can see where that Normal function comes from. It takes two arguments: the first one is the mean and the second one is the standard deviation. So I say a normal distribution with a mean of 100 and a standard deviation of 10, and from that distribution I want 1,000 values selected at random for me. And that's how we end up with our data frame with a thousand rows.

Now I want to make two sub data frames. I know that I have two groups given by my categorical variable called group, and I'm going to use the filter function so that I only select rows that contain the value I in the group column. So we use the filter function, and then I'm setting up this anonymous function here. Instead of row you could just say x, and then it would be x, then minus and greater than, so a little arrow symbol, and then x.group. But I'm going to use the word row because it kind of makes sense. So it's row.group, and then equals equals, so that's a conditional: is it equal to I?
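A small sketch of that filter call, using a tiny made-up data frame (toy) so that it runs on its own. The column names group and mass are assumed from the narration, and the values are invented for illustration.

```julia
using DataFrames

# A tiny stand-in data frame; the values are made up for illustration.
toy = DataFrame(group = ["I", "II", "I", "II", "II"],
                mass = [98.0, 101.5, 103.2, 95.4, 100.1])

# Keep only the rows for which the anonymous function returns true.
group_1 = filter(row -> row.group == "I", toy)
group_2 = filter(row -> row.group == "II", toy)

println(nrow(group_1), " ", nrow(group_2))  # 2 3
```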
If that conditional returns true, the row will be included; if it's false, it won't be, and that's what the filter function does. So if we run this, shift-return or shift-enter, we get a new data frame. It only has 479 rows, and you can see that down the group column we only have Roman numeral I, so it's all our observations that fall into group I. And if we do the same using the conditional for II, we see 521 rows now, and you can see that under the group column there's only Roman numeral II. So we've created two sub data frames manually using the filter function.

Next up, what I like to do in many situations is extract a vector from a data frame column. So I'm going to take df.mass, and that kind of notation gives me all the values in the mass column, and I'm going to pass that as an argument to the collect function, which returns a standard Julia vector for me. I'm going to assign that to the computer variable mass. Once that runs, I see that I now have a 1,000-element vector of 64-bit floating point values, and that's exactly what we want. I'm going to do two more, where I look separately at the masses for my new sub data frames, group_1 and group_2, so I have those two vectors as well, remember, with 479 and 521 elements. And lastly, I just want to know the sample size of each of these, so I'm going to store those as n_I and n_II, with Roman numerals, and that is just passing each vector as an argument to the length function so that I get back those two lengths.

So as always, we start with descriptive statistics if we want to know what's going on with our data. There's a describe function in the StatsBase package, and if I pass the vector to it and scroll down in our REPL at the bottom, we see the summary stats. We see the 479 elements, there are zero missing, the mean was 99.96, the minimum was 69, and then the first
quartile, the median, the third quartile and the maximum. The only thing that you don't see there is the standard deviation, which could of course be helpful, but you can just use the std function for that. Let's also look at group II, the masses of those in group II, and we see the 521 values, and they had a mean of 100. The question now is: is there a difference between these two means? So it's the same variable, mass, in two different groups formed by the sample space elements of the categorical variable, so group I and group II.

Next up, we're going to visualize our data, and I'm going to do it using the plot function from PlotlyJS. Remember, we imported PlotlyJS, so PlotlyJS.plot. I'm writing this over multiple lines, which Visual Studio Code underlines with these little blue markers because it thinks there is a problem, but it'll work fine. My first argument is my data frame. On my x-axis I want mass, and I'm setting the color argument to group, using symbol notation there, and remember, symbol notation for that column mass as well. So it's just going to separate out the masses for each of the two groups. The kind of plot that I want is a histogram, and then I want to change the markers a little bit, because, remember, with these two histograms drawn on top of each other, you want a bit of transparency so that you can see both of them. So I'm going to set marker equal to attr, the attribute function, and I pass a value of 0.5, or half, to opacity. Then Layout gives me access to the title argument and the barmode argument, and I'm setting barmode to overlay so that the two are drawn on top of each other.

So let's run that. Remember, this is Julia, and there's a bit of work that needs to happen behind the scenes before the first plot is rendered, so that's going to take a second or two. And there we see, first of all, right at the bottom, a new icon. That's our Julia icon, part of the Julia extension that we installed, and
that gives me all sorts of information on my current workspace. For instance, we'll see some values there: n is a thousand, n_I is 479, and we can see our data frame there, so we get information on the variables. And now we suddenly see a new tab with a plot, and there you can see group I and group II. Because this is Plotly, I can click on a group in the legend and remove it from the plot. Now they're both removed. Let's look at one, so I can see the histograms separately, and that's what's lovely about Plotly. As far as the histogram is concerned, the distributions look very similar, and we don't expect that there will be a difference between these two means.

So let's go about testing that, and we start by just looking at what the difference is in our data set. I'm going to store that difference in the difference_in_means variable. I'm passing each of the two vectors holding the masses for the groups to the mean function and doing the subtraction, and I get a difference of -0.16677 and so on. Now remember, there was no particular reason to subtract the mean of the masses in group II from the mean of the masses in group I. I could also have swapped that around, and that would give me a positive 0.16677. So that's very important to remember: we have to look at both differences.

So now we're going to do this reassignment under the null hypothesis. If you think about this very simple null hypothesis, that there's no difference between the two means, it means it does not matter in which group any one of our instances falls. We can just shuffle them around and reassign them to a different group. Now, what we're going to do is keep group I at its original sample size and group II at its original sample size, but we can swap members, so instances, around between the two. Under the null hypothesis there's no difference in the means between these two groups, and what we mean by that is that it doesn't matter in which group our
individuals fall: the means are no different and they shouldn't change. So we do that multiple times. Every time we randomly reassign members to the two groups, we calculate two new means and take the difference of the two means, and then we build up a distribution of all those possible differences in means. We just do that thousands of times over, and under every reassignment we're going to get a different difference in means, and we can build up that sampling distribution. And remember that that sampling distribution should then start to approximate a normal distribution.

So let's do just that. I'm going to set up an empty vector there, and I'm going to call it means, and I'm going to do 5,000 reassignments. I'm going to do that with a little for loop, so I'm going to say for i, which is just my counter, in 1 to resample, so in 1 to 5,000. So this for loop is going to run 5,000 times. What we're going to use here is the shuffle function, and as you can see there, the shuffle function comes from the Random package, so Random.shuffle. And mass is just all the masses, remember, of all my subjects, but I randomly shuffle that and assign it to a new computer variable, shuffled_mass. Now I'm going to create two new groups. new_group_1 is the shuffled mass, so the order has been completely reshuffled, but I'm going from 1 to n_I. Remember, that held the value of my first sample size, so if I go from 1 to there, that gives me the same sample size. And then new_group_2 is also taken from the shuffled mass, and I'm using indexing again: I'm starting at n_I plus 1, so the very next subject, and going right to the end. So it's that first lot and the second lot, and they're all reshuffled, so I've randomly reassigned individuals to new groups. And then I'm going to use append!, and that exclamation mark there signals that the function changes its argument in place, so I can just continuously append to what is, at the moment,
at the first loop, an empty vector. And I'm going to append to this means variable the mean of new group 1 minus the mean of new group 2. So every time this loop runs through, there'll be a new difference in means added to that vector. So let's do just that. I'm just going to put my cursor on the end there, hold down shift, hit return, and almost instantaneously I have my 5,000 values.

What we're going to do here is plot them. So I'm using PlotlyJS.plot again, this time with slightly different notation. I'm not using a data frame; I'm just using a vector to plot, so I put that inside of square brackets and use the histogram function. As you can see there, on my x-axis I want all those 5,000 differences in means, and I'm adding a little bit of opacity. Then, outside of the square brackets, is my layout, and my layout is just going to contain a title. So let's run that, and there we can see my sampling distribution of 5,000 values, and you can see it approximating a normal distribution.

Now, our difference in means, that's our test statistic, is somewhere there. Remember, it was negative 0.16677, so it lies just to the left of the 0 here. But I could also have done the subtraction the other way around, so I've got to consider the positive side as well, so positive 0.16677. If you draw those in, you can imagine that, if this were a curve, I'd want the area to the left of the negative value, towards negative infinity, and the area to the right of the positive value, towards positive infinity. What we're going to do here is ask what fraction of values was less than minus 0.16677, add to that the fraction of values that was more than positive 0.16677, and that is going to approximate a p-value for us.

So let's just do that, and it's a simple line of code here. I'm going to sum over means .< difference_in_means. Remember, my difference in means was my actual value, so I want to know, out of all the
values in my whole vector of 5,000 means, how many are less than that. I'm using the dot operator there so that I can do that comparison element by element. That returns true or false values for me, and because true is represented by a 1 and false by a 0, I can sum over all of those, and that sum tells me the number of values, out of all my 5,000 means, that were less than my actual difference in means. I'm going to add to that the same thing for the values that were more than minus the difference in means, so that's positive 0.16677, and once I've done that addition, I divide by 5,000. That tells me the fraction of my 5,000 values that were less than my difference or more than its reflection on the other side. And if we do that, we see 0.795, so that is approximating a p-value for me.

Let's just use the HypothesisTests package. There's a pvalue function in there, and the argument that I'm going to use is the equal variance t-test, EqualVarianceTTest, and I've just got to pass it the two vectors mass_1 and mass_2, and that's going to work out the p-value for me. We see it's 0.7932, so those two are very, very close to each other.

And that is exactly how we approximate the p-value: how we use the null hypothesis, this reassignment, and how we work out, based on the actual difference in means that we have, which is our test statistic, what fraction of values lies beyond it on either side of our resampled histogram, as you saw there. So, conceptually, that's a very clear understanding of where this p-value for the difference in means comes from.
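To pull the resampling steps together, here is a self-contained sketch that uses only the Random and Statistics standard libraries. It is not the exact code from the lecture: the data are regenerated with randn rather than taken from a data frame, push! is used instead of append! to add a single number, and abs is used so that the two-tailed count works whichever sign the observed difference has. The names observed, resamples and p_value are assumptions made for this illustration, and the printed value will not match the 0.795 seen in the lecture because the data differ.

```julia
using Random, Statistics

Random.seed!(12)

# Stand-in data: both groups come from the same normal distribution
# (mean 100, standard deviation 10); group sizes are taken from the lecture.
mass_1 = 100 .+ 10 .* randn(479)
mass_2 = 100 .+ 10 .* randn(521)
mass = vcat(mass_1, mass_2)
n_1 = length(mass_1)

# The test statistic: the difference in means actually observed.
observed = mean(mass_1) - mean(mass_2)

resamples = 5000
means = Float64[]
for i in 1:resamples
    shuffled_mass = Random.shuffle(mass)         # random reassignment
    new_group_1 = shuffled_mass[1:n_1]           # first n_1 values
    new_group_2 = shuffled_mass[(n_1 + 1):end]   # the remaining values
    push!(means, mean(new_group_1) - mean(new_group_2))
end

# Two-tailed: fraction of resampled differences beyond the observed one
# on either side of zero.
p_value = (sum(means .< -abs(observed)) + sum(means .> abs(observed))) / resamples
println(p_value)
```

With the HypothesisTests package installed, the result could be compared with pvalue(EqualVarianceTTest(mass_1, mass_2)), as is done in the lecture.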