So in this tutorial I want to talk to you about the common tests for analyzing categorical variables. We have nominal and we have ordinal categorical variables, where we have a sample space of very well-defined elements. We could perhaps extend this a little bit into the world of discrete numerical variables, if we're only interested in how many times each value occurred and not in the actual values themselves. But let's just stick to categorical variables: we have a sample space of these defined values and we want to analyze them. I want to talk to you about the three most common tests, and we're going to use the R programming language, as is the norm for this playlist. First of all, there's the chi-square goodness-of-fit test. That's when we have a single variable with these elements in the sample space, and we count how many times each of them occurs in a study that we do. Then we see whether those proportions, how many of each someone chose or how many of each there are, are different from what we would have expected. Then we have the chi-square test of independence. That's where we compare two categorical variables against each other, and the question that we're asking is: is the outcome of one dependent on the outcome of the other? If we count the elements of the one against the elements of the other and compare them, is there dependence between those two categorical variables?
And then lastly, you'll also come across Fisher's exact test. Now that word exact in there doesn't mean it's a better test. We use it when we have very small sample sizes, so if we look at those counts and many of them are fewer than five subjects, then we're going to use Fisher's exact test. The exact comes from the fact that we calculate the p-value exactly, or directly. In other words, we're not going to first calculate a test statistic, then see where it falls on some distribution curve and work out the area under the curve. No, no, we're going to calculate the p-value directly, and that makes it an exact test. So let's have a look at these: the chi-square goodness-of-fit test for a single variable, and then the chi-square test of independence or Fisher's exact test where we compare two categorical variables against each other. So here we go, looking at the tests for categorical variables. What we see here is the rendered HTML file on RPubs. Remember that the RMD file, the RStudio file, will be available on GitHub for you to download, and of course you can just view this file as we have it here on RPubs. So let's go through it. I'm going to talk about the chi-square goodness-of-fit test, the chi-square test of independence, and then also Fisher's exact test.
So those would be the most common tests that we'll see for categorical variables. You would notice the two types there, goodness of fit and test of independence. The way that I want to explain it is this. In the goodness-of-fit test we're really just going to look at a single variable. For that single variable we'll have a sample space, and when we collect the data we'll count the number of occurrences of each of the unique values in the sample space. So imagine that we were to roll a fair die 1200 times. A normal die has six faces, and if it's a fair die, we would expect every face to land up about 200 times. It's a single variable, the number that lands face up, and we're just counting how many times each of these occurs. The test of independence, on the other hand, is where we look at two categorical variables and see if there's some dependence between the two, the null hypothesis being that there's no dependence between the two. So we're really looking at proportions of sample space values for categorical variables. Let's start off with the chi-square goodness-of-fit test, and let's consider a very simple example. We're going to take a sample of a hundred people from a population and ask them a question in a survey, and they can choose one of four options: strongly disagree with a statement we might make, disagree, agree, or strongly agree. So there are only four to choose from, and let's just say for argument's sake we expect an equal distribution of those four answers.
We don't expect that people will choose one over the other, so our expected distribution is 25 each. If we give it to a hundred people, we expect 25 to choose strongly disagree, 25 to disagree, 25 to agree and 25 to strongly agree. And now we get back 10, 30, 35 and 25. This is an example of a multinomial categorical variable: the sample space has more than two elements. If there were just two, yes or no for instance, that would be binomial, but this is indeed multinomial, with four elements in the sample space of this question variable. So let's assume that 10 people chose strongly disagree, 30 disagreed, 35 agreed and 25 strongly agreed. If you see down here, we mark each of our observed values as uppercase Y sub i, so Y sub 1 is 10, Y sub 2 is 30, Y sub 3 is 35 and Y sub 4 is 25. There's a total of 100, so n equals 100 subjects, and the probabilities that we expected we write as lowercase p sub i, so p sub 1 is 0.25 and so on until p sub 4.
The general form that these types of tests take is that we sum up the squares of the differences between the observed and the expected values, each divided by the expected value. That squared difference is never going to be negative, because we squared it, and then we divide by how many are expected. For our single variable, what is the observed? Well, that would be all of the Y sub i's, the 10, the 30, the 35 and the 25. What do we expect? Think about it: if we expect a quarter of people to choose each answer and there's a hundred people, it'll be the hundred people times a quarter. That's where we get the n times p sub i, and a quarter of 100 is 25. So for the first one it's going to be 10 minus 25, we square that difference to make it positive, and we divide it by the 25. We add up all of these terms and, lo and behold, we get a chi-square value. So let's just do that in code, the long way.
We're going to create a computer variable here, we'll just call it y, and we attach to that a vector of values using the c function: the 10, the 30, the 35 and the 25. Those are our observed counts. Now the probabilities. We create a vector called p, and we'll use the rep function because we're going to repeat 0.25 four times, so it's just a vector of 0.25, 0.25, 0.25 and 0.25. And n is the sum: we're just summing up the 10, the 30, the 35 and the 25, and that gets us to a hundred. So let's work out the chi-square value. We sum all of these terms, y minus n times p, squared, divided by n times p. Because all these vectors are of equal length, that's no problem: the arithmetic is done element by element, and we get back a chi-square value of 14. Now this chi-square value follows a chi-square sampling distribution, and you might remember that these distributions look different depending on the degrees of freedom. So we've got to work that out, and that's simple enough, inasmuch as the degrees of freedom that we have here is how many values there were in the sample space, there were four, minus how many variables were there?
Well, there was just one in this instance, so we have three degrees of freedom. (For a goodness-of-fit test that always comes to the number of categories minus one.) We're going to create that as a little computer variable, df, which just holds the value three, and then we can use the pchisq function, the chi-square probability function, to work out a p-value for us. The first argument is our chi-square value, 14, the second is the degrees of freedom, and for the last one we say lower.tail equals FALSE, because what we want is the area under the curve from the value 14 towards positive infinity. That gives us a p-value of about 0.003. So we can say that the proportions that we found, the 10, the 30, the 35 and the 25, having expected an equal distribution of those, a uniform distribution I should say, are statistically significantly different from what we expected: our value was a rare value to find. Now, we don't have to do any of that. We can just use chisq.test, that's c-h-i-s-q dot test. We pass the vector of observed values and the p argument, the probabilities, and remember that's 0.25, 0.25, 0.25 and 0.25, the whole vector. We nicely get back a chi-squared test for given probabilities, with a chi-square value of 14, degrees of freedom of three as we expected, and the same p-value. So it's simple enough if you ever have that situation where you just have the single variable, either discrete numerical, where we can look at each individual value discretely, or, more notably, a categorical variable with that defined sample space.
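The goodness-of-fit calculation described above can be sketched in R as follows. The names y, p, n and df follow the narration; chi_sq, p_value and gof are names picked here for illustration, since the exact names in the author's RMD file are not shown in this transcript.

```r
# Observed counts for the four survey answers
y <- c(10, 30, 35, 25)

# Expected probabilities: a uniform distribution over four answers
p <- rep(0.25, 4)

# Total sample size
n <- sum(y)

# Chi-square statistic: sum of (observed - expected)^2 / expected
chi_sq <- sum((y - n * p)^2 / (n * p))

# Degrees of freedom: number of categories minus one
df <- length(y) - 1

# Upper-tail area under the chi-square curve, from chi_sq to infinity
p_value <- pchisq(chi_sq, df, lower.tail = FALSE)

# The built-in test should reproduce the same statistic and p-value
gof <- chisq.test(y, p = p)

print(chi_sq)
print(p_value)
print(gof)
```

Doing it the long way once makes it easier to see what chisq.test reports: the statistic, the degrees of freedom and the upper-tail p-value are the same three quantities.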
We just count the proportions. Now, as I mentioned, the chi-square test of independence is a bit different. We're going to compare the proportions of two categorical variables against each other. As always, let's just look at an example, it makes life very easy. We're going to have a categorical variable called group, and its sample space contains two elements, group one and group two. So imagine a group of patients who fall either into group one or group two. You can imagine that they get different types of medication, a placebo and an experimental drug for instance. Then there's another variable that we're going to call outcome. That outcome variable is also nominal categorical, and it has a sample space of three elements: worse, same and improve. So we just decide that this patient got worse, this one stayed the same and that one improved. You can see the totals that we have there: 44 got worse, 72 stayed the same and 55 improved. When we break this down, we see 33 subjects in group one getting worse, 44 staying the same and 25 improving, and we have 11, 28 and 30 for group two.
So how do we represent that? Well, we do that in what is called a contingency table, a little matrix. One way to do this is to have two vectors and just row-bind them, so each of the vectors forms a row, and I'm going to store that in the computer variable obs, o-b-s. So I rbind my 33, 44 and 25 and my 11, 28 and 30. And just so that it prints out nicely, as you can see below here, I'm going to put column names and row names in there. So I say rownames of obs and I pass the string vector group one and group two to it, and the column names are going to be worse, same and improve. When we print obs out it looks very nice, because we can immediately see that in group one there were 33 that got worse, 44 that stayed the same and 25 that improved, and in group two 11 got worse, 28 stayed the same and 30 improved.
Now we want to ask the question: was the outcome of the patient dependent on which group they were in? That's the question we're asking, and you can see both are nominal categorical variables that we want to compare to each other. So that's our table of observations, our observed contingency table, two rows by three columns. You can have three by three or four by three, the size doesn't matter, so you can have more elements in each of your categorical variables. Now we need to work out an expected table. Let's start with the number of expected subjects in group one that worsened. The observed count was 33. Now, there were 44 subjects who worsened: if I add up this column, 33 and 11, that gives me 44. And there were 102 subjects in group one: if I just look at group one and add 33, 44 and 25, I get 102. So if 44 worsened overall, and group one contained 102, I can work out what I expected this 33 to be very easily by multiplying this column total by this row total and dividing by the sum total. There were 171 patients in this whole study, so that's 44 times 102 divided by 171, and that gives me 26.2. So given that there were 44 people who worsened and 102 people in group one, out of a sum total of 171, we would expect with those proportions there to have been 26.2 people, it's going to be a fraction, in place of the 33. You can work out all six values the same way: it's the column total times the row total in which that value appears, divided by the sum total. If you do all of that you have these six values in what is called an expected table. Now, we don't have to do all of that by hand. This is R, for statistical programming, so we can just use the chisq.test function.
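For reference, the contingency table and the expected counts just described might be built like this. The name obs and the labels come from the narration; expected and result are hypothetical names, and the outer-product shortcut is just one way to do the row-total-times-column-total arithmetic, not necessarily what the author's file does.

```r
# Observed contingency table: one row per group, one column per outcome
obs <- rbind(c(33, 44, 25), c(11, 28, 30))
rownames(obs) <- c("Group 1", "Group 2")
colnames(obs) <- c("Worse", "Same", "Improve")

# Expected count for each cell: row total * column total / grand total
expected <- outer(rowSums(obs), colSums(obs)) / sum(obs)

# Pearson's chi-squared test of independence, without Yates's correction
result <- chisq.test(obs, correct = FALSE)

print(obs)
print(round(expected, 1))
print(result)
```

The test object also carries its own expected table in result$expected, which is a convenient check on the hand calculation.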
We pass our observed table. Now, we don't have to put the row and column names in there like I did; all you need is this matrix of values. And we say correct equals FALSE. I'm not going to get into that, we just don't want Yates's continuity correction there. That gives us a Pearson's chi-squared test, and we see a chi-square value of 8.976, degrees of freedom of two and a p-value of 0.01. If we choose an alpha value of 0.05, that's smaller, so we reject the null hypothesis and say that the outcome was dependent upon which group you were in. Now you can start looking at the different percentages, just to see which group did better if you only look at improve or only look at worse, and you can express in words what this table is trying to tell us. But there is dependence between these two. Also note that I could have transposed this. I could have worse, same and improve on my rows and group one and group two on the columns, so I would have three rows and two columns. That makes no difference at all. It also doesn't really make a difference which way you explain it. You could say the group that you were in was dependent upon whether you got worse, stayed the same or improved. We just attach some human meaning to it by suggesting that your outcome is dependent on which group you were in. I'm oversimplifying slightly there, but I think that explains the situation.
Now the last test I want to talk about is Fisher's exact test. It has the word exact in there, and it might sound like it's of higher quality in some way, a better test. But it is not.
Exact means that we are calculating a p-value directly. We are not first calculating a test statistic, like a t statistic or a chi-square statistic that falls on some distribution curve, and then working out the area under the curve. That's not what we do in an exact test. We just have an equation, we calculate it, and there's a p-value, and that's what the exact means. It doesn't mean it's any more exact or any better than any other test. So let's look at Fisher's exact test. Now, these chi-square values fall on this chi-square distribution curve. But when the numbers start getting small, say you have fewer than five subjects in many of these cells (the usual rule of thumb wants at least 80 percent of the expected counts to be five or more), with values like four and three and one, the statistic is not going to follow that chi-square distribution curve very nicely. When the numbers get small, things go a little haywire, and in those cases we can use Fisher's exact test. Now, Fisher's exact test would work for large numbers too, but it uses what is called the factorial calculation. The factorial is very easy: if I say five factorial, it means I go from five, I count back and I multiply, so five times four times three times two times one. Two factorial is two times one, which is two; three factorial is six, because it's three times two times one; four factorial is four times three times two times one, which is 24; and then five factorial is 120. By the time you get to 10 you're talking about very big numbers, it really ramps up quite a bit. So with something like 44 you know that you're going to get very quickly to a number that is too big for your computer to handle comfortably. So don't overuse Fisher's exact test just because there's an exact in its name. Use it where it's proper, and that's where we have the small values. And there's the equation for it. It also uses a contingency table, but in this simple form you can only have four values in your contingency table.
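How quickly the factorial ramps up can be checked directly in R. This is a small sketch using base R's factorial and lfactorial functions; the variable names are made up for illustration, and the cutoff at 170 is a property of double-precision numbers rather than something stated in the video.

```r
# Factorials of 1 to 5
small <- factorial(1:5)
print(small)

# Ten factorial is already in the millions
print(factorial(10))

# Double-precision numbers run out just past 170 factorial
big_ok <- factorial(170)
big_overflow <- suppressWarnings(factorial(171))
print(big_ok)
print(big_overflow)

# Working on the log scale avoids the overflow
log_big <- lfactorial(171)
print(log_big)
```

This is why implementations tend to work with logarithms of factorials, and why the test is mainly reached for when the counts are small.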
So it's one categorical variable against another, but the sample space of both of those can only have two elements. In this instance we have two rows by three columns, and we can't have that; we can only have two by two. So what you would have to do, by some logical argument, because you're an expert in what you are researching, is combine two of these. You might put same and improve together as one, adding the 44 and 25 and adding the 28 and 30, and make that just worse and not worse. Or we could put worse and same together and have not improved and improved. Either way you end up with a two by two table, and if you had more than two groups, you would also have to combine them in some way. Then we label the four values a, b, c and d: from top left, the first value is a, next to it is b, drop down a row and there's c, and in the bottom right corner is d. If those are our four values in our contingency table, we add a and b and take the factorial, we add c and d and take the factorial, we add a and c and take the factorial, and we add b and d and take the factorial, and we multiply those four together. We divide that by a factorial times b factorial times c factorial times d factorial times the factorial of all of them combined, that's the whole sample size, I should say, factorial. Strictly speaking, that gives the probability of the one table we observed, and the p-value adds up these probabilities over all the tables that are at least as extreme. So let's make a little matrix and call it vals. It's going to have two rows and two columns, because I have four values there. I didn't print vals out to the screen here, but just to show you, you can also add your row names and column names there.
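The factorial formula above can be tried on a small table. The counts below are hypothetical, chosen only so the numbers stay small; they are not the values from the video, which are not shown in this transcript. The names p_table, p_hyper and ft are also made up for this sketch. The formula gives the probability of the observed table, while the two-sided p-value from R's fisher.test adds up such probabilities over every table with the same margins that is no more likely than the observed one.

```r
# Hypothetical 2 x 2 table of small counts (not the video's data)
vals <- matrix(c(3, 1, 1, 3), nrow = 2, byrow = TRUE)
n <- sum(vals)

# Probability of this exact table from the factorial formula:
# factorials of the row and column totals on top,
# factorials of the four cells and of the grand total below
p_table <- prod(factorial(rowSums(vals)), factorial(colSums(vals))) /
  (prod(factorial(vals)) * factorial(n))

# Cross-check: the same probability from the hypergeometric distribution
p_hyper <- dhyper(vals[1, 1], sum(vals[1, ]), sum(vals[2, ]), sum(vals[, 1]))

# The built-in test, which sums over all tables at least as extreme
ft <- fisher.test(vals)

print(p_table)
print(p_hyper)
print(ft$p.value)
```

For this table the single-table probability and the reported p-value differ, which is a useful reminder that the formula is one building block of the test rather than the whole of it.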
You don't have to do this at all, you only need the matrix. We use fisher.test, we pass the two by two matrix, the contingency table, to that, and we can see what it works out for us there. So Fisher's test is very simple. Remember, we use the chi-square goodness-of-fit test for a single variable, the chi-square test of independence if we have two categorical variables, and if we have very small numbers, very small sample sizes, we can use Fisher's exact test. Remember that for these tutorials on R the actual rendered HTML files are on RPubs, and that's what you might see on the screen. But these files are also available in their raw form on GitHub, and all the links will be in the description below. So you can either go to the website and look at the RPubs files as they're already rendered, or you can go to GitHub and download those files onto your system so that you can use them in RStudio yourself. If you like these videos on R, please let me know so that I can make more of them, or if there are subjects that you want me to cover as far as biostatistics and the use of R are concerned, please let me know. Otherwise, please always remember to subscribe and hit the notification bell so that when new videos come out you will know about it. You can also follow me on Twitter, because that's where you'll also see that new videos are out.