So in this lecture, we're going to look at comparing categorical data. Let's set up our style sheet. There we go, just the normal things. I've also added Image this time from the IPython.core.display package, just so that I can display some images in this notebook, on this web page.

So what are we going to need to get the job done? I'm going to import numerical Python, numpy, and I'm going to use the abbreviation np. I'm going to import pandas as pd. From scipy.stats I'm going to import bayes_mvs. And then my old friend matplotlib.pyplot as plt, seaborn as sns, and from warnings import filterwarnings. I'm going to do my plots inline and set filterwarnings to ignore. I might not use all of them, but those are the standard ones.

Let's compare categorical data. Now, in the previous lecture, you remember, we had two groups, and those groups were of nominal categorical type. In other words, it was with appendicitis and without appendicitis. There's no way to order them. It's not ordinal categorical data, it's nominal categorical data. As I said, I might as well have called them blue and pink; there's no way to order that. But the variable in each of those two groups that I wanted to compare was of ratio type, ratio numerical type.

Now, what if that data was also categorical in some way? What if it was also nominal categorical, or even ordinal categorical? What I'm thinking of here, for instance, is surveys with a Likert-type scoring system. If you give a survey to someone and you ask them to fill in, based on a statement, whether they strongly agree, agree, neither agree nor disagree, disagree or strongly disagree, that would be ordinal categorical data.
There's some order to them. It's still categorical, because you can't say that someone who chose disagree is twice as unhappy as someone who chose agree, et cetera. But that is all categorical data, and I've got it in two groups, or even more than two groups. We need to be able to compare categorical variables with each other as well. To the rescue comes one of the most commonly used statistical tests, the chi-square test.

So let's just import our mock data there. We've done it a hundred times before. Well, not a hundred. It's the pandas read_csv function, and in brackets, in quotation marks I should say, mooc_mock.csv. It's a comma-separated values file. As soon as I use read_csv and put the result inside of a computer variable, that computer variable becomes an object, and that object contains a pandas data frame. We can look at the first five rows, just to make sure everything imported beautifully. No problems there.

So, the chi-square test. I'm going to explain it by way of an example. Imagine we make two groups, those with and those without appendicitis, and we want to see whether the incidence of retroviral disease is different between the two groups. I might as well have said those with and without retroviral disease in our study group, and what's the difference in the rate of appendicitis? It makes more sense the other way around, but it's going to work out exactly the same.

So I have to create a little table, and that table is called a contingency table. It contains a certain number of rows and a certain number of columns. It's a neat way to order things. That table is usually two by two, meaning two rows and two columns, but you can extend it even more. If you think of a five-point Likert scale, of course, you're going to have at least five rows on the one side, or, if you wanted to put that in the columns, five columns. Let's use the groupby function this time.
So I'm going to create a new object. data was a data frame, and if I create something new from it, a new computer variable, that is going to hold a grouped version of that data frame. This time, instead of making new sub data frames, remember we used the comparison operators, equals equals, et cetera, I'm going to use groupby. We have seen it before. So I'm going to give it a new name: histo_group equals data.groupby, and then you've got to tell it what column to group by. If you use the column there, you've got to reference the data frame name again. So it's data, square brackets, single quotes, Histo. You can use double quotes as well.

So it's going to split this data frame based on all the different values it finds in the Histo column. Now, the Histo column is only going to contain no and yes, so we're only going to have a no and a yes split in this group.

Let's just see how that works. Now that it's split, I look at the RVD column of this new split-up data frame, so histo_group, the RVD column, and we're going to just do value_counts again. What does value_counts do? Remember, it finds the whole variety of values in the column and it counts each of them. Again, we're just going to have yes and no, and this is what we find. It tells us that our data frame is grouped by the Histo column. And what did it find in the Histo column? No and yes. That is all that it found. Inside of that it looked at the RVD column, and in the RVD column it found no and yes as well. But it now does that for both of these groupby splits. So if the histology was no, there were 16 RVD-negative patients and 14 RVD-positive patients, and among those with appendicitis there were 80 RVD negative and 40 RVD positive. Now, we can read this, and it tells you how we split the data frame. But from this, it's just a linear little list there.
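The grouping step described above can be sketched roughly as follows. This is a minimal sketch, not the lecture's notebook: the real mooc_mock.csv file is not available here, so the data frame is rebuilt from the four counts quoted in the lecture, and the column names Histo and RVD and the values "No" and "Yes" are assumed spellings.

```python
import pandas as pd

# Mock data (an assumption): rebuilt from the counts quoted in the lecture,
# not read from the real mooc_mock.csv file.
rows = (
    [("No", "No")] * 16 + [("No", "Yes")] * 14
    + [("Yes", "No")] * 80 + [("Yes", "Yes")] * 40
)
data = pd.DataFrame(rows, columns=["Histo", "RVD"])

# Split the data frame on the values found in the Histo column.
histo_group = data.groupby(data["Histo"])

# Count the RVD values inside each Histo group.
counts = histo_group["RVD"].value_counts()
print(counts)
```

The result is a single column of counts indexed by the Histo value and then the RVD value, which is why it still has to be rearranged into rows and columns before the test.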
We've got to create this row-and-column effect, and you've got to do this by hand. What I'm going to use is numpy.array. Remember, we imported numpy as np. So I'm going to build this array using the data up here. I'm going to give it a computer variable name. You can call it whatever you want. I called it histo_RVD_observed, because that's what we observed in our study. This two-by-two contingency table holds our observed values.

Now look how I have to construct this array, very carefully. I've got my round brackets there to open and close the array function. These round brackets, remember, follow every function, and inside of them go the arguments. Here, though, it takes a single argument, and that single argument goes inside one outer pair of square brackets. Then I'm going to write down each and every row, split by a comma, and every row goes in its own square brackets. So the first row is 16, comma, 14, and the other one is 80, 40. Let's see what that looks like. Now it's two by two, two rows and two columns, and my contingency table of observed data looks like this.

Now, I made a little figure there just to clear things up, so you can see exactly what it means. So 16, 14, 80, 40. That's what we have there. On the top will be the RVD status, and on this side the Histo status. For the Histo no group I had RVD no 16 and RVD yes 14, and for the Histo yes group I had RVD no 80 and RVD yes 40. So there we go, 80 and 40, and this is exactly what we've created here with this NumPy array. Once I have an array like this, I can do the chi-square test.
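As a rough sketch of the array construction just described: the counts are the ones quoted in the lecture, while the exact spelling of the variable name is an assumption. The row and column totals are printed only as a sanity check on the layout.

```python
import numpy as np

# Observed counts from the lecture.
# Rows are Histo (No, Yes); columns are RVD (No, Yes).
# The variable name is an assumed spelling of the one used in the lecture.
histo_RVD_observed = np.array([[16, 14],
                               [80, 40]])

print(histo_RVD_observed)
print(histo_RVD_observed.shape)        # two rows, two columns
print(histo_RVD_observed.sum(axis=1))  # row totals (Histo no, Histo yes)
print(histo_RVD_observed.sum(axis=0))  # column totals (RVD no, RVD yes)
print(histo_RVD_observed.sum())        # grand total of patients
```

A table with more categories works the same way: each extra row is one more inner pair of square brackets, separated by a comma.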
So you've got to do that step. Now, the function that we're going to use is from the scipy, scientific Python, module, the stats submodule, and it's called chi2_contingency. Open and close the round brackets, and it takes a single argument. The argument that it takes is this array. I created it up there and I gave it this name, so it's going to take that value inside of there.

Now let's just look right up there at the imports. See, this was never going to work, because I never imported it directly like that. So let's do that: chi2_contingency, and if I hit the tab key it will auto-complete that for me. That's wonderful to do.

Now look what happens if I rerun this. Remember, I've run some other cells already, so this is going to be out of sequence. If I run this now, that becomes seven. It's now the seventh block of code that has been executed. IPython does that. It's not going to see these cells in order as the page goes down; it's going to see them in the order in which you execute them. So the latest one that you've executed will be what is in memory, and that can sometimes catch you out when you start changing what is inside of computer variables. You have to remember what the last value was.

So, back to chi2_contingency, which will now work. Well, would it have worked? That's another interesting question. Remember, I didn't say from scipy import stats. I commented that line out. I said from scipy.stats import bayes_mvs, importing that directly. So look at that, another little catch. These kinds of mistakes creep in all the time until you get used to them, so it's good to see them here. So I need not put stats dot in front. In fact, it won't work if I do. I've got to just use the name itself, chi2_contingency. That's it.

Back to what it does. It returns four values, and therefore you have to attach four different computer variables to it. First, it returns the chi-squared value.
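The import pitfall described above can be sketched as follows, assuming the observed counts from the lecture. The variable names on the left-hand side are illustrative, not necessarily the ones used in the notebook. Both import styles are shown side by side; which prefix is valid depends only on which import line was actually executed.

```python
import numpy as np
from scipy import stats                    # style 1: import the submodule
from scipy.stats import chi2_contingency   # style 2: import the function by name

observed = np.array([[16, 14],
                     [80, 40]])

# Style 2: the function was imported by name, so no stats. prefix is used.
chi, p, dof, expected = chi2_contingency(observed)

# Style 1: only the submodule was imported, so the stats. prefix is required.
chi_b, p_b, dof_b, expected_b = stats.chi2_contingency(observed)

print(chi, p, dof)
print(expected)
```

If only the second import line had been run, the call with the stats. prefix would fail with a NameError, because the name stats would not exist in memory.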
That's why I've called the first one chi. It also returns our p-value, a probability. So what is the p-value for? Is there a difference in the incidence of RVD between those with and without appendicitis? Or you could say it the other way around, because this is two by two. It's also going to give me back the degrees of freedom and the expected values.

Well, the expected table works from the totals. So 80 plus 16, that is 96, and on that side 54, for the columns, and for the rows 30 on this side and 120 on this side. So if you just add up the columns and you add up the rows, there's a little equation that gives you an expected table: what would you have expected to see? And it's going to take your observed values and measure them against these expected values, using the degrees of freedom, to give us a p-value. That's not important. What's important is that the values come back in this order, so give them names that make sense.

So let's run that, and let's print these to the screen. The chi-squared value was 1.3. On its own it doesn't mean much to us; what we are after is the p-value, and that was 0.25. So there was no statistically significant difference in RVD status between the histology negative and positive groups. Just to show you the degrees of freedom, we're not going to do much with that: there is just one degree of freedom in my two-by-two contingency table. And that was the expected table. Given the totals, if I were to total these rows individually, total these columns individually, and have a grand total of all the patients, this is what I would have expected to see: about 19.2, 10.8, 76.8 and 43.2. What I observed was very close to that, hence my p-value not being statistically significant.

And that is the chi-squared test for categorical data. Just to remind you, you've got to create this little table. One way to do that is to use the groupby function, just so that you can get your values there. You have to think about what you want from your data set, and this is the way to create your array. Now, it could have been bigger. You could have had another value there. You could have had a comma there and another set of square brackets to make more rows, et cetera. And then, very simply, you run the chi2_contingency function. It takes one argument, which is just your values, and out pop four different values, and the p-value is the second one there. I could have called these whatever I wanted, but it gives you back these things in order, so name them like this so that you remember what each one is.

So that's quite an easy test to do. The one thing that can catch you out, as it did here in this example: remember how you imported things. That's very important. If I had just said from scipy import stats, I could have referred to stats.chi2_contingency. But if I imported it like this, from scipy.stats import chi2_contingency, I just use the name chi2_contingency. Excellent.
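The "little equation" for the expected table mentioned above can be sketched by hand and checked against what chi2_contingency returns. This is a minimal sketch using the counts quoted in the lecture; the names row_totals, col_totals, expected_by_hand and dof_by_hand are illustrative and do not come from the lecture.

```python
import numpy as np
from scipy.stats import chi2_contingency

observed = np.array([[16, 14],
                     [80, 40]])

row_totals = observed.sum(axis=1)   # 30 and 120
col_totals = observed.sum(axis=0)   # 96 and 54
grand_total = observed.sum()        # 150

# Expected count for each cell = row total * column total / grand total.
expected_by_hand = np.outer(row_totals, col_totals) / grand_total

# Degrees of freedom = (rows - 1) * (columns - 1).
dof_by_hand = (observed.shape[0] - 1) * (observed.shape[1] - 1)

chi, p, dof, expected = chi2_contingency(observed)

print(expected_by_hand)
print(np.allclose(expected, expected_by_hand))
print(dof == dof_by_hand)
print(round(chi, 2), round(p, 2))
```

One detail worth knowing: for a two-by-two table, chi2_contingency applies Yates' continuity correction by default (its correction parameter defaults to True), which is why the statistic comes out near 1.3 rather than the uncorrected value of about 1.85.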