Finally, we get there. We're going to take two groups, compare their means, and work out a p-value. We can use a t-test, Student's t-test at that. It's going to be phenomenally exciting to do. Once you've done your first one, there's such a sense of accomplishment. Small things, but it makes you happy.

First of all, just my style sheet. We're going to import that. Let's set up our Python environment. Most definitely we're going to use pandas, because we're going to import our data set, our spreadsheet. From SciPy we're going to import stats, because it's within stats that we're going to find our t-test. Matplotlib always, seaborn always, and filterwarnings from warnings. I want to draw my plots right inside of this web page. I'm playing around with seaborn a bit, and I execute my filterwarnings to ignore. That's the standard setup for doing t-tests. There we go.

Comparing the mean of two groups. Now, when we talk about t-tests, we're going to compare two groups to each other, and we want to find out whether a variable differs between them in any statistically significant way. These two groups are of nominal categorical type. So group A and group B, group one and group two, or group with appendicitis and group without, group with hypertension and group without, test samples with and test samples without. The data type of the two groups themselves, when you look at them, is categorical, and nominal at that. You can't say group two is more than group one. You can't say that group B is less than group A. There's no order to them, because I could just as well have called them the yellow group and the blue group. So the two groups here are of nominal categorical type, and that's when you do t-tests. When we just compare numerical values to numerical values, that would be regression analysis. That's something else. Now for the variable that we are testing in this instance.
We're going to go for white cell count. Those are ratio-type numerical data. So the groups are nominal categorical, and the variable inside of each group that we're going to look at is ratio type, so numerical with an absolute zero, and that tells us we can do t-tests. Let's go.

So we're going to import our mock data. Remember, we're going to give it a computer variable name. A computer variable can hold an object, and that object is going to be a pandas data frame. We're going to execute the code pd, which is our abbreviation for pandas up top. We're going to call the read_csv function, and then in parentheses and in quotation marks the file, which lives within the same folder on our computer. If I go to my main notebook here, we see we're busy with comparing means, and it's inside my desktop in a folder called healthcare research lectures. So that's where this MOOC mock file lives, and I can refer to it directly because this Python notebook file and this spreadsheet file live in the same folder or directory on my computer. Let's import that. Let's see if it has imported correctly and just execute that code. Indeed, I asked for the last three: patient numbers 148, 149, 150. It looks like my data set imported absolutely correctly.

Now, creating the two groups. As I say, one way to do it is with the groupby method, which is more powerful, but to keep things simple and logical in this lecture series, I'm just creating two new data frames. I'm going to call them appendix_neg and appendix_pos. Each comes from the data data frame. Remember how to do this: take that data frame, take this column, and ask a question of that column. So data, open and close square brackets, open and close quotation marks, histo. Look in that column, look at each row, and check whether the result is true. The question I ask is: is there a no in that row?
Yes? If that is true, then put the row inside of this new data frame. And for pos we want the same column. So look where to put the square brackets, look at the repeat of the data frame name, look how to reference the column name. It's going to run through all the rows from top to bottom, and if there is a yes in a particular row, so this equals-equals asks the question, is it true, and yes, it is true, then we're going to put the row inside of that data frame. If it was false, so anything other than yes, it would return a false value and it would not put the row inside of this new data frame. So I've created two new data frames.

Let's just describe them. I just want to toggle all the cell output, so it hides all of these before we actually execute them. appendix_neg, the white cell count column. So I'm just looking at this data frame, which contains all the cases that did not have histological proof of appendicitis on the resection specimen. Then the describe method, with empty parentheses there, round brackets. I see 29 patients, with a mean white cell count of 11, a standard deviation of five, a minimum of four and a maximum of 26, a high white cell count. Let's look at describing just the data for the patients with appendicitis. There were 118. Their mean white cell count was 14.6, and for the others it was 11. Can I do a test to see whether there's a statistically significant difference between those? Indeed, of course I can.

Now, it's easy to look at the describe values there, but we can also just plot the two sets of values next to each other. Sometimes, or almost always, it's better for human beings to look at a plot. The plot I'm going to use is the violin plot, but I'm also going to show you the box plot.
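As a rough sketch of the filtering and describe steps just walked through: the code below builds a tiny made-up data frame instead of reading the lecture's spreadsheet. The column name wcc, the Yes/No spelling of the labels, and all the values are assumptions for illustration only; histo is the column name used in the lecture.

```python
import pandas as pd

# Hypothetical stand-in for the lecture's spreadsheet; every value here is made up.
data = pd.DataFrame({
    "histo": ["Yes", "No", "Yes", "Yes", "No", "Yes"],
    "wcc": [14.2, 9.8, 16.1, 12.5, 11.0, None],  # "wcc" is an assumed column name
})

# data["histo"] == "No" gives True or False for every row;
# indexing the data frame with that result keeps only the True rows.
appendix_neg = data[data["histo"] == "No"]
appendix_pos = data[data["histo"] == "Yes"]

# describe() reports count, mean, std, min, quartiles and max, skipping missing values.
neg_summary = appendix_neg["wcc"].describe()
pos_summary = appendix_pos["wcc"].describe()
print(neg_summary)
print(pos_summary)
```

Note that the count reported by describe leaves out missing values, so it can be smaller than the number of rows in the group.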
Both of these plots are types of seaborn plots. I've commented out these two lines, the matplotlib pyplot ones. Now sns works on top of matplotlib, it extends matplotlib, but the violin plot function inside the seaborn module has actually got all of these things built in, so you don't have to do them. That's why I've commented them out. But you've got to be careful. That's not always the case, and sometimes you do have to include these two. Certainly for violin plots you don't.

Now, what arguments does the violin plot function take? It takes one, then a comma, then another one, then a comma, another argument, a comma, another argument, and the last argument. Some of these have defaults and you can leave them out. The inner argument you don't have to put in; you can leave it out completely. You can leave out color completely. These are the ones I'm playing with. But let's run through the ones I have used.

The first argument, which must be there, is the actual values you want in your violin plot. I actually wanted to draw two violin plots next to each other, so I'm not referencing the values from the individual data frames that I created, but the original data frame. So it's data and the white cell count column, and I'm using this dropna method, with open and close brackets there. In case there are values that are not numerical, not a number, I just drop them from the analysis. Now obviously that white cell count column has white cell counts in there for patients with and without appendicitis. But the beauty of the violin plot is that I can say group by, and it can reference another column in my data set and split on whatever it finds there. Of course, here it is just going to find yes and no. But you might have a column that contains all sorts of other values, and then you'll have more than two plots.
There are just going to be two violin plots here, just because there's just yes and no in this column. So I'm going to take all the white cell counts, but I'm going to split them into different plots based on what is in the histo column. That's very powerful right there. And again, I'm using dropna, but in this instance dropna is really superfluous, so let's just take it out. It's just confusing the picture, because histo just says yes and no. There are no numerical values, and we might get something weird going on there. Fortunately, seaborn is clever enough to sort out that tiny little mistake for you. So let's leave that out.

Think about these first two arguments though. They are very powerful. We're going to reference two different columns. We're going to group by this second column, and we're going to take the values from this first column for each of the two groups. Very clever. inner equals points is just making my plots a bit prettier, because it's going to plot the individual data points on the plot as well, but you can definitely leave that out. Names are important. The names go in a list. Remember, you create a list with these square brackets. The first one is going to be no appendicitis, the second one appendicitis. Just check which order you put them in, because you might put the wrong name on the wrong violin plot. And this color you can definitely leave out. This is the code for purple and green.

Let's run this. Beautiful, two violin plots next to each other. So as I said, just be careful. You can do the describe, and you can see that this one obviously had a lower median than that one, so I've labeled them correctly. There's my no appendicitis. I can see a beautiful kernel density estimate there. I can see the actual data points there. So that was that inner equals points.
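The call described in the lecture passes the values, a group-by column and a list of names, which matches an older seaborn interface. As a hedged sketch of the same idea with the keyword interface in current seaborn releases (a swap-in, not the lecture's exact call), the code below draws a violin plot and, next to it, the box plot that the lecture also mentions. The data are randomly generated and the column name wcc is an assumption.

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; leave this line out inside a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Made-up data: two groups of white cell counts labelled by a histo column.
rng = np.random.default_rng(0)
data = pd.DataFrame({
    "histo": ["No"] * 30 + ["Yes"] * 30,
    "wcc": np.concatenate([rng.normal(11, 5, 30), rng.normal(14.6, 4.6, 30)]),
})

fig, (ax_violin, ax_box) = plt.subplots(1, 2, figsize=(10, 4))

# x is the grouping column, y holds the values; order fixes which group is drawn first.
sns.violinplot(data=data, x="histo", y="wcc", order=["No", "Yes"],
               inner="point", ax=ax_violin)
sns.boxplot(data=data, x="histo", y="wcc", order=["No", "Yes"], ax=ax_box)

# Relabel the two groups, in the same order as the order argument above.
for ax in (ax_violin, ax_box):
    ax.set_xticks([0, 1])
    ax.set_xticklabels(["No appendicitis", "Appendicitis"])
```

Setting order explicitly is one way to avoid putting the wrong name on the wrong plot, which is the mistake the lecture warns about.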
So it's not really necessary, but you can see that the median was lower for the no appendicitis group than for the appendicitis group. A beautiful graph that you can get there from these arguments.

You might prefer a box plot. You might want to send one in with your research article. Again, boxplot is part of sns. You need to give these arguments again. The data that you want is the white cell count, with dropna. The group by: you want it all again to be grouped by, let's take away that little error there, data, histo. So look in the histo column and split on the differences. Again, there's just yes and no, but if there was a maybe in there, I would have three plots. The names are just the labels that you want underneath, again as a list. And again, I'm going for this color. I actually changed some colors when I started this sheet, playing around with the default values in seaborn, so I needed to put that in. But there we go, and you can decide which is best. I like violin plots. It looks like there's a bit more information to get from a violin plot, but box plots are the standard. So this would be one and a half times the interquartile range beyond the quartiles, and anything outside of that would be a statistical outlier, and it'll mark those for you with these little spots there. And certainly there's the first quartile, the median, the third quartile, et cetera. Beautiful. And it actually gives a name on the axis there. We made the split based on the histo column: no appendicitis, appendicitis.

Now, we've done the describe function for the white cell count column. We noted 29 values negative, 108 positive, but we want an answer now. We want to know: is there a difference in white cell count between these two groups? So what are the steps that we run through?
So you've got to have this in your mind. This is what you're going to do every time. Eventually this will become second nature, and all you'll want to do is write the code and get your results, but let's run through these steps.

Step one: you have a burning question. Is there a difference in admission white cell count between patients with and without acute appendicitis on histological evaluation? Because eventually you want to say, can I use white cell count as a discriminator to make a decision? This would be one of the first steps in doing that. So that's your burning question.

Step two: do a literature review. This question might already be answered out there. You might learn from what has been published before, how people went about it, and how to construct your research. It's very important to search for previously published articles pertaining to the question you have.

Get ethical approval if required. Don't start doing anything until you get ethical approval. If required, that is, because you might deal with things that you don't need it for.

Next step: decide on the variables to be collected to answer your question. Now, I need to know, do they have appendicitis or not? This is an event that has already happened, so I'm going to have to get this from the patient files. I don't know up front who did and who didn't have it. The only discriminating thing I can really have is the operative specimen that was sent away for analysis, because the histology will say either there was or there was not appendicitis. So I need to look at my histological results, and then I need to go to the lab and, for those patients, pull their first white cell count, the white cell count that was done on admission. So those would be the variables I collect, so that I can form my two groups and have the actual numerical values I want to compare between the two groups. I've got to decide: to answer my question, what data do I want to collect?
Now I'm going to set my hypothesis. So this, and step six, must happen before any data collection. Unfortunately, we already have the data there, but I've got to do something for these lectures. So imagine we haven't collected any data. Our null hypothesis is that there is no difference in admission white cell count between those with and without histologically proven acute appendicitis. And our alternative hypothesis, or test hypothesis, is that there is a difference.

You can make an argument that we could go for a one-tailed test here. We could say, look, the white cell count is going to be higher in those with appendicitis. But that's a very dangerous thing to do, because it might very well be that acute appendicitis leads to overwhelming sepsis, the immune system can't keep up, and those patients actually have a lower white cell count. How do you know? You don't make wild guesses and estimates like that in clinical research. Stay safe with a two-tailed test. We say there is going to be a difference. Hypothesis stated, done. By my hypothesis, I'm going to do a two-tailed t-test. I cannot go back and change that. That is set in stone.

I'm going to decide on an alpha value. That's my risk of making a type one error, or more practically, that is my cutoff for statistical significance. I'm going to choose an alpha value of 0.05, so a p-value of less than 0.05 will be statistically significant. But that is actually my risk of making a type one error, that is, falsely rejecting the null hypothesis.

This means we will have to construct a t-distribution. That's what the computer is going to do. A t-distribution, because remember, we don't know what the parameter in the larger patient population is, what the standard deviation is for white cell count.
We don't know that. We only have our own data. So it's going to calculate the two means, it's going to calculate the difference between the two means, and it's going to convert that into a t statistic, which will then be placed somewhere on the x-axis under this beautiful bell-shaped t-distribution curve. It works out the area under the curve beyond where this falls, and that area is split over both sides, remember, because this is a two-tailed test.

We collect our data. Done. Now we have got to choose an appropriate statistical test. We are comparing two categorical groups with each other, but the variable itself is ratio-type numerical, so I can use a t-test for that. Am I going to use the parametric t-test or a non-parametric alternative? That would be question number one. And how do we decide? Well, we need to know whether our samples, our white cell counts for the two groups, come from a population in which the variable is normally distributed.

QQ plot to the rescue. Remember, that comes from the stats submodule, not from sns. That's why I had to import from SciPy. I said from scipy import stats, the stats submodule. Now, you've got to be careful when you import a submodule like that: we actually have to reference the whole thing. So stats dot probplot. I want the probability plot, the QQ plot, inside of the stats submodule. Let me just write this out to make it abundantly clear, and I'm going to comment the line out. If I said from scipy dot stats import probplot, if I can spell probplot, then I could have used probplot directly, because probplot is a function, whereas stats is a submodule. I can't use stats on its own. I have to use one of the things inside of stats. Okay, so that's the difference.
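To make the two import styles concrete, here is a small sketch. The sample values are made up; the point is only that both names end up referring to the same SciPy function.

```python
from scipy import stats            # imports the submodule; functions are called as stats.<name>
from scipy.stats import probplot   # imports one function; it can be called directly

sample = [4.1, 7.3, 9.8, 11.0, 12.6, 15.2, 26.0]  # made-up values for illustration

# Called without a plot argument, probplot returns the points and the fitted line.
(_, _), (slope_a, intercept_a, r_a) = stats.probplot(sample, dist="norm")
(_, _), (slope_b, intercept_b, r_b) = probplot(sample, dist="norm")

print(stats.probplot is probplot)  # both names point at the same function object
```

Importing the submodule keeps it obvious in the code where each function comes from, which is a reasonable choice when a notebook uses several things from stats.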
I said from scipy import stats, with stats as a submodule. So if I want to use anything inside of stats, and this is something you have to get used to, I still have to say stats dot probplot, because I initially just said from scipy import stats. Open and close parentheses there.

What does it take? It takes three arguments. First the data I want. I want appendix_neg, remember, that was one of the new data frames I made, with just the negative patients. I want the white cell count column in there, with dropna. I want the reference line, the straight line, to be a normal distribution. And what plot? Please use matplotlib to do this plot for me. The semicolon at the end just means it doesn't print all sorts of numerical values, it just draws the plot directly, but you don't really need that semicolon. Let's run that.

And look at that. That's 0.97. That's very close to a normal distribution. I'm going to make a judgment call and say that's good enough for me. I think that my sample of appendix-negative white cell counts came from a population in which the parameter was normally distributed. So I think I can use Student's t-test. Let's just look at the other one. The other group would be the appendix-positive white cell counts. Let's do their probability plot. And look at that. There were more of those, and we actually get 0.989, so an even better QQ plot there. So I think that white cell count is a normally distributed parameter in the population, and I can use a parametric t-test. But that's a judgment call. There really aren't any hard and fast cutoffs.

Now, I'm not done yet. There are various kinds of parametric t-test. I now have to look at the difference in variance between the two groups. I'm going to say appendix_neg, the white cell count column, and just do var for me.
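A minimal sketch of the normality and variance checks, run on randomly generated stand-in data rather than the lecture's spreadsheet. The column name wcc, the group sizes and the distribution settings are all assumptions. Here probplot is called without a plot argument so that it returns numbers instead of drawing; passing plot=plt, as in the lecture, draws the QQ plot.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Made-up stand-ins for the two data frames; "wcc" is a hypothetical column name.
rng = np.random.default_rng(1)
appendix_neg = pd.DataFrame({"wcc": rng.normal(11, 5, 30)})
appendix_pos = pd.DataFrame({"wcc": rng.normal(14.6, 4.6, 100)})

results = {}
for label, frame in [("negative", appendix_neg), ("positive", appendix_pos)]:
    values = frame["wcc"].dropna()
    # Returns the QQ points and the least-squares line through them;
    # r is the correlation coefficient of that fit.
    (theoretical, ordered), (slope, intercept, r) = stats.probplot(values, dist="norm")
    # pandas var() is the sample variance (it divides by n - 1).
    results[label] = {"r": r, "variance": values.var()}
    print(label, "r =", round(r, 3), "variance =", round(values.var(), 1))
```

A value of r close to 1 means the points lie close to the straight line. How close is close enough remains the judgment call described in the lecture.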
That is the variance: 26. And for the positive group that is 21. I look at this because there are two versions of the test. There's the equal-variance t-test, where the mathematics behind it pools the variance of the two groups, and there's the unequal-variance t-test, which doesn't pool the variance. It's very technical stuff, and we needn't be too concerned about that. You've just got to see it in the context of the type of values that we're dealing with. 21 and 26 are really close to each other. I'm going to make a judgment call and say these two variances are close enough to each other, so I'm going to use the t-test assuming equal variances. It's actually assuming equal variances in the population, but all we have is our sample. Okay, let's carry on.

Lastly, one more thing. Are these groups independent or not? Dependent groups would be if I have sets of identical twins, or if I ran a trial with an intervention on the same patients: I do a test on them, I do an intervention, and in those exact same patients I do that same test again. So it's that exact same patient in each group. Those groups would be dependent. Our groups are totally independent. The two sets of patients, with and without appendicitis, have nothing to do with each other. So I need to know about equal variance and I need to know about dependence. These are two independent groups, and I'm going to use the test for equal variances.

This is how I'm going to construct it. It's called stats dot ttest underscore ind, and that stands for independent, t-test independent. Here it is in all its glory. Now the stats dot is there because it's part of stats, so I've got to refer to stats. ttest_ind gives you back two values. It returns two values, therefore I have to give it two computer variables. This is a strange thing to do.
It's the first time you're going to see it. I have to use two computer variables separated by a comma. It's going to give me back the t statistic and the p-value, so two results. You can call them whatever you want. I call them t underscore stat and p underscore val. Then equals, and there we go, stats dot ttest_ind. There's also one for the dependent t-test, but independent is what we're doing.

Here are the arguments. You have to tell it the two groups. These t-tests are for two groups. ANOVA, analysis of variance, is for more than two groups, and we only have two groups here. You can put them in any order. There's my first group, the white cell count values, and there's my second group. I'm using the safety net of dropping the NAs for both of them. So it's the appendix-negative data frame and the white cell count column, then the appendix-positive data frame and the white cell count column. And then I say equal underscore var, that's the other argument, equals True. Now, that is the default, so I needn't have put it in. If I had decided 21 and 26 are too far apart, I would have said equal_var equals False, and it would do a different type of test. It's going to return two values for me, which I've called computer variable one and computer variable two, and I'm just asking it to print the p-value to the screen.

Let's see what you think it's going to be. Boom: 0.0019. My alpha value is 0.05, so I can say in my report that the p-value is less than 0.05. There's a statistically significant difference in admission white cell count between patients with and without appendicitis. Just to show you the t-stat that it worked out as well: it was negative 3.16. The sign just depends on the order in which I put these two groups. Beautiful. We've done our first t-test and we've gotten a result. Okay, for fun, let's just do something else.
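Here is a hedged sketch of the call described above, on randomly generated stand-in data (the group sizes, means and spreads are made up, so the printed numbers will not match the lecture's). It also recomputes the equal-variance result by hand, to show what pooling the variance and splitting the area over two tails amount to, and runs the equal_var=False variant mentioned above for comparison.

```python
import numpy as np
from scipy import stats

# Made-up stand-ins for the two groups of white cell counts.
rng = np.random.default_rng(2)
neg = rng.normal(11, 5, 30)
pos = rng.normal(14.6, 4.6, 100)

# Two return values, so two names on the left, separated by a comma.
t_stat, p_val = stats.ttest_ind(neg, pos, equal_var=True)      # pooled variance (the default)
t_welch, p_welch = stats.ttest_ind(neg, pos, equal_var=False)  # variances not pooled

# The same equal-variance result by hand.
n1, n2 = len(neg), len(pos)
pooled_var = ((n1 - 1) * neg.var(ddof=1) + (n2 - 1) * pos.var(ddof=1)) / (n1 + n2 - 2)
t_manual = (neg.mean() - pos.mean()) / np.sqrt(pooled_var * (1 / n1 + 1 / n2))
# Two-tailed: take the area beyond |t| in one tail and double it.
p_manual = 2 * stats.t.sf(abs(t_manual), df=n1 + n2 - 2)

print("equal variances:  ", t_stat, p_val)
print("unequal variances:", t_welch, p_welch)
print("by hand:          ", t_manual, p_manual)
```

Swapping the order of the two groups flips the sign of the t statistic but leaves the two-tailed p-value unchanged.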
Let's do a t-test for unequal variances. As I said, it's exactly the same. I'm just using two new computer variables, so they don't get confused with those two. Otherwise it's exactly the same here. All I've done is say equal_var equals False. And as I say, this is just for fun. The variances are not unequal in this case. It's also called Welch's t-test, or the Welch-Aspin test. It's a t-test for unequal variances. I just want to show you the difference. That's a p-value of 0.0049, and the previous one was 0.0019. So see, there is a difference. The mathematics behind it is slightly different, so you've got to make that judgment call.

All right, there's a difference between those two. Sure, they're both lower than 0.05, but you can well imagine a situation where one is just over and the other is just under. And that is where ethics comes in. You've got to be ethical about your results. What I should say there is: state your variances in your publication, so people who read your research can decide whether it was right to use an equal- or unequal-variance t-test. Don't hide that from your readers. That's unethical research.

Okay, now imagine I decided that those two QQ plots were not showing me that the samples came from a normally distributed population. That's not so in this case. I'm only using this as an example. The test that you would do then is called the Mann-Whitney U test. In Python, the Mann-Whitney U test is for independent groups, and the Wilcoxon signed-rank test is for dependent groups.
Just remember that, because sometimes those two get confused. In Python, the Wilcoxon one is for dependent groups and the Mann-Whitney U is for unpaired, independent groups.

It also gives you back two values, so you have to say u underscore stat and p underscore value, or whatever you want to call the two. The first one is going to be the test statistic. For the t-test, that was the t statistic, which is how it converts the difference in means. Remember, there's a mean white cell count for one group and a mean white cell count for the other group, and you subtract one from the other. Depending on the order, you're going to get either a positive or a negative result, and that gets converted to a spot on the x-axis of our bell-shaped t-distribution curve. For Mann-Whitney, the statistic is U, which is worked out from the ranks of the values rather than the means. The p-value is the one we're interested in. You can call those two whatever you want, but you have to put two names separated by a comma, and then it's stats dot mannwhitneyu.

Now, all the Mann-Whitney U takes is the two sets of values separated by a comma. So from this data frame the white cell count values, with the safety net of dropna, and from this data frame the white cell count values, again with the safety net of dropna. Now, in the version of SciPy used here, the Mann-Whitney U test gives you back a one-sided p-value, and you have to multiply it by two. Newer versions of SciPy return a two-sided p-value by default, so check which one you have before doubling. And this is where ethics comes in again. You really have to be ethical about your research. Do not now, post hoc, try to change your hypothesis. We said two-tailed. We can't change it now. The t-test gave us back a two-tailed p-value, but here we have to multiply our value by two, because Mann-Whitney only gives us a one-tailed value back. So let's multiply that by two, and we get 0.014. So it's slightly different from the ones we've had before. So there you go, our first t-test. Wonderful.
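One way to avoid depending on a SciPy version's default is to state the alternative explicitly. The sketch below does that on randomly generated stand-in data (group sizes and values are made up), so no manual doubling is needed for the two-tailed hypothesis set earlier.

```python
import numpy as np
from scipy import stats

# Made-up stand-ins for the two independent groups of white cell counts.
rng = np.random.default_rng(3)
neg = rng.normal(11, 5, 30)
pos = rng.normal(14.6, 4.6, 100)

# Ask for the two-sided p-value by name instead of relying on the default.
u_stat, p_two_sided = stats.mannwhitneyu(neg, pos, alternative="two-sided")

# The one-sided versions, shown only for comparison with the two-sided result.
_, p_less = stats.mannwhitneyu(neg, pos, alternative="less")
_, p_greater = stats.mannwhitneyu(neg, pos, alternative="greater")

print("U =", u_stat)
print("two-sided p =", p_two_sided)
print("one-sided p (less, greater) =", p_less, p_greater)
```

The direction of the test was fixed as two-tailed before data collection, so the two-sided value is the one to report.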