In this lecture, I want to talk to you about parametric versus non-parametric statistical tests. We're almost there: we're almost going to import some data, compare groups to each other and calculate p-values. One step before we get there, though: parametric versus non-parametric statistical tests. As per usual, we're just going to run this first cell to import our style sheet.
Now let's set up our environment. From the scipy module I'm going to import the stats sub-module. I'm going to import matplotlib.pyplot as plt, as usual, and seaborn as sns. And then from the warnings library I'm going to import filterwarnings. As per usual, I'm going to use my magic command, %matplotlib inline, to draw my graphs right on this web page. The next line is just playing with seaborn, setting some default values so that my graphs look a bit different. Then I'm going to execute filterwarnings with 'ignore', just to ignore those ugly pink warning messages.
So remember previously, from the central limit theorem, that if I were to construct a histogram of how many times certain means occur, some would occur more commonly than others. This is slightly more difficult than that but, as I say, there are no surprises. In the grander scheme of things, there really are two different types of statistical tests, called parametric and non-parametric. The reason we make this differentiation is that if you use the correct one in the correct situation, you actually get a better statistical result. It is more representative of the population out there, so you get better inferential statistics. So I want to introduce you to the QQ plot. The QQ plot is available in the scipy.stats sub-module. It is very good practice to run your data past a QQ plot first, before you decide whether to use a parametric or a non-parametric test.
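The environment setup described above might look roughly like the sketch below. The seaborn styling call is an assumption, since the lecture only says that some defaults are changed; the magic command is left as a comment because it only works inside a notebook.

```python
# Notebook setup sketch. The seaborn style is an assumed, purely cosmetic choice.
from scipy import stats
import matplotlib.pyplot as plt
import seaborn as sns
from warnings import filterwarnings

# In a Jupyter notebook, also run the magic command:
# %matplotlib inline

sns.set_style("whitegrid")  # assumption: any seaborn default tweak would do here
filterwarnings("ignore")    # hide warning messages in the notebook output
```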
So a parametric test refers to a population parameter. Say you drew a sample of 30 subjects, and you take a certain variable and record its values. A parametric test assumes that the values for that variable come from a population in which those values are normally distributed. When we talk about a variable in a population, we call it a parameter, so we say the parameter is normally distributed. If I could, as far as human beings are concerned, get the values of all six billion people for one variable, I would call that a parameter. If that formed a normal distribution, a bell-shaped curve, I would use a parametric test on that data. If the values in the population are not normally distributed, I would use a non-parametric test.
The QQ plot is really there to help us decide about our samples, because remember, that's all we have. We don't have access to all six billion results. We only have, for instance, our 30. Is there a test that we can do to suggest, and it is only a suggestion, whether those 30 were taken from a normal distribution or not? QQ plot to the rescue.
First of all, let's just generate 40 random values. We've played with this sort of thing before. I'm going to create a computer variable called rv1_values; I could call it whatever I want. Because I imported stats from scipy, I have to say stats. before I can get into the stats sub-module. From that I want the normal distribution, so .norm, and then .rvs. The rvs method says give me some random values (random variates), please, but take them from a normal distribution. I pass three arguments, loc, scale and size, separated by commas. The first argument, loc, is the mean, so please let that normal distribution have a mean of 100. The scale is the standard deviation, here 20. And from there, please draw 40 samples. So I know that these 40 are going to come from a normal distribution.
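As a minimal sketch, the call described above could be written like this. The variable name follows the lecture; the printed summary is only an illustration added here.

```python
from scipy import stats

# Draw 40 values from a normal distribution with mean 100 and standard deviation 20.
# Each run gives a different sample unless a seed is passed via random_state.
rv1_values = stats.norm.rvs(loc=100, scale=20, size=40)

# Quick look at the sample: the mean and standard deviation should be
# near 100 and 20, but will not match exactly with only 40 values.
print(len(rv1_values))
print(rv1_values.mean(), rv1_values.std(ddof=1))
```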
I've asked the computer to do it for me and I've taken 40 of those. Every time you run this, you're going to get a different 40 from mine. Now the QQ plot is otherwise known as a probability plot, a prob plot, and that is part of the stats sub-module, which is part of the scipy module. We give stats.probplot three arguments, and we'll have a look at those. What it really does is check whether your values fit a normal distribution. It plots the data points against the corresponding quantiles of that distribution. That's all it does. It's then going to give you an R squared value as well, and this R squared value goes from zero to one. One means there's a perfect fit and zero means there's no fit at all, so the data really do not come from a normal distribution.
Let's see how we construct that. We're going to tell Python plt; remember, that was matplotlib.pyplot, and its abbreviation is plt. We're going to call the figure function. It says get ready to draw a figure, and make the figsize 8 wide and 6 high. Strictly speaking we don't need to use it here, because we're using probplot, but you can do that; there's no harm in it. As I said, we give stats.probplot three arguments. The first is the values you want plotted, and remember, those are my 40 values, which I called rv1_values. Then dist equals norm: you've got to tell it what kind of distribution you want to test your values against. There are various ones you can use; norm is the one we want, and it has to go in quotation marks, either single or double, it doesn't matter. And then you say plot equals plt, so use matplotlib.pyplot to do the plotting for you, please. Eventually I say plt.show. If you do stuff and your plot doesn't show, remember to just do plt.show with open and close parentheses, followed by a semicolon. Let's run that. And there we go: a beautiful, beautiful QQ plot.
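The plotting steps just described can be sketched as follows. The unpacked names (`osm`, `osr`, `slope`, `intercept`, `r`) are chosen here for illustration; `rvalue=True` is added because, as far as I know, recent SciPy versions only print the R squared annotation on the plot when it is requested.

```python
from scipy import stats
import matplotlib.pyplot as plt

# 40 values that are known to come from a normal distribution
rv1_values = stats.norm.rvs(loc=100, scale=20, size=40)

plt.figure(figsize=(8, 6))  # optional: probplot draws on the current figure anyway

# probplot returns the theoretical quantiles (osm), the ordered sample values (osr)
# and the least-squares fit through them, including the correlation coefficient r.
(osm, osr), (slope, intercept, r) = stats.probplot(
    rv1_values, dist="norm", plot=plt, rvalue=True
)
plt.show()

# R squared is the square of r; it varies from run to run
print(r ** 2)
```

For a sample that really is normal, the printed value should usually sit close to one, as in the lecture's example.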
So this red line represents a normal distribution and the blue dots are your 40 values. You can see that they follow the line very closely, and we can see an R squared value there of 0.9956. Every time you run this you're going to get something different. That's very close to one, and remember, I set it up that way. I cheated a bit: I took the values from a normal distribution, so I had better get a high R squared value here. But you can see that's what it looks like.
Let's do something else and take the values from a different distribution. rv2_values equals stats dot, this time not the normal distribution but the exponential distribution, dot rvs, which says give me a set of random values. This time, for the exponential distribution, I don't give it a mean or standard deviation, just the size, 40: just give me 40 values from the exponential distribution. Let's plot that. See, here I didn't do the plt.figure and I didn't do the plt.show. Strictly speaking, stats.probplot has all of that built in. Sometimes it's difficult to know when stuff is built in and when it isn't, and you just have to get used to the different syntax. So stats.probplot again with the three arguments: the values that I want, the distribution against which I want my QQ plot to be drawn, and plot, please use matplotlib.pyplot. We can just run that straight off the bat. Look at that. As I say, just a warning, you're going to get something different. But now these blue dots clearly don't follow the line. There's a bit that does follow, but it's all over the show, and I get an R squared value of 0.86. Now it's a bit tricky. Where is the cutoff? Is there some magical cutoff? No, there isn't.
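A sketch of the second example, under the same assumptions as before: the unpacked names are illustrative, and `rvalue=True` is added so that recent SciPy versions annotate the plot with R squared. The `plt.show()` call is kept so the sketch also works outside a notebook.

```python
from scipy import stats
import matplotlib.pyplot as plt

# 40 values from an exponential distribution (default loc=0, scale=1)
rv2_values = stats.expon.rvs(size=40)

# Same probability plot, still tested against the normal distribution
(osm, osr), (slope, intercept, r) = stats.probplot(
    rv2_values, dist="norm", plot=plt, rvalue=True
)
plt.show()  # not needed with %matplotlib inline, but needed in a plain script

# Typically noticeably lower than for the normal sample, but it varies per run
print(r ** 2)
```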
Clearly, though, once you drop from the low 0.90s into the 0.80s, as with 0.86 here, the fit is poor. And I know I cheated; I know this is not a normal distribution. You can almost see that the points make a curve, almost like part of some exponential curve, and an exponential curve is not a straight line. So really, that's the kind of value you're looking for. That sample set does not come from a normal distribution, therefore you have to use a non-parametric test.
So let's look at our normal t test. If we're dealing with numerical data, such as systolic blood pressure or some lab result like a white cell count or urea, those are ratio-type numerical data, and the quintessential parametric test for that is Student's t test. Student was the pseudonym under which William Gosset published while he worked for a brewing company. The story goes that they didn't want their competitors to know that he was working for them and doing statistical analysis. But anyway, that's his test, the t test. That's the quintessential test, and that's the one we're going to use most of the time.
If our data set looks something like this last one, we are going to use a non-parametric alternative to the t test. Examples of those would be the Mann-Whitney-Wilcoxon rank-sum test, also called the Mann-Whitney U test. We'll play around with some parametric and non-parametric tests. Again, these would be for ratio-type numerical data. When we deal with categorical data, we're going to use different types of tests. Excellent.
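As a rough preview of the two kinds of test named above, here is a hedged sketch using scipy.stats. The groups are made-up random samples, not lecture data, and the group names and distribution settings are assumptions chosen only to show the two calls side by side.

```python
from scipy import stats

# Hypothetical groups drawn from normal distributions: a parametric test is reasonable
group_a = stats.norm.rvs(loc=100, scale=20, size=40)
group_b = stats.norm.rvs(loc=110, scale=20, size=40)
t_stat, t_p = stats.ttest_ind(group_a, group_b)  # Student's t test, two independent samples
print("t test p-value:", t_p)

# Hypothetical groups drawn from exponential distributions: use a non-parametric test
skewed_a = stats.expon.rvs(size=40)
skewed_b = stats.expon.rvs(scale=2, size=40)
u_stat, u_p = stats.mannwhitneyu(skewed_a, skewed_b, alternative="two-sided")
print("Mann-Whitney U p-value:", u_p)
```

Both calls return a test statistic and a p-value; the p-values change on every run because the samples are random.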