This video is part of a recording of a lecture that I recently gave as an introduction to health care statistics, or biostatistics if you wish. It's part of a two-hour lecture, and first and foremost, we've got to realize that you really cannot teach statistics within two hours. So this video is really going to be an introduction. We're going to concentrate on probability theory and on comparisons; by that I mean we are actually going to look at t-tests, ANOVA, chi-square tests, etc. We are going to get to that, but we're going to approach it from a point of view that we really want an intuitive understanding of what these things mean, what they give us, and what a p-value is. So I think it's quite a worthwhile session and I really hope you learn something from it. Now, statistics really is not magic and it's not the panacea that we wish it to be. Even if we call it evidence-based, and we love the term evidence-based in medicine, it is the best that we have, but we've got to understand what the values that we read in journal articles really mean and what their limitations are. I will always maintain that health care science is really a science that's poorly understood. This is not physics. We do not understand it to the level of quarks, gluons and other subatomic particles. We really are skimming the surface in health care. It's such a young science, and I think it's important for us to realize that we really make use of surrogates when we do research. We use tumor size, for instance; that's a surrogate of some molecular-level activity that's taking place in a patient's cells. There is so much more to it. When we look at the variables that we use, in clinical research at least, we really use surrogates, and we've got a long way to go to fully understand disease and human physiology. We've made great strides, but we've got to understand our limitations and the limitations that our statistical analyses bring. 
I want to leave you, throughout this lecture, with at least a healthy understanding of these limitations, knowing that we are perhaps not using statistics as it was intended by the people who developed these equations and developed these tests. And perhaps, and I put it very strongly here, that p-values are evil. They're not really evil, but I think we put way too much emphasis on them. We are using them in the wrong way, and we must be careful when we see a statistically significant p-value not to just accept it at face value. We must understand what this is all about. I want to leave you with a few resources. You see the websites there. The first one is a Coursera course on understanding health care and life science statistics, biostatistics; a six-week course that you can follow there. For this lecture, I'm going to use the Mathematica program and the Wolfram programming language, just to show you the concepts of these tests, the concepts of probability, and what a p-value really means. If you want to know more about the actual coding that you are going to see here, which is not the emphasis of this lecture, there is another good course for you on Udemy, Mathematica for statistics, to teach yourself how to use this code to analyze your own data, should you be interested. So the first thing I really want to talk about here is just what variables are. It's so important to understand what a variable is and how that helps us determine what tests to do and what information we can get. Now, this is simulated data. It existed in a spreadsheet which I've imported into my Mathematica notebook here, and you see the first five rows of this data entry. And this is how you should view data. If I look across a row here, this first row, that would be a patient. This is all simulated data for a patient in a study. And each of these columns is a variable. 
The first variable is age. There are certain statistical tests I can do on this variable age. I can compare the ages between the two levels of this dichotomous gender variable here: I can group the ages by males and females and see if there's a difference between them. Age here would be a variable. Gender is a variable. The logistics, that is, where the patient was admitted after being seen, that is a variable. The admission temperature of the patient, that is a variable. Whether the patient was an insulin-dependent diabetic, that would be a variable. Everything is a variable. These variables are in these columns. And the data points for those variables, let's look at age, these data points 77 and 64, they are all of the same type. The entries, the data point values for my variable, are all the same. These would be integers, whole numbers. Those data points have a certain type, so my variable, here age, is of a certain type. So let's look at the types of variables that we do get, and as I mentioned, this will determine what kind of statistical tests we can do. The first two categories are numerical variables, you see them here, and categorical variables. Now, numerical variables are numbers, as you might assume. There are basically two different types: the interval type and the ratio type. The interval type doesn't really have a true zero. We can think of degrees Fahrenheit or degrees Celsius. Zero is not a true zero. So if we look at the Celsius scale, for instance, I cannot say that some object with a temperature of 40 is twice as warm as something with a temperature of 20, because that zero is not a true zero. The true zero exists on the Kelvin scale, and on that scale, 20 degrees centigrade or Celsius would be about 293 kelvin. I would need to multiply that by two to get an object that is twice as warm as another. This is not really of particular importance to us in biostatistics or life science statistics. 
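To make the interval-versus-ratio point concrete, here is a small sketch in Python (the lecture itself uses Mathematica; this translation is just an illustration): doubling a Celsius reading does not double the physical temperature, but doubling a Kelvin reading does.

```python
def celsius_to_kelvin(c):
    """Shift to the Kelvin scale, which has a true zero."""
    return c + 273.15

warm = celsius_to_kelvin(20)   # 20 degrees C is about 293 K
twice_as_warm = 2 * warm       # doubling is meaningful on a ratio scale

# Converted back, "twice as warm as 20 degrees C" is about 313 degrees C,
# not 40 degrees C -- which is why ratios on the Celsius scale mislead.
twice_in_celsius = twice_as_warm - 273.15
```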
What we are going to see is the ratio type. That has a true zero. Age has a true zero. We see the ages here; there is a true zero. Someone who is 30 years old is twice as old as someone who's 15, etc. Now, there's another way to look at numerical variables, and that is to see them as either discrete values or continuous values. And that's actually quite important. We are going to deal mostly, in the type of statistics that we'll see, with continuous numerical variables. So just look carefully: instead of saying they are either interval type or ratio type, I'm now stating they're either discrete or continuous. Discrete is what it says. These values are discrete. They come in packets of one unit, and I cannot have half a unit or one and a half units or 1.4 units. They come in discrete packets that I cannot break down any further. And I'm going to use a beautiful example outside of biostatistics, and I think everyone can guess what that's going to be: we're going to roll a pair of dice, or roll a single die. The value that lands face up is a discrete value. I cannot roll a one and a half. And we're going to have a look at that. Continuous variables, on the other hand, allow the existence of a value such as one and a half or 1.4 or 1.45 or 1.453297864. I can keep on splitting these numbers up. They are continuous. Then there's the other big group, categorical variables, and they come in two basic types: nominal and ordinal. Now let's have a look at some of these examples. This is not about writing the code; I'm going to use the Wolfram language here to write a few lines of code to give me some sample or simulated data points to use as illustration. So here I'm just going to use a random integer, and what it says there, don't be too concerned about the code, you can do one of the courses to learn how to code like this, but I'm just asking: give me 100 values between 18 and 80. And there we have it. 
We have a list of values, so I'm going to use that as a simulation. Let's imagine that this was a research project with 100 people, or patients, in it, and these might be their ages. So I have a random list of the ages of 100 patients. Now let's just look at a way of representing this numerical variable. You might see that I've put this under discrete, so I probably shouldn't have used age, because remember, you can be one and a half years old, two and a half years old. Perhaps age was not the best example; let's just assume that these are values that cannot be divided any further. So I've just got this list of values. Now, it's really difficult for a human being to look at a list of values like that and get some sort of understanding of what's going on. So we would like to summarize this list of 100 values, to generate perhaps one single value that can stand as a representative for all of these. The one thing, of course, would be a mean. So there we have the mean, and the mean value is 49.52. That's just adding up all those values and dividing by 100. And that 49.52 stands as a representation for all of these other values. It gives us a ballpark figure to work with when we think about this data set of patients. The median, of course, puts all these values in order from smallest to largest, all 100 of them, and finds a value such that it completely divides this data set in two: half of the 100 will be less than that value and half will be more. So it's about dividing the data set, not the actual values as such. And we see the median is 50.5; in other words, 50 of these 100 values will be less than 50.5 and 50 will be greater than 50.5. So that gives us a point estimate, a measure of central tendency, a single value to represent our data set. But that's a central point somewhere in the middle. 
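The lecture does this simulation and summary in Mathematica; as a rough equivalent sketch in Python (the seed and the specific numbers are my own choices, so the results will differ from the 49.52 and 50.5 on screen):

```python
import random
import statistics

# Simulate 100 "ages" between 18 and 80 inclusive, as in the lecture.
# The seed is arbitrary; it just makes the run reproducible.
random.seed(0)
ages = [random.randint(18, 80) for _ in range(100)]

mean_age = statistics.mean(ages)      # add all 100 values, divide by 100
median_age = statistics.median(ages)  # the value that splits the sorted list in half

print(mean_age, median_age)
```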
But we also need to know how these values are spread. For that, we have the standard deviation. So let's just look quickly at what the standard deviation is. We see the standard deviation is 19. What the standard deviation really does is it takes this mean of 49.52 and looks at the difference between each of these values and 49.52. So there's a big gap between 19 and 49.52; there's a smaller gap for 53. We're just subtracting 49.52 from each of these. Now, some results will be negative, as 19 minus 49.52 is going to give you a negative value, but 53 minus 49.52 is going to give you a positive number. I want to add all of these, but I can't add negatives and positives directly, so I square all of these differences. Then I divide by how many I have, and that was 100, which actually gives me the variance. But because I squared everything, I have to take the square root again to bring it back to the original units, and that is the standard deviation of 19. So what we really have there is, roughly speaking, the average distance between each value and the mean. Another way to look at the spread of your data, of course, is just to look at the minimum value, and we see the youngest patient was 18. Remember, this is simulated data I created, done in a random fashion, so every number between 18 and 80 had an equal chance of being chosen each and every time for each of the 100 values here. And we see the maximum is 80. So if you were to report on this data set, you might give the median, the minimum, and the maximum, which gives the reader of your report an idea of what this data was. By giving your reader just three values, you have a good representation, a good summary, of all 100 values. Now we can also divide the data not only into halves, but into quarters, and those are called the quartiles. So the first quartile is 35. 
And that means a quarter of those 100 values will be less than 35 and three quarters will be larger. The second quartile is also the 50th percentile, or the median; you see it's exactly the same as the median, so half is less than it and half is more. And the third quartile here is 67, meaning three quarters of the values will be less than 67 and a quarter will be more. Lastly, there's the interquartile range, which is just this third quartile minus this first quartile, here 67 minus 35, giving 32. We can use that to decide if a value is, for instance, a statistical outlier: the usual convention is to multiply the interquartile range by 1.5, and any value more than that distance above the third quartile, or more than that distance below the first quartile, we might see as a statistical outlier. So we can use this interquartile range. Let's look, though, at a simulated continuous numerical variable. I've asked the Wolfram language to give me 100 values between 5 and 25, but this time I want them to be real numbers. So you see 17.0368 here, just to illustrate that these numbers can take smaller and smaller decimal values. They really do not come in a discrete packet. So that's all I wanted to show you here. Well, at least two things: that there is this idea of a discrete value, the roll of a die, and there is this idea of a continuous variable, where plenty of decimal places are possible and those decimal places can basically carry on forever. The other thing I wanted to show you here is just some descriptive statistics: forgetting for the moment what these values really are, what type they are, and just representing them in some descriptive way. Let's move on to something more interesting: nominal categorical variables. Now we're moving away from numbers into categories. The first type is nominal, and nominal means I can't put any natural order to these things. 
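Before moving on to categories, the spread measures just described can be put together in one short Python sketch (again, the lecture uses Mathematica, and the simulated data and seed here are my own):

```python
import math
import random
import statistics

random.seed(1)
ages = [random.randint(18, 80) for _ in range(100)]  # simulated ages

mean_age = sum(ages) / len(ages)

# Square each difference from the mean so negatives and positives don't
# cancel, average the squares (the variance), then take the square root
# to get back to the original units (the standard deviation).
variance = sum((a - mean_age) ** 2 for a in ages) / len(ages)
std_dev = math.sqrt(variance)

# Quartiles and the interquartile range.
q1, q2, q3 = statistics.quantiles(ages, n=4)
iqr = q3 - q1

# Tukey's convention: values beyond 1.5 * IQR past the quartiles are
# flagged as statistical outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [a for a in ages if a < lower_fence or a > upper_fence]
```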
So once again, I'm simulating some data here. I'm saying: give me values for 100 patients, and choose from one of these three diseases for each patient: insulin-dependent diabetes, chronic obstructive airway disease, and hypertension. There's no natural order to these. How would you order them? These are just terms, these are categories. So there are my 100 patients: the first one had COAD, then hypertension, hypertension, hypertension, another airway disease, another hypertension, and so on. So there are 100 values in no specific order. All I can do is count them. So let's just do that. Instead of the raw list, I get a count, and I see that in my data set of 100 simulated patients, most of them had COAD, and that would be representative of this data set. That would be the mode, the value that occurs most commonly. Now let's move on to a categorical variable that has a natural order to it. Here I'm assuming that in your research you gave some people a survey to complete, and that survey had statements, and for each statement they could choose: strongly agree, agree, neither agree nor disagree, disagree, strongly disagree. You can imagine there's some natural order to this. So let's simulate 100 people responding to that statement in the survey, and I count again, and in my simulation here, let's look at what was most common. It looks like there are 23 for neither agree nor disagree. Once again, that would be the mode. Now we move on to a more controversial type of ordinal categorical variable, and that is converting a category into a number. This might be a pain score: a patient had a procedure, and afterwards they need to state what pain they have, from one, the least amount of pain, to five, excruciating pain. 
So let's simulate 100 values there, 100 patients in my data set, and I see the first patient chose one, the second patient chose three. You might suggest that these are numerical, and discrete at that, but they really are not. These are not numbers, because there's a fundamental thing I cannot say here: I cannot say someone who chose four felt twice as much pain as someone who chose two. There isn't a fixed numerical difference between these choices. It doesn't exist. It doesn't make any sense, and therefore it really doesn't make sense to take the mean of these values, because you're going to get a mean of, say, 3.8, but what is a pain value of 3.8? How do you interpret that as a human being? It's impossible. It makes no sense. That's not how the scale was developed, and it's really dangerous to see these as numerical variables. Instead, we'll count them again, and we see there were 17 threes, 17 fours, 21 twos, 22 fives and 23 ones. So one was the mode. I'd express this as a mode, not as a median and not as a mean, because there isn't a fixed numerical difference between the levels. They are really not numbers. And if I were to do something like this, let's simulate another set, but this time make them integers and ask what the mean is, I get a mean of 2.93. How do you interpret a pain level of 2.93? So it's a dangerous thing to express these as numbers. They have an order to them, but they remain categorical. They are ordinal categorical variables. 
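A sketch of this counting approach in Python (the lecture uses Mathematica; the seed and the simulated scores are my own, so the counts will differ from the 17/17/21/22/23 on screen):

```python
from collections import Counter
import random

random.seed(3)
# Simulated ordinal pain scores: five ordered categories, not true numbers.
scores = [random.choice([1, 2, 3, 4, 5]) for _ in range(100)]

counts = Counter(scores)
mode_score, mode_count = counts.most_common(1)[0]
# The mode is the honest summary here; a mean such as 2.93 has no
# interpretation on a pain scale.
```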
So those are the different types of variables, and ways to describe them, either through measures of central tendency, mean, median and mode being the most common ones, or through measures of dispersion, which might be the standard deviation, or the square of the standard deviation, which is the variance, or something like the range, minimum to maximum, or the interquartile range, first quartile to third quartile. Now you have a better understanding of the types of variables, and you really must be able to look at any data set and decide what type of data point values are held inside each variable. Once you've gone through your data set and done this descriptive statistics, with measures of central tendency and measures of dispersion, you get some intuition about what is going on in your data set. What are the numbers starting to tell you? These data set values want to tell you a story, and you've got to drag that story out of them. The next thing I like to do is to visualize the data. Visualization, in my opinion, is even better than just staring at these numbers. So the first thing we might do is a list plot, or scatter plot, and I just want to show you a few examples. One thing very important to remember with a scatter plot is that there is a numerical variable on the one axis and a numerical variable on the other axis. So imagine that I have patients once again, and I have two variables for them, the hemoglobin level and the white cell count. Each dot would represent a patient: there would be the value for the one continuous numerical variable and there the value for the other. So it's continuous numerical against continuous numerical, and that's a scatter plot. Now, there might be what we call correlation between these. So let's simulate these data points. 
Let's imagine that at the bottom we have a white cell count, and on the other axis C-reactive protein, which might indicate infection. Once again, it's a numerical value versus a numerical value for each patient. But we see these two variables are not independent of each other: as the one increases, the other one increases as well. This data is trying to tell me a story, that there is some correlation between these two. There's no proof of causation; we cannot prove that the increase in the one causes the increase in the other, but there seems to be some correlation. Of course, we can also get negative correlation, and there we go: it seems that as the one continuous numerical variable increases, the other one decreases. Just for interest's sake, I can bring a third dimension into a scatter plot: a numerical variable on the x-axis, a numerical variable on the y-axis, and the size of the dots representing a third numerical variable. So it seems that that third variable also increases as both of these increase. With a linear regression, I might find that I can predict the one value given some of the other values. Let's move on to another type of representation, where we move away from numerical against numerical and look at a box and whisker plot. I'm going to simulate two groups of patients, each with 200 patients. I'm going to stick to this concept of a patient, but your data might come from a laboratory, and then we're dealing with specimens instead of patients. So just bear with me if I overuse the term patients. Imagine I have two groups here. I have a numerical variable for the one and a numerical variable for the other, but these two groups make up a categorical variable: this is group one and group two. There's no natural order to those; they are nominal. This might be smokers and non-smokers, or with diabetes, without diabetes. 
There are just these two categorical groups, but within those groups I'm looking at the same numerical variable. And what these box and whisker plots tell me: this white line in the middle, that's the median, and the edges of the box are the first and the third quartiles. Then I multiply the interquartile range by one and a half, add that to the third quartile and subtract it from the first, which gives me the ends of these whiskers, and every value that falls outside of that, like this little one here, we might view as a statistical outlier. But a box and whisker plot, remember, takes a categorical variable on the x-axis and looks at the same numerical variable on the y-axis for each of these groups. So what might this be? Let's suggest some liver enzyme value, and let's make the groups heavy users of alcohol and people who abstain from alcohol, for instance. It's the same numerical variable for both, but across different categorical groups. So it's a categorical variable against a numerical continuous variable. It's very important to understand your variables. Let's look at a histogram. Let's simulate some values up here. There we go. What we see here is that I've got a variable at the bottom; I've made it 18 to 80 again, with 500 patients, and let's imagine that this is age. So there are the patients' ages, and I'm going to create little bins: everyone less than 20 years, 20 to 30 years, 30 to 40 years, and count how many patients fall inside each little bin. So what I've really done is, from a continuous numerical variable, age, I have created a categorical variable, because you can view each bin as a little category: the category of people less than 20, 20 to 30, 30 to 40, etc. 
It doesn't matter how you view that, but at the bottom we have some representation of a numerical variable, and we count on the y-axis. So there were 13 that were less than 20, 86 between 20 and 30, 78 between 30 and 40, etc. Now I can ask for this to be represented in a slightly different way: not the absolute count, but the proportion of the counts. Remember, I simulated 500 patients, or 500 specimens, it doesn't matter, there were 500 entries in my data set, and of those 500, 13 were less than 20, and 13 divided by 500 gives me 0.026. The 86 of the 500 represent 17.2 percent of the data set; between 30 and 40 was 15.6 percent. But I want you to start thinking of something. Imagine those 500 people stood in front of you and you picked one at absolute random. I can now ask you, just from viewing this: what was the probability of choosing someone of a given age from those 500? You closed your eyes, 500 people stood in front of you with this range of ages, and you drew just one. What was the probability of finding someone in the oldest bin? Lo and behold, it was 2 percent, 0.02, and I want you to start viewing that as a p-value. What was the likelihood of you, blindfolded, having found someone from that oldest bin among those 500 patients? It was 0.02, and you can well imagine that that was a statistically significant finding. There were not many people that old in my data set, so what you found by drawing a single person was a statistically significant finding. Start seeing it in this way, and I can represent that p-value as something physical: the area of a rectangle. I'm now viewing these as discrete units, so the base is 1 and the height is 0.02; the surface area of this little block is 0.02. Because if I take all of these surface areas and add them together, it will equal 1. That's all 500, 500 divided by 500, the whole data set. 
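The idea that a histogram's proportions are probabilities can be sketched in Python (the lecture's histogram is drawn in Mathematica; the bin edges, seed, and the threshold for the "oldest" bin here are my own choices for illustration):

```python
import random

random.seed(4)
# 500 simulated ages between 18 and 80, as in the lecture's histogram.
ages = [random.randint(18, 80) for _ in range(500)]

# The proportion of patients in a bin is exactly the probability of
# drawing one of them blindfolded from the 500.
p_oldest_bin = sum(1 for a in ages if a >= 75) / len(ages)

# All the bin proportions together must add to 1 -- the whole data set,
# 500 divided by 500.
bins = [(18, 30), (30, 40), (40, 50), (50, 60), (60, 75), (75, 81)]
proportions = [sum(1 for a in ages if lo <= a < hi) / len(ages)
               for lo, hi in bins]
```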
So all these areas combined I can view as 1, and the physical area of this little block is 0.02. It's not absolutely true what I said just now, but we are developing this intuition of what a p-value really is. Now, I can be very fancy and ask Mathematica, using the Wolfram language, to draw me a smooth histogram, and it'll use a complicated equation to turn these little step functions into a smooth function. But the area under this curve, and that is all I want you to see here, would still be 1; it represents all of my patients. Anyone that I draw would fall somewhere here. So keep these things in mind; I'm not telling the full truth here, I just want you to start thinking about area as representing a p-value. Now, the last type of visual representation is a bar chart. There we have a bar chart, and we're just counting: I've created these values, there were five insulin-dependent diabetics, six with obstructive airway disease, and three hypertensives, and I can just count them. So that is again a count of a categorical variable; that is a bar chart. So there is something slightly similar between a histogram and a bar chart. Let's move on and talk about distributions, and this is where we're going to develop this intuitive understanding of a p-value being a physical area. A circle has an area, pi r squared; a trapezium has a formula for its area; a square has a formula for its area. Area is a p-value, and I'll show you exactly how to develop an intuitive understanding of that. 
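The dice-rolling simulations that this part of the lecture works through in Mathematica can be sketched in Python as follows (seed and variable names are my own). The sketch compares the empirical proportions of 10,000 rolls with the theoretical discrete uniform probability 1/(b − a + 1), and then sums pairs of dice:

```python
from collections import Counter
from fractions import Fraction
import random

def discrete_uniform_pmf(x, a, b):
    """P(X = x) for a fair discrete uniform on the integers a..b:
    1 / (b - a + 1) inside the range, 0 outside."""
    return Fraction(1, b - a + 1) if a <= x <= b else Fraction(0)

random.seed(5)

# Roll one fair die 10,000 times; each face comes up about 1/6 of the time.
rolls = [random.randint(1, 6) for _ in range(10_000)]
counts = Counter(rolls)
empirical_p_one = counts[1] / 10_000               # close to 1/6, about 0.167
theoretical_p_one = discrete_uniform_pmf(1, 1, 6)  # exactly 1/6

# Now roll a pair of dice and add them: sums run from 2 to 12, with 7 the
# most likely and 2 or 12 the least likely (1/36 each in theory).
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(10_000)]
sum_counts = Counter(sums)
p_double_six = sum_counts[12] / 10_000
p_ten_or_more = (sum_counts[10] + sum_counts[11] + sum_counts[12]) / 10_000
# In theory: 3/36 + 2/36 + 1/36 = 6/36 = 1/6.
```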
So what I'm going to simulate here is rolling a die, a fair die. It has six faces: one, two, three, four, five, six. Every time I cast my die, one of the faces lands face up, and I'm going to simulate that with this very simple Wolfram language line of code: a random integer between one and six inclusive, and please give me 10,000 of those. I'm going to store that in a computer variable called die, and now I'm just going to ask the computer to show me how many of each I rolled. You can well imagine that out of 10,000 rolls of a fair die, these values are very close to each other, because every value had an equal likelihood of landing face up. It was only 10,000 rolls, and you might think that that is a lot, but it's not; still, you can see these values are close to each other. There were almost as many ones as twos, as threes, as fours, as fives, as sixes, as you would imagine. Now, someone really clever came along and said: if I can choose between these six discrete values every time I do an experiment, my experiment being rolling a single die, there must be some mathematical equation that will tell me what the probability is of rolling any one of these. Intuitively it is very easy to imagine: it would just be one over six. But there is a mathematical equation I can write for this, and that's what I'm putting on the screen here. That's called a PDF, a probability density function, and every distribution has one: it says how these values are distributed, how likely it is to find a certain value. This one would be called uniform discrete, because every face has an equal likelihood of landing face up. The equation for that is one over b minus a plus one, where a is your minimum value and b is your maximum value. The zero here says there's zero likelihood of anything outside of a and b: I cannot roll a zero, and I cannot roll a seven or an eight or a nine, so 
that's what this piecewise equation says. It is piecewise like this because it gives the probability of rolling a three or a two or a one, as long as the value falls between those two bounds, and that probability is one over b minus a plus one. So here we have it: one in the numerator, and in the denominator b minus a plus one; the minimum was one, the maximum was six, and there we have it, one over six. That is a probability density function: it gives us a mathematical way of expressing the straight line at the top there, and all it says is that to roll a one has a probability of one over six, and to roll a two likewise. And again, I want to ask you to see this just as a p-value, and I want to do that by looking at this histogram once again, this time as probabilities. Now, it might have been slightly confusing when I made little bins out of the ages; here there's no ambiguity. This rectangle has a base of one; discrete values always have a base of one. And the height: well, 1,705 out of the ten thousand were a one, so repeating my experiment ten thousand times gives me an empirical probability of rolling a one of about 0.17, and likewise roughly 0.17 for each of the other faces, all close to the theoretical probability of 0.1667, one over six in numerical terms. But once again, I can ask you: please roll a single die for me. You roll a one. I can say: what was the probability of rolling a one? Theoretically it was 0.1667, one over six. That was the probability of having done your experiment and rolled a one, almost a p-value of getting a one. And again, very easily, base times height of this representation gives me my p-value. Keep that in the back of your mind as we develop this intuition. Let's move on to the normal distribution. What I want to do is roll the dice again, with values of one to six, ten thousand times, but this time rolling two dice, so a pair 
of dice are now being rolled, and I do that ten thousand times. Let's look at my first ten simulated rolls: I rolled a one and a five, a six and a three, a four and a one, a six and a two, a two and a six, and so on. Lovely: values between one and six inclusive every single time. And all I'm asking Mathematica to do for me now is to add each pair; that's what this line of code does. So one plus five is six: the first time I rolled my pair of dice, I rolled a six. Then I rolled a nine, then a five, then an eight, then another eight, then a five, then a ten, etc. So I've added all ten thousand pairs, and let's look at how many of each sum I got. You can well imagine that the minimum was two, because the lowest I can roll is a one and a one, and the maximum, with double sixes, is twelve. How many twos were there? Well, there were 265. But look at this: there were more threes, even more fours, even more fives, even more sixes, a lot of sevens, and by the time I get to eight it goes downhill again, and downhill and downhill. Now look at this: let's draw that. A beautiful normal-looking distribution, don't we have here? Much more likely to roll a seven, if I add the two values up; very unlikely to roll double ones, and just as unlikely to roll a double six. Again, let's use the trick of dividing by how many times I rolled, by the ten thousand, which gives me almost a probability density plot. And yet again, I can look at this and say: if I give you a pair of dice and you roll them, what is the probability of rolling a double six? There we go: it's 0.0268 in this simulation. It is the base, which is one, times the height of this little rectangle; the geometric area of this little rectangle, if I express it in this form, gives me the p-value. I can now ask you: what is the probability of rolling ten or more, a ten, an eleven or a twelve? It will be this area plus this plus this; you can simply work that out, very 
easy to do. Now, this still uses the concept of a discrete variable, one I cannot subdivide; I can't roll an eleven and a half, if you like, and that's why there are these blocks at the bottom. I'm viewing the outcomes as discrete values, as we did with the age bins earlier: a base of one multiplied by the height. The normal distribution at the bottom, though, is for a continuous variable, so those little blocks no longer have a base of one. Remember what I said about continuous numerical variables at the beginning: I can make the steps smaller by adding decimal places. 8.76543 is not the smallest step; I can add more decimal values and more decimal values, so the base shrinks and shrinks until we arrive at the idea of a continuous distribution. The normal distribution has its own probability density function, and in it we see a few things: there's mu (I hope you can see it, it's a bit small), which is the mean, and there's sigma, which represents the standard deviation. Depending on the mean and the standard deviation we can draw a normal distribution, so let's plot one. There we go. I've written the code so that I can slide the mean across, and I can also change the spread in the data. What I want you to realize is that this is exactly the kind of thing we had above: probability is on the y-axis here as well, but at the bottom there are no little rectangles anymore, because these are not discrete values; they are continuous, and those tiny slivers of rectangles become so small they vanish, so that everything is continuous. Once again, though, try to understand that this curve represents all the likely outcomes.
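In the lecture these dice simulations are done in Mathematica; as a rough stand-in for readers who prefer Python, here is a sketch of the pair-of-dice experiment using only the standard library (the seed and the exact counts are my own, not the lecture's).

```python
import random

random.seed(42)  # reproducible simulated rolls

N = 10_000
# roll a pair of dice N times and record the sum of each pair
sums = [random.randint(1, 6) + random.randint(1, 6) for _ in range(N)]

# count how many times each total (2..12) occurred
counts = {total: sums.count(total) for total in range(2, 13)}

# divide by N: relative frequency approximates probability (base 1 x height)
prob = {total: counts[total] / N for total in counts}

# empirical probability of rolling 10 or more: P(10) + P(11) + P(12)
p_ten_or_more = prob[10] + prob[11] + prob[12]

print(prob[7])        # seven is the most common total, near 6/36
print(p_ten_or_more)  # also near 6/36 = 0.1667, theoretically
```

Seven peaks because more pairs add up to it (1+6, 2+5, 3+4 and their mirrors), exactly the shape the lecture plots.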
The likely outcomes of rolling two dice; this might equally be all the likely outcomes of an experiment. I might have two groups of patients and measure some liver enzyme in each group. One group will have a certain mean for that enzyme, the other group a slightly different mean, and I subtract the two from each other. What I'm saying is that the difference you find between your two groups will lie somewhere on this x-axis, and it's very easy for us to work out the probability of finding that specific value. Now let's be clear about one thing: I can't do this anymore, I can't just read across as I simulated with the mouse. When I come here there is no little base from which to make these little rectangles. All I can do now is draw a line and ask what the probability was of getting this value or more. So I'm going to colour in this area under the curve, and that will be my p-value; we're doing exactly the same thing, because this area under the curve is the same idea as the area of one of those blocks, and it gives us a p-value. As far as this mu and sigma are concerned, let me show you that if I take a normal distribution with a mean of mu and a standard deviation of sigma, the mean of that distribution is mu and its standard deviation is sigma. The reason I show you that is that we don't always have the mean and standard deviation, because what this mu and sigma refer to is actually the greater population. Once again, if I use patients: there are about seven billion people on Earth, and mu would be the mean of their liver function test, sigma the standard deviation across all seven billion. That's what I'm referring to here. When I do a study I only have 30 patients, or 60, or 80; whatever specimens you have, you have a
limited number; you certainly don't have the full population, so we don't have access to those values. Then we don't actually use the normal distribution; we use a very clever distribution called the t distribution. I can show you what its equation looks like: quite a bit more complicated, with the beta function in there, but it depends on a single parameter, which I've written here as delta, the degrees of freedom. Degrees of freedom can be a difficult concept, but here it's easy: if I have 60 patients in my dataset and they come in two groups, 60 minus 2 is 58, so this value will be 58. Let me draw this distribution on the screen. There we go: I've drawn it with two degrees of freedom and with 30 degrees of freedom, and you can clearly see it looks almost like a normal distribution. So we have a little equation, but it does not depend on the population standard deviation. I don't need that information; all I need to know is how many people or specimens are in my study and how many groups I have, and from that this equation draws this beautiful little curve. The value that I find in my little study will be one of the points on its x-axis, and we can work out an area under the curve, which gives us, you guessed it, a p-value. Let me show you another distribution: the chi-squared distribution. There is its equation, the function that we can draw, and on the screen you see it also uses degrees of freedom: the blue curve has two, the orange six, and the green twenty. All these distributions come in this nice pattern with a nice little equation, and I can draw these curves and calculate areas under them; the area under each of these curves is one, representing all the possible outcomes.
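The t and chi-squared curves shown on screen come from Mathematica; a comparable sketch in Python, assuming NumPy and SciPy are available, might look like this.

```python
import numpy as np
from scipy import stats

# t distribution: one parameter, the degrees of freedom
# 60 patients in two groups -> 60 - 2 = 58 degrees of freedom
df = 58
t_dist = stats.t(df)

# with this many degrees of freedom the t pdf is already very close
# to the standard normal pdf
x = np.linspace(-4, 4, 1001)
gap = np.max(np.abs(t_dist.pdf(x) - stats.norm.pdf(x)))

# chi-squared distributions with 2, 6 and 20 degrees of freedom,
# like the blue, orange and green curves in the lecture
chi2_curves = {k: stats.chi2(k) for k in (2, 6, 20)}

# the total area under each pdf is one: all possible outcomes together
area_t = t_dist.cdf(np.inf) - t_dist.cdf(-np.inf)

print(round(gap, 4), area_t)
```

The small `gap` value is the whole point of the plot in the lecture: at 30 or more degrees of freedom the t curve is nearly indistinguishable from the normal curve.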
outcomes that are possible: if you do one study, yours will be one of those. Let's develop that knowledge a little further, and I'm going to ask you to stretch your imagination a bit. Sticking with human beings as my experiment, imagine that there are only 20,000 people on the face of the Earth. Not seven billion; 20,000. I'm going to measure something in them, draw some blood or do something to these patients, some continuous numerical variable; again, this might be a liver function test, anything you want. And I'm going to simulate these 20,000 human beings and that variable for them. Because I used a discrete uniform distribution, when I look at this histogram I see that if I were to draw someone at random from my population of 20,000, the probability of that patient having any particular value between 50 and 110 is almost exactly the same. There's a little variation, but certainly no particular pattern to this variable in this population. The mean of my simulated set worked out to 79.9, close to the theoretical 80: for a discrete uniform distribution the theoretical mean is the minimum plus the maximum divided by two, which is exactly 80 here. That's not the important part. All I want you to understand is that every value a randomly selected patient might have for this variable is as likely as any other value in the range from 50 to 110. So imagine there are only 20,000 people on the face of the Earth and I want to do an experiment on them. Because I don't have seven billion people, and I might not have the money, my experiment is only going to consist of two patients. I'm going to draw two patients at random from my 20,000
population, and I'm going to measure this variable in them, and from those two patients I'm going to calculate the average. One patient might have been 71 and the other 81-point-something; I take the average between the two and jot it down on my computer or a piece of paper. Imagine I come back tomorrow and, at complete random, draw two people again, do my experiment, take down those measures and calculate the average; the next day another two people at random. Now that's not the way it works: if I were to do real research I'd only do this once. But imagine we could do this experiment 25 times, each time with two different patients chosen at complete random in my little trial, and I jot down those 25 averages. Let's have a quick look at the distribution of these 25 means, given the fact that the underlying variable itself has no real discernible pattern, if I can use that statement, just that flat uniform distribution. But look at this: when I plot the distribution of the means (no longer the distribution of the variable itself, but of the means), suddenly a pattern is developing. Bar this little one here, it almost starts looking normal. Now let's ramp this up. Instead of taking only two patients every time I rerun my experiment, I take 30 people at complete random from my population of 20,000, measure this liver enzyme value or whatever it is, and calculate the mean for my 30 patients. I come back the next day and repeat the experiment all over again: take 30 people at random, test them, get the mean. And I do this 500 times over and plot the distribution of those possible means. Something spectacular is starting to develop here: I'm starting to see a
normal distribution here. The distribution of sample means is normally distributed, and that is called the central limit theorem. It underlies most of the tests that we do in healthcare and life science statistics: the distribution of sample means is itself normally distributed. Understand that very clearly; it is fundamental to what we're trying to achieve here, and the central limit theorem guarantees it, irrespective of the underlying distribution. Now, I could not go and do my experiment 500 times. I'm only going to do one research project, and it's going to have 30 patients in it. But they come from this larger population, and if I calculate the mean value for my 30, it is going to be one of these possible means. This was only simulated 500 times; imagine I simulated it a billion times. I tell you, that would be a smooth normal distribution. And some means will occur very frequently: in my simulation, 15.6% of the time the mean fell in this particular range, while it was very unlikely to find one of these out here. And that's a p-value. So what is happening here? A distribution is plotted (most of the time it will be a t distribution) which in effect simulates repeating your experiment a billion times over. Yours is but one of those possible outcomes, and it falls somewhere on this curve, allowing me to use mathematics to calculate the area under a smooth curve. That is a p-value. One thing, for those of you who are interested: the distribution of sample means does not have a standard deviation. For its spread it has something we call a standard error, and at least for the normal distribution, that is the standard deviation divided by the square root of the sample size.
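The repeated-sampling demonstration above can be sketched in Python as well; this is an illustrative simulation with my own seed and numbers, not the lecture's data, using only the standard library.

```python
import random
import statistics

random.seed(1)

# population: 20,000 people with a variable uniform on 50..110 (no bell shape)
population = [random.randint(50, 110) for _ in range(20_000)]

def experiment(n):
    """Draw n people at random and return the mean of the measured variable."""
    return statistics.mean(random.sample(population, n))

# repeat the 30-patient experiment 500 times and keep each sample mean
sample_means = [experiment(30) for _ in range(500)]

# the means cluster around the population mean of ~80 (central limit theorem)
center = statistics.mean(sample_means)
spread = statistics.stdev(sample_means)  # the standard error, seen empirically

# theoretical standard error: population standard deviation / sqrt(n)
se_theory = statistics.pstdev(population) / 30 ** 0.5

print(round(center, 1), round(spread, 2), round(se_theory, 2))
```

Even though the population itself is flat, the histogram of `sample_means` comes out bell-shaped, and its spread sits close to the theoretical standard error.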
Divided by the square root of how many there are; that is the standard error, just for interest's sake. We don't talk about standard deviations anymore when we talk about the distribution of a statistic; we talk about standard error, and for all these distributions there are different ways of calculating it. So let's have a look at this. Simulating this happening billions of times gives the theoretical distribution of sample means: I'm using mu and my standard error from my little experiment here, but simulated many, many times over, which gives me this smooth curve. If I were to take 30 people and calculate the mean for that variable, my mean would be converted into units of standard error, which places it somewhere on the x-axis, and I can calculate the area under the curve. Let's look at this in a graphical representation, just to hit home what we're trying to say: if you could repeat your experiment a billion times over, your value would be just one of many possible ones, and the distribution of those possible outcomes lies on a curve like this. Because the bottom axis is continuous, I can't make little rectangles to calculate the area under the curve; I have to use something else, and that something else, by the by, is integral calculus. So here is my theoretical distribution of possible outcomes. I calculate where this line is to be drawn so that this orange area (not to scale) represents five percent of the area under the curve. Now I look at what I found, perhaps here. I convert my finding into units of standard error, which gives me a statistic that I can plot on the x-axis; I draw my line there, and now I calculate what the likelihood was of finding this value or greater, or, if it was on the other side, this value or less. Remember, I don't have a rectangle; I can't just read it off. The only area I can calculate
is from this line outward. That becomes my area, instead of the little rectangle, because these little bases become infinitely small with continuous values. This is a p-value, and in this instance it is less than five percent, as you can clearly see, so the value that I found was a statistically significant value. This is called a one-tailed hypothesis test. Now, what is this hypothesis? That's another thing you have to understand clearly. Before we start any research we must have a hypothesis, and the first one, the one that we assume to be the truth, is called the null hypothesis. I'm going to take two groups of specimens, humans or some laboratory experiment, and I'm going to measure something in those patients or specimens: that's my variable. The one group will have a mean of a certain value, the other group a mean of a different value. I subtract those two from each other, and that is my finding, my difference in means. My null hypothesis is that there is no difference between the means. I accept that as the truth, and then I go out and do an experiment to see if that assumption actually holds. I put this threshold of five percent on it, 0.05, and if I find a p-value smaller than that, I reject my null hypothesis and accept what is called my alternative hypothesis. So if my null hypothesis was that there is no difference in the means between these two groups, or three groups, or whatever the situation and test might be, my alternative hypothesis states that there is a difference. What we're actually dealing with is this situation here, just a graphic I drew in Adobe Illustrator. Imagine my difference was such that one group was more than the other, and I draw my line there. I have to duplicate this black line on the other side, the negative side, as well, because my alternative hypothesis stated only that there is a difference. I
did not know beforehand whether the one group would have an average more than the other group or less than the other group; my alternative hypothesis just says there is going to be a difference. If I subtract 20 from 30, my difference is 10; if I subtract 30 from 20, it's negative 10. It just depends which group I put first: the average of one group minus the other, and if I change the order, one difference is positive and one is negative. So I duplicate my finding on both sides, and I also take my orange little lines and, in my mind's eye at least, make them two and a half percent on each side, 0.025 and 0.025, so each is slightly smaller than before. That is really what you always have to do as a researcher: state a two-tailed alternative hypothesis beforehand, simply that there is a difference, and duplicate the difference that you find on both sides, so your p-value is actually twice as big as it could have been. A one-tailed hypothesis test is different. For that, I have to agree with my peers, with logical arguments made before the experiment is done, why the one group would necessarily have a mean more than the other group, and we all have to agree that those arguments hold. I cannot just decide after the fact to change from a two-tailed to a one-tailed alternative hypothesis, because what that does is automatically divide my p-value by two. So say I found a p-value of 0.06. I think: oh, that is not statistically significant. Oh, but I always knew this group would have a mean more than the other, so let's switch to a one-tailed hypothesis. I take my p-value of 0.06, I don't duplicate it on both sides, I just keep it on one side, and lo and behold, my p-value is now 0.03. Oh look, I
have found a statistically significant difference between these groups, my p-value is 0.03! No. You cheated. Beforehand, you have to decide: am I sticking with the convention of a two-tailed alternative hypothesis, stating only that there is a difference, meaning I use the standard calculations, which duplicate the area on both sides behind the scenes of the code or the test I use, and I get this two-tailed p-value? I cannot then change it after the fact to a one-tailed test by simply halving my p-value. You cannot do that; you have to decide this before. So, a lot of information here; we spoke about a few different things. First of all, your difference is going to be one of theoretically many different ones, and most of the time you're going to find one of the common ones. You colour in this area under the curve, so all of this side is black, and then all of that side is black; that's a lot more than 0.05 of the area under the curve, so yours is not one of the values that would be rare to find. You found one of the common ones, we do not reject the null hypothesis, our p-value is larger than or equal to 0.05, and we say there is no difference between those groups. Now, before we get to the last part, where we're actually going to do some t-tests, ANOVA and chi-square tests to show you how it's done, there's one more thing that you clearly have to understand. If you haven't paid attention up till now, you probably wouldn't still be listening or watching, but there's a very fundamental thing going on here: this equation that was used to construct this t-distribution has a very important assumption. I want to explain this assumption in the form of an imagined deity, whatever deity you wish to insert here. Imagine this deity looks down on planet Earth and knows that, for a certain variable, there really is no difference between these two groups.
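The two-tailed versus one-tailed arithmetic can be made concrete with a small sketch, assuming SciPy is available; the t statistic below is a hypothetical number I chose precisely so that it straddles the 0.05 threshold.

```python
from scipy import stats

# suppose a study of two groups of 30 gives this t statistic with 58 degrees
# of freedom (hypothetical numbers, chosen to land near the 0.05 boundary)
t_stat, df = -1.9, 58

# two-tailed: the difference could have gone either way, so the tail area
# is duplicated on both sides
p_two = 2 * stats.t.sf(abs(t_stat), df)

# one-tailed: only defensible if the direction was argued BEFORE the study
p_one = stats.t.sf(abs(t_stat), df)

print(round(p_two, 3), round(p_one, 3))  # roughly 0.06 versus 0.03
```

Exactly the lecturer's warning: the same data give a "non-significant" two-tailed p-value near 0.06 and a "significant" one-tailed p-value near 0.03, which is why switching tails after the fact is cheating.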
Let's just assume that that's a given fact: there is no real difference between these two groups. If I were to make my two groups by doing some experiment and measuring something in them, they would have exactly the same average; there really isn't a difference. That is the assumption which is used to draw this curve. Given that absolute knowledge that there isn't a difference, you do your experiment: you put 30 patients in one group, 30 patients in another, and you find a difference. Most of the time you'll find differences in this middle region, and it becomes less and less and less common to find differences out here, given the fact that there really exists no difference. You then find one of the rarer ones that should occur, and now we reject our null hypothesis, we accept the alternative hypothesis, and we say: we found one of the rarer differences, therefore there must really be a difference. We have a statistically significant finding, we declare that there's a difference between these groups, end of story. No. Not end of story. The statistics really didn't prove that. This is what we've decided, as human beings, to accept so that we can do research. This curve was developed under the assumption that in reality there is no difference; given that fact the curve is drawn, and given that there really isn't a difference between the two groups, the difference that you find from your tiny little sample would be one of these, some more likely to occur, some less likely. By convention we decided that if you find one of the rare ones, we will say there is a difference. But in reality, if you really think about it, we've never proven it. You cannot prove that there is a difference, because there is no graph you can draw here that is based on some underlying real difference that you can go out and discover. The mathematics underlying
this equation is based on the assumption that there is no difference. Now, it works for us, because in reality there probably is a difference, but that has nothing to do with this equation that we use to draw this line and calculate the areas under the curve. Because think about it: I might have two groups of patients, I treated one group one way and the other set of patients another way, and I look at the outcome and see a great improvement in this group and less of an improvement in that group. It clearly shows that this way of treating the patients must be better, and we are now going to treat patients this way, giving them this drug or doing this procedure; that's just the way forward, it's evidence-based, we've shown there is this difference. I found a big difference, and it was one of the rare ones, so let's go for it. You probably did discover something, but you didn't discover it from the actual mathematics that was used here. You just discovered that your value was one of the rare ones, given that there really wasn't a difference. Now, if I keep on treating my patients in this new way and they all do better, there must have been something there; I must have done something right by doing this research. But I have to be clear, in the back of my mind, that it is based on this assumption that there really isn't any difference, and that the one I found was one of the less likely ones. That is what the p-value really is. So let's just talk about William Gosset. I am not a historian, but as far as I understand, William Gosset worked for the Guinness brewing company; a brilliant statistician. He wanted to publish academically, but the brewing company wouldn't allow him to publish under his own name. They were worried about what their competitors might learn, so they
allowed him to publish academic papers under a pseudonym. I think they gave him a couple of choices, and the one he took was 'Student'. So we all know this as Student's t-test, but in reality it should perhaps be called Gosset's t-test, because his name was William Gosset. What Gosset worked on was this t-test, but what he really worked on was the problem of small samples, the sample sizes we see commonly in laboratories or when dealing with patients: maybe only 30, 40, 50, up to 100 patients in a group. These are relatively small numbers, and the aim at that time was to solve the problem of doing analysis on small samples. The aim of people like himself and Pearson, all the big names in mathematics and statistics, was to show that you can calculate this p-value, but that a small p-value is just a red flag that something might be going on here: go out, get large numbers in your dataset, very many patients or specimens, and redo your tests to see if there really is a difference. But we don't do that. We're not following the spirit of that discovery, that creation of this kind of inferential statistics. We do our test with 30 patients in a group, or 50, or 100; we find a statistic and a p-value, and off we go, very proud of ourselves. That was not the way these p-values were intended. They were only intended as red flags, showing there might be something here, go out and do a much bigger experiment, which we tend not to do. We believe these p-values as if we've discovered something profound. Now, it works most of the time, but I hope you understand it works for different reasons than we trust it to have worked out. So let me just show you, in a language like the Wolfram Language, how easy this is to do. We're going to simulate two groups of 30 patients from this distribution, and I'm going to compare their means. The mean of my first group was 101, and the mean of
my second group was 106, and I want to know: is that difference in means statistically significant? First of all I did my descriptive statistics and looked at the means. Again, that might be for some liver function test: the one group had some treatment and that was the result; the second group had a different treatment and that was its mean. Let's visualize this. What I've done here, by using this mean argument, is to store these new black lines showing the actual means; remember, the white lines are the medians, with the quartiles and the outliers beyond that. So I see this difference in means and ask myself: is there a statistically significant difference between these two patient groups? As I said, I might have treated the one group with one medicine and the other group with a placebo. I look at this graph and these descriptive statistics, and there might be something in it. But now we're going to do Student's t-test, or Gosset's t-test in other words. Given that I have 60 patients in two groups, my delta, my degrees of freedom, is 58. That's going to draw a t distribution, a nice smooth little curve, for me. It's going to take my difference, 101.745 minus 106.226, in this instance a negative value, and convert it into units of standard error by using the pooled variance (the details are not important), which then becomes a little line on my x-axis, as we had there; because the difference is negative, it is on this side. I'm using a two-tailed test, so it duplicates that line on the other side and then calculates the area under the curve for these two black areas, and that gives me a p-value. Let's have a look. In this instance the p-value is 0.18; that's larger than 0.05, so I do not reject my null hypothesis.
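The lecture runs this t-test in Mathematica; an equivalent sketch in Python with SciPy, on simulated stand-in data of my own (not the lecture's numbers), would be:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

# two simulated groups of 30 patients, e.g. a liver enzyme under two treatments
group_a = rng.normal(loc=101, scale=15, size=30)
group_b = rng.normal(loc=106, scale=15, size=30)

# Student's (Gosset's) t-test with pooled variance: 60 patients in 2 groups,
# so 58 degrees of freedom behind the scenes
t_stat, p_value = stats.ttest_ind(group_a, group_b)  # equal_var=True by default

# the variations mentioned in the lecture also exist in SciPy:
#   stats.ttest_ind(a, b, equal_var=False)  # unequal-variance (Welch) t-test
#   stats.ttest_rel(before, after)          # paired t-test

print(round(t_stat, 3), round(p_value, 3))
```

The returned p-value is the two-tailed area under the t curve beyond the t statistic and its mirror image, exactly the two black regions in the lecture's plot.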
Never accept the null hypothesis, remember; we only say we do not reject it, and therefore we say there is no statistically significant difference between these two means. That t statistic you see there, negative 1.13, is where the actual little mark goes: at negative 1.13 and, duplicated, at positive 1.13, and the two areas under the curve beyond those marks are my p-value. This is just how you convert your difference and the spread in your data into standard error; that's not important, you can have a look at the code. I do want to leave you with the knowledge that there are other t-tests. It might very well be that there's a big variance, a big standard deviation, a big spread in this data, while that data there is spread tightly around its mean. Then I can't really use the normal Gosset's t-test; I have to use a slight variation of it, the unequal-variance t-test. And I might not have independence between the two groups: these might be sets of identical twins, who are not independent of each other, or the groups might be the very same patients before and after their treatment, so that the dataset contains two groups formed by the exact same patients. Those are not independent, and then I use yet another variation, the paired t-test. Now, all of that is comparing a numerical variable between two categories: smokers versus non-smokers, treated with placebo versus treated with the active drug. The two groups are categorical, but the variable I'm comparing inside those groups is numerical, so I use my t-tests. If I want to compare more than two groups, let's add a group C here, also 30 patients; if I go down, there are 30, 30 and 30 patients in each group. Now I use what is called analysis of variance. It uses an F
statistic, and there is indeed an F distribution. I've just used Mathematica here to put my data into the correct format that the ANOVA package needs: this patient was in group one and had a value of 114, also in group one; there are my group two patients, there my group three patients. If I run the analysis of variance, I see an F ratio: my F statistic is 0.9, and that represents a p-value of 0.4, so I do not reject my null hypothesis; there is not a difference between these three groups. Now remember something very important. If you have more than two groups (three, four, whatever), you do your analysis of variance, and only if you find a statistically significant p-value can you do post hoc analysis. Then you can use Tukey's test or one of those to look at where the difference was: between groups one and three, one and two, two and three. But you cannot do that type of subgroup analysis if your initial p-value, looking at all the groups combined, was not less than 0.05; then you're cheating, and that's not right. Look out for journal articles that do this kind of subgroup analysis when the original ANOVA showed no statistically significant difference between the groups; it is something that should not be done. You can't do post hoc analysis if that p-value is not significant. Now, there are two other assumptions we haven't spoken of, and I want to end with those, something you should know about the tests we've covered up till now. All the t-tests, the analysis of variance, and even correlation, which I haven't shown you here, are called parametric tests, and they're based on quite a few assumptions, two of which are very important. The most important might be slightly difficult to understand.
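The one-way ANOVA just described can be sketched in Python too (simulated data of my own, SciPy assumed), with the post hoc rule made explicit:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# three groups of 30 patients each, all drawn from the same population,
# so the null hypothesis of "no difference" is true by construction
groups = [rng.normal(loc=100, scale=15, size=30) for _ in range(3)]

# one-way analysis of variance: an F statistic and its p-value
f_stat, p_value = stats.f_oneway(*groups)

# post hoc pairwise comparisons (e.g. Tukey's test) are only justified
# when this overall p-value is below 0.05, never the other way round
do_post_hoc = p_value < 0.05

print(round(f_stat, 2), round(p_value, 2), do_post_hoc)
```

Gating the pairwise comparisons on the overall ANOVA result, as the `do_post_hoc` flag does, is precisely the discipline the lecturer asks you to check for in journal articles.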
The variable that you are analyzing has to come from a population in which that variable is normally distributed. So if I were to compare liver enzymes between group A and group B, I had better know that the population out there (all seven billion people, or, for other reasons, whatever smaller population my samples were taken from) has a normal distribution for that liver enzyme. But the point is, I can't test all of them; I don't have the time or money. So we have to use other tools to infer that the variable under analysis comes from a population in which it is normally distributed. The first is a visual one, called a QQ plot. I'm going to simulate two groups of 100 patients each and show you their summary statistics. The medians are quite close to each other, 0.14 on the one side and 0.44 on the other, but you can see there's a very different spread of the data in this second group. Now I can do what we call a quantile-quantile plot, a QQ plot, for each of these. There's my first group: theoretically the points, all the values of, say, the liver enzyme for these hundred patients, should fall on this normal line, and you can sort of see they follow it. That supports the assumption that in the underlying population from which these hundred people were taken, the variable really is normally distributed. For the other group I cheated a bit; I created it such that I knew it wouldn't be, and this is the kind of picture you'll get from a QQ plot then: the values from your dataset of 100 patients do not follow this line at all. We are violating one of the basic assumptions of the use of Student's t-test, so I really can't use that kind of parametric test to compare these two sets of patients with each other. Now, there are even
statistical tests that very clever people have developed that will give you a p-value. So let's look at my first group of patients. There are different tests, like the Kolmogorov-Smirnov test and the Shapiro-Wilk test that I'm demonstrating here. For my first group, the one that followed this nice line, if I do this test to see whether it comes from a normal distribution, I get a p-value of more than 0.05, so I can't reject my null hypothesis. My null hypothesis is that this sample was taken from a population in which that variable was normally distributed, and I don't reject that. If I look at my second group, that's a very small p-value, so I reject the idea that this came from a population in which the underlying variable was normally distributed, and I cannot use parametric tests anymore; I cannot use Student's t-test.

So when you read someone's paper, look at how they made that decision between parametric and non-parametric tests. Parametric tests are more sensitive, more likely to pick up these small p-values than their non-parametric versions, so there is a temptation, if I'm being terribly suspicious here: doing this properly, using a non-parametric test, you might find a p-value of 0.051, and if you run the t-test on the same data it goes to 0.049, and there's such a temptation to rather use that 0.049. But that would be wrong: if you violated those assumptions, do not use the parametric tests. The non-parametric equivalent of a Student's t-test is called the Mann-Whitney U test.

Before that, I just want to show you why I love this language so much. I can do the Shapiro-Wilk test and ask the computer to give me the test conclusion, and it actually says this: the null hypothesis that the data is distributed according to the normal distribution is rejected at the 5% level based on the Shapiro-Wilk test. Such a nice language, the Wolfram Language, to give you that kind of output. The other
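The numerical version of that check can be sketched with SciPy's `shapiro` (Python again standing in for the lecture's Mathematica; both samples are simulated here, not the lecture's data). A large p-value means we cannot reject normality; a tiny one means we reject it and fall back to non-parametric tests.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
normal_data = rng.normal(size=100)       # drawn from a normal distribution
skewed_data = rng.exponential(size=100)  # deliberately non-normal

# Shapiro-Wilk: H0 = "the sample comes from a normally distributed population"
stat_normal, p_normal = stats.shapiro(normal_data)
stat_skewed, p_skewed = stats.shapiro(skewed_data)

# The skewed sample yields a far smaller p-value than the genuinely normal one,
# so for it we reject H0 and should reach for a non-parametric test
```

SciPy has no equivalent of Mathematica's sentence-style test conclusion, but the decision rule is the same: reject H0 when p < 0.05.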
assumption that you have to meet is that there are not a lot of outliers. Now, I'm never in favor of removing outliers from a data set; that patient or that specimen existed. Before you remove a patient or specimen from a data set, you really have to think about it. I've artificially created a bunch of masses, your patients' masses in kilograms, and I'm drawing this box-and-whisker plot, and I can see there are lots of outliers falling well outside the whiskers, and I might be tempted to remove some of them from the data set. A better way, perhaps, is to convert each of these values into how many standard deviations it lies away from the mean. In the Wolfram Language that's very easy: I can standardize the masses, and it will convert 107.6 kilograms into 2.27659 standard deviations away from the mean. Then I can just ask for all the values that are more than three standard deviations away from the mean, and I see I have two of them. I might be tempted to remove those two patients from my data set, but that's a difficult thing to do. So test for those two assumptions. I wouldn't remove them, but they tell me that I should not be using Student's t-test here; instead I'm going to use the Mann-Whitney U test. There's the Mann-Whitney U test: I have a U statistic of 3124 and a very, very small p-value, at least less than 0.01. So I've used the Mann-Whitney U test here instead of the Student's t-test, and that was the proper thing to do.

Let's add a third group of patients here. The non-parametric equivalent of the ANOVA test is the Kruskal-Wallis test, and it's very easy to implement: I do my Kruskal-Wallis test and I get this very small p-value, so once again a statistically significant finding there.

I want to end off with an example of comparing categorical variables; we're going to do the chi-square test. So how do you compare categorical variables? I cannot use Student's t-test or ANOVA or Mann-Whitney U or Kruskal-Wallis; I'm not dealing
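The standardize-and-threshold idea, and the two non-parametric tests just mentioned, look like this in a Python/SciPy sketch (the masses and groups below are invented for illustration; the lecture's own data isn't shown here):

```python
import numpy as np
from scipy import stats

# Hypothetical body masses in kg, with one extreme value
masses = np.array([72, 68, 81, 75, 90, 66, 107.6, 78, 74, 150,
                   69, 71, 76, 80, 73])

# Standardize: how many sample standard deviations from the mean is each value?
z_scores = (masses - masses.mean()) / masses.std(ddof=1)
outliers = masses[np.abs(z_scores) > 3]  # candidates for (careful!) removal

# Non-parametric two-group comparison: Mann-Whitney U instead of Student's t
group_a = [1.2, 1.5, 1.1, 1.8, 1.3, 1.6, 1.4, 1.7]
group_b = [2.4, 2.9, 2.2, 3.1, 2.6, 2.8, 2.5, 3.0]
u_stat, p_mw = stats.mannwhitneyu(group_a, group_b)

# Non-parametric three-or-more-group comparison: Kruskal-Wallis instead of ANOVA
group_c = [3.5, 3.9, 3.4, 4.1, 3.7, 3.8, 3.6, 4.0]
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)
```

In this fabricated data only the 150 kg value ends up beyond three standard deviations, and because the groups barely overlap, both non-parametric tests return very small p-values.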
with numbers, with numerical values; I'm dealing with categorical variables. So let's imagine my little experiment here. I've used a dichotomy just to make it easy: I've chosen a dichotomous gender here, which one shouldn't do, but let's just imagine I have female and male, and I have patients being hypertensive and non-hypertensive. I don't want to exclude anyone by using this dichotomy, but it is for ease of explanation here. So I have 13 females and quite a few more males in this study, and I have 17 non-hypertensives and quite a few more hypertensives. I create what is called a contingency table: hypertension status on the one side and gender on the other, and I want to do a chi-square test of independence. Are these two variables, gender and having hypertension, independent of each other, or is there some dependence; does being a certain gender influence being hypertensive or not? I want to see if the one influences the other, whether there is some form of dependence between these two. That is what my p-value for the chi-square test is about: is there a dependence between categorical variables?

Now, given this percentage of males and this smaller percentage of females, versus this larger percentage of hypertensives and smaller percentage of non-hypertensives, I can work out an expected table to set against my observed table from the study. I can write a line of code which gives me the expected table given those percentages: 11.1 people, which is not something you can have, 11.1 of a person, but go with it, should have been female with hypertension, and 89.8 should have been male, and 1.87 should have been non-hypertensive female, and 15 should have been... so there is my observed table, and this is what I would expect given these percentages. Now I can work out what the difference is between my observed and my expected values, and I
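A chi-square test of independence on a contingency table can be sketched like this in Python with SciPy (the observed counts below are invented for illustration; they are not the lecture's data). `chi2_contingency` derives the expected table from the row and column totals, forms the chi-square statistic from the observed-minus-expected differences, and the p-value is literally the area under the chi-square distribution to the right of that statistic:

```python
from scipy.stats import chi2, chi2_contingency

# Hypothetical observed counts: rows = female/male, columns = hypertensive/not
observed = [[12, 1],
            [71, 16]]

chi2_stat, p_value, dof, expected = chi2_contingency(observed)

# The p-value is the upper-tail area under the chi-square curve at chi2_stat;
# for a 2x2 table there is (2-1)*(2-1) = 1 degree of freedom
area = chi2.sf(chi2_stat, dof)
```

Note that `expected` is returned alongside the statistic, so you can inspect the fractional "11.1 of a person" style counts yourself, and `area` reproduces `p_value`, which is the area-under-the-curve intuition the lecture emphasizes.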
can convert that into a statistic called the chi-square statistic. Depending on the degrees of freedom, it will follow one of those weird distributions I showed you in the beginning, the chi-square distribution, and that value will fall somewhere on the x-axis. I can work out the area under the curve, and there we go: the area under the curve is 0.11. I see I cannot reject my null hypothesis, so I have to accept that these two categorical variables are independent of each other: that your gender is independent of whether you have hypertension or not. In this study I did not find a p-value less than 0.05, so there's independence between these two, no dependence. That is the chi-square test.

I hope you've enjoyed this introduction to statistics and that you have some deeper intuition about what these p-values really are: physically, an area under a curve, based on the assumption that there really isn't a difference. You know to look out for when certain tests must be used, based on the variables and based on the assumptions of normality and outliers. You know what parametric and non-parametric tests are and when they should be used. You know what the descriptive statistics are, you know how to visualize these different categories, and you understand the types of categorical variables. I hope you've enjoyed that.

If you're interested, the language I used here is the Wolfram Language, a beautiful language to code in, so easy and powerful for statistical analysis. Have a look at the courses I showed you right at the beginning. You can use Mathematica in your browser free of charge online, or you can, without spending a lot of money, install a local copy on your own machine. It's really a beautiful language to learn. I prefer doing my statistical analysis using a computer language; other languages might be Python or Julia, or even R, which I really don't like, but you can give that a go. My
preference, and the preference in my unit and in what I teach, is the Wolfram Language in Mathematica. Have a look at it, and if you want to get involved in doing your own research, it's a beautiful way to do it.