The connection between probability and area. Here we go, our first real statistics. Now, you're going to go to a statistics course and they're going to tell you all about probability. It's very important, but most of it you don't really need. They're going to show you equations for the p-value and, to let the cat out of the bag, the p in p-value means probability. You don't need all of that. What you do need, though, is a deep understanding of what it really means and where it comes from. We all read p-values, but what is a p-value really? Before you find out what it really is, people think it is this magical thing: you just do a little sum and boom, out pops the answer. It is much deeper than that, and it's actually very exciting stuff. Let's get going.

As usual, I'm going to import the style sheet; don't worry about that, it just sets up the environment I want. Then I'm going to import NumPy, numerical Python, because I want to use a bit of NumPy, with the usual abbreviation np. I'm also going to import my old friend matplotlib so I can do some plotting, and seaborn to make it look nicer, plus my usual filterwarnings and %matplotlib inline. I'm using the magic there so that the plots render right on this web page, and I'm going to ignore those ugly pink warning boxes. There we go.

You've got to understand a little bit of probability because, as I said, that is what the p-value is. And let me let another cat out of the bag: it's geometrical area. That's all it is. Let's roll two dice. It's the only probability we need to do in this course. So I've got one die and a second die, both six-sided, numbered one to six. I'm going to roll the two of them and see what is possible. What are all the possible outcomes? I can roll a one on the first die and a one on the second die; if I add those I get two. I can roll a one and a two.
Add those and I get a three. I roll a one and a three and I get a four. So I can get a three with a one and a two, but I can roll a two and a one as well, and that gets me three as well. Those are two different rolls, though: the first die showed a one before and now it shows a two. And there are all the possible outcomes. I haven't repeated anything and I haven't left anything out. There's no other outcome, and you can count them all up: there are 36 possible outcomes. Because I don't repeat any and I've included all of them, we call that mutually exclusive and collectively exhaustive. That's my whole universe as far as these two dice are concerned.

So look at it. You can throw a two in only one way, a one and a one. You can throw a three in two ways, a one and a two, and a two and a one. I can hit four in three ways: a one and a three, a three and a one, and a two and a two. Et cetera. And you'll see you're most likely to get a seven, because there are six ways to get a seven. I can work out a probability for that. The 36 are all the possible outcomes; it's collectively exhaustive, there aren't 37 ways, only 36, and six of them end up being a seven. Six divided by 36 simplifies to one over six, which is 16.67 percent or, as a decimal, 0.167. So my probability of throwing a seven when I roll two dice is 0.167. Beautiful.

It's really difficult for a human being, though, as I said before, to look at all these numbers and make any kind of sense of them, so let's plot them. I'm going to create a space in memory, a bucket, and call it dice_outcomes. Instead of making a list I'm going to make a NumPy array. I could have made a list as well. But I'm not going to import it into pandas or anything.
So I'm just going to call np.array, open and close my parentheses as normal, then square brackets, and list all of them: two, three, four, five, six, seven, then three, four, five, and so on. That's all 36 possible outcomes that we can get. Let's run that, and then let's plot them. This is going to be our first plot. First let me go to Cell, All Output, Toggle, so the output is all hidden before we reveal the secrets.

Now we're going to learn a bit about matplotlib.pyplot. Remember I used the abbreviation plt, so I say plt.figure. That gets Python ready to draw a figure for me. It takes many arguments and I'm going to use one of them, figsize, which sets the size of the figure. I'm going to use eight comma six: eight units wide, six units high. What do I want to plot? A histogram, and the code for a histogram is plt.hist. There are two arguments here. dice_outcomes is what I want to plot, and then there's bins. Remember, I can only throw a two, a three, a four, and so on up to twelve. There are only eleven possible totals, so that's why I say bins equals eleven: make eleven bins for me.

Then, to neaten things up, plt.title, with some text in there; remember, text always goes in quotation marks: outcome of rolling two dice. xlabel is the label on the x-axis. I want to call it outcome values: two or three or four, up to twelve. And the ylabel is number of occurrences: how many times does each total occur? That is all I need to draw my plot, and at the end I show the plot with plt.show. I leave its arguments empty and put a semicolon there. If I leave the semicolon out, it'll write some text before actually drawing the plot, so I like to put the semicolon there.
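Before running the plot, it may help to see where the 36 values come from. The following is only a sketch using the standard library rather than the notebook's own cell: the name `dice_outcomes` follows the lecture, while `rolls` and `ways` are names assumed here for illustration. It builds every roll, sums each pair, and prints a rough text version of the bars that the histogram would draw.

```python
from collections import Counter
from itertools import product

# Every ordered pair (first die, second die): 6 * 6 = 36 rolls.
rolls = list(product(range(1, 7), repeat=2))

# The 36 totals, in the spirit of the dice_outcomes array from the lecture.
dice_outcomes = [first + second for first, second in rolls]

# Number of ways each total can occur; these are the histogram bar heights.
ways = Counter(dice_outcomes)

# A rough text histogram: one '#' per way of reaching that total.
for total in sorted(ways):
    print(f"{total:>2} | {'#' * ways[total]} ({ways[total]})")
```

The same list could be handed to np.array and then to plt.hist with 11 bins, as described above; the counts per total would be the same.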
Let's run that, and there we have it. There's our title, there's our y label, there's our x label. And look at that: there was only one way to throw a two, there were two ways to throw a three, there's the four, with three ways to get a four, and so on all the way up. There were six ways to hit a seven.

Now let's represent this as a fraction. The bar height here is the count, six, but I want to express it as six divided by 36. For that I'm going to use another plt.figure: come on, IPython notebook, get ready to draw a figure for me. plt.figure with figsize again, eight comma six. You can omit that, because if you use sns it will do all of that automatically for you, but I've done it there. And remember our old friend the distplot, the distribution plot. I call it with dice_outcomes and 11 bins, and the reason I want to do that is this: look at the y-axis now. It has changed. Forget the kernel density estimate here for a moment and just look at the complete re-representation of our histogram, now expressed as fractions. So your chance of hitting a seven, there's the seven, is about 0.167, almost 0.17, which would be the line there. So that gives me probability.

Now think about that for me. The outcomes, the variables we're dealing with here: are they continuous? No, they're not. You can't throw a seven and a half. You can't throw a 7.456. You can throw a seven, a six, a five. So the values are discrete. And because they are discrete, we're going to say the width of this little rectangle here is one, one unit. Now, you might be dealing with other discrete values that do have decimals or are much larger. So it's not because we jump from six to seven, a difference of one, that we say the width is one; it's purely because the values are discrete. Now, what is the area of a rectangle?
Well, it's width times height. This height was 0.167, and times the width, which is one, that gives us the area of this rectangle. All I'm doing now is equating area to probability. If I were to sum up all of these areas, they're going to equal one. That's why I said the outcomes are mutually exclusive and collectively exhaustive; collectively exhaustive means there's no other outcome possible. So if I add all of this up, it's going to equal one. If I break it up, the area of each little rectangle tells me the probability of getting that outcome. The probability of throwing a seven is the area of this rectangle. It is as simple as that.

That is what a p-value is. To calculate a p-value, a graph is drawn and a geometrical area is worked out. Remember the area of a circle, pi r squared; the area of a triangle, half the base times the perpendicular height from that base; the area of a rectangle, width times height. That's all it is, geometrical area, and that gives you probability. So I can ask: what was the probability of throwing more than 11? The only thing more than 11 is 12, so I can work out the p-value as the probability of having thrown a 12. A p-value is nothing other than area. Of course, here it's easy. I'm dealing with discrete variables, and with discrete variables I'm always going to have a base width of one, so it's very simple to do. I've shown you there the probability of rolling a two: that was one out of 36. That's 0.028.
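The area arithmetic just described can be checked in a few lines. This is a minimal sketch, not the notebook's code: `BAR_WIDTH`, `area` and the other names are assumptions made for illustration, and exact fractions are used so that the total comes out as exactly one.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Count the ways to reach each total with two six-sided dice.
ways = Counter(a + b for a, b in product(range(1, 7), repeat=2))
n_outcomes = sum(ways.values())  # 36 possible rolls

BAR_WIDTH = 1  # the lecture's convention: each discrete bar is one unit wide

# Bar height is the fraction of all rolls; area = height * width.
area = {total: Fraction(count, n_outcomes) * BAR_WIDTH
        for total, count in ways.items()}

total_area = sum(area.values())                              # all bars together
p_seven = area[7]                                            # one bar
p_more_than_11 = sum(a for t, a in area.items() if t > 11)   # only the 12 bar

print(total_area)                        # 1
print(round(float(p_seven), 3))          # 0.167
print(round(float(p_more_than_11), 3))   # 0.028
```

Summing the bars to the right of a chosen total is the discrete version of working out an area "from that point outwards".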
So that's quite significant: a one and a one comes up less than five percent of the time. Just to let you know, a p-value is nothing other than area. What the clever mathematics does behind the scenes is draw a curve and work out the probability from it.

We have an immediate problem now, because we've got to step up to continuous variables. A continuous variable, by definition, you can subdivide and subdivide ad infinitum, into smaller and smaller fractions. So what are you going to make the base width of your little rectangle? It becomes impossible, because the width really tends towards zero, and you can't have a width of zero because then you have zero area. So with continuous variables we have a bit of a problem: you can't just state the probability of one certain value.

Now forget this histogram for a moment, please, and just concentrate on the kernel density estimate. See what a beautiful normal distribution we have. Say, for instance, this wasn't dice; it was just a representation of possible outcomes. There is no longer a tiny little rectangle underneath each point, because if you give me a tiny little rectangle on a continuous variable, I can immediately divide it in two or three or four. So the only thing you can do is calculate the area under this curve between two values, or from a value upwards, or from a value downwards. Let me show you.

Let's look at this graph, for instance. There we go. You'll remember this one from school: it's just an upside-down parabola. There was a way to calculate the area under the curve between two and ten. You can't use simple geometry, because what is the height? The height there is different from the height there, which is different from the height there. You could use calculus.
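As a rough illustration of what that calculus amounts to, the sketch below slices the region between two and ten into many thin rectangles and adds up their areas. The curve `12x - x^2` is a hypothetical upside-down parabola chosen for this example; the lecture does not say which parabola is on the slide, and the function names are likewise assumptions.

```python
def curve(x):
    # Hypothetical upside-down parabola, NOT the lecture's actual curve.
    return 12 * x - x ** 2


def area_under(f, lower, upper, n_rectangles=1000):
    """Approximate the area under f between lower and upper.

    The interval is cut into thin rectangles; each one takes its height
    from the curve at the middle of its base.
    """
    width = (upper - lower) / n_rectangles
    total = 0.0
    for i in range(n_rectangles):
        midpoint = lower + (i + 0.5) * width
        total += f(midpoint) * width
    return total


approx = area_under(curve, 2, 10)

# Exact answer from the antiderivative 6x^2 - x^3/3, for comparison.
exact = (6 * 10 ** 2 - 10 ** 3 / 3) - (6 * 2 ** 2 - 2 ** 3 / 3)

print(approx, exact)
```

The thinner the rectangles, the closer the sum gets to the exact area, which is the idea behind the integral.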
It's as simple as that. Now, you don't have to do any calculus at all. What it boils down to is the central limit theorem, and we're going to have a lecture on the central limit theorem. The central limit theorem basically suggests the following: all the possible outcomes of all the possible studies are going to be normally distributed, like this. So for any study you do, if you're comparing two means, the computer in the background is going to draw this curve. It places the result you found somewhere along it, works out the area under the curve from that point outwards, and then tells you whether something is statistically significant or not.

Let's have a look at this one. There are two things going on here. In this first one I've colored in, not to scale, five percent. I've chosen 0.05 as being very rare, but I'm dividing it between the two sides, 0.025 on each. So that represents 2.5 percent of the area under the curve, and that represents 2.5 percent of the area under the curve. Very cleverly, this curve will always have a total area under the curve of 100 percent, or one, so each of these will be 0.025 of the area under the curve.

So, in a very clever way, it's going to take your data and draw this curve, and I'll show you in the central limit theorem lecture where this curve comes from. Then it's going to say, well, this was your finding, there, and it'll draw a little line and say, well, the area under the curve, this green bit, is more than 0.05. Therefore your result was not significant.

Now, very quickly, you can see that here I've split my five percent into two, and here I've put all of it on the one side. That's called a two-tailed and a one-tailed hypothesis. We'll get to that. But think about it again: with a continuous variable, white cell count, Hb, whatever you measure, you cannot have these nice little boxes.
You're going to have this type of graph. It'll always be normally distributed. Well, I mustn't say that; I'll say it in quotation marks. No matter how skewed your data, it forms part of a larger set which is always going to be normally distributed by way of the central limit theorem. We'll get to that. The computer draws this graph and works out specifically where the cut-offs are that would represent five percent of the area under the curve. It'll then take your specific data, draw a line there, and work out what the area under the curve would be from there on out. Or, if it was on this side, from there on out. Or if it found it there, from there on out. Of course, if your value fell there, you'd have a small area under the curve, less than 0.05 of the area under the curve, so your p-value is going to be significant.

So there we really have it. A p-value is nothing other than area: the area of a rectangle there, the area under the curve here, always set up so that all the possible outcomes represent 100 percent of the area, or 1.0. Yours is going to fall somewhere, and that gives you the probability. You don't have to understand how that is done at all. You write a few lines of code and the answer pops out for you. But it is nothing other than geometrical area, the area under the curve.

Remember, with continuous data I can't draw a little rectangle. I can only go from one certain point outwards. If it fell on this side of our little hump, it would be worked out from here outwards, the area there. That's just the way it works. You cannot, with continuous data, take one little block from there to there and work out its area; it's got to be from your point outwards, on this side or on that side. And I can lump all my 5% on one side, called the one-sided hypothesis, or, as is the norm, I can divide my 5% into two sides, so 2.5% on this side and 2.5% on that side. Lovely stuff.
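To make the "from your point outwards" idea concrete, here is a small sketch using Python's standard library `statistics.NormalDist`. It assumes, purely for illustration, that the finding has already been expressed as a z-score on a standard normal curve; real tests often use other curves, and the function names and the value 2.3 are hypothetical.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)  # total area under this curve is 1
ALPHA = 0.05                                 # the 5 percent chosen as "very rare"


def one_tailed_p(z):
    """Area under the curve from z outwards, on the side where z fell."""
    return 1 - standard_normal.cdf(abs(z))


def two_tailed_p(z):
    """Same area, counted on both sides of the hump."""
    return 2 * one_tailed_p(z)


# Cut-offs that leave 5 percent of the area outside them.
two_tailed_cutoff = standard_normal.inv_cdf(1 - ALPHA / 2)  # 2.5% on each side
one_tailed_cutoff = standard_normal.inv_cdf(1 - ALPHA)      # all 5% on one side

z = 2.3  # hypothetical finding, for illustration only
print(round(two_tailed_cutoff, 2), round(one_tailed_cutoff, 2))
print(two_tailed_p(z), two_tailed_p(z) < ALPHA)
```

A finding far out on either side leaves only a small area beyond it, so its p-value falls below 0.05; a finding near the middle of the hump leaves a large area and is not significant.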