So today we're going to talk about randomness, probability, sampling, and sampling distributions. This is actually a very important notebook that we're going to work our way through. We're going to start off by rolling some dice and flipping some coins, so that you can understand this idea behind probability and randomness, and how it's eventually going to help us in data science to analyze the data that we work with.

So we're here in notebook number seven, randomness and sampling. Let's have a look at the packages that we're going to use. Remember, I use that first line of code when I work on a MacBook; in this instance I'm recording on a Windows machine, so it's not too important, but we'll run it anyway. Then my usual %load_ext and the google.colab.data_table magic command, just so that the tables print out nicely. And then things we've seen before: we're going to import numpy with the namespace abbreviation np, import the stats module from scipy, and import pandas with the namespace abbreviation pd. For plotting we're going to import graph_objects; io, just to set the plotly_white theme; plotly express; and also figure_factory, which is a different module that we import as ff. And let's set that default template to plotly_white.

So let's talk a little bit about randomness. You flip a coin, you roll a die, random numbers come up, and a computer can to some extent simulate that. As human beings we're very, very bad at this: if I asked you to give me a list of a hundred random digits from zero to nine, there's no way a human being comes up with truly random values; we form little patterns in our heads. So somehow we've got to replicate randomness on a computer, and what a computer does is use an algorithm. It can use infinitesimal little time fractions on the clock cycles of your CPU, and there are all sorts of algorithms it can use to come up with these numbers, but they're not really random. We won't go into the depths of this topic, but they are actually pseudo-random numbers, because they come from an algorithm.

So we've imported numpy as np, and we can do this seeding: numpy.random.seed. That seed function gives a starting value to the algorithm, so that when it generates random numbers and you run the cell again, or you run it on your computer, we're going to get the same set of pseudo-random numbers. We do that a lot in data science, if we're involved with teaching or if we want to share our code, because we want those results to be reproducible. You can put in any integer value; I've put in 42, and if you write this line of code and put in 42, you're going to get the same set of pseudo-random numbers as me.

Then in this random module there's also a choice function. You pass it a Python list, and this time I'm passing a list with two elements, both strings, and one of those two is going to be chosen at random. As I said, I know what the result is going to be with a seed of 42, because I've run this line of code before, and if you do exactly the same you're also going to get the same result. So let's see which value was chosen at random, and we can see that's heads: heads came out at random.
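A minimal sketch of the setup described above, assuming the package names as stated (the Colab data_table magic only works inside Colab, so it's omitted here):

```python
import numpy as np
import pandas as pd
from scipy import stats
import plotly.graph_objects as go
import plotly.io as pio
import plotly.express as px
import plotly.figure_factory as ff

pio.templates.default = "plotly_white"  # set the white plotting theme

np.random.seed(42)                              # seed the pseudo-random generator for reproducibility
result = np.random.choice(["Heads", "Tails"])   # one of the two strings chosen at random
print(result)                                   # with this seed: Heads
```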
This choice function also takes a second argument, so I've added a comma and 10, and this time it's going to draw heads or tails ten times over. I've set the random seed to 42 again, so that if we both run this line of code we get the same pseudo-random numbers, and this time it's heads, tails, heads, heads, tails, heads, heads, heads, tails. As simple as that.

So let's ramp things up a bit. This is a powerful computer; we're running this in the cloud on Google servers, quite powerful machines, so let's go nuts. This time I'm going to set the seed to None; in other words, now I'm going to get fresh pseudo-random numbers, so if I run this line of code twice I'll get different values, and if you run it you'll get different values again.

I'm going to create a computer variable called flips and assign to it a pandas DataFrame, pd.DataFrame. I'm going to generate a column using a dictionary, so you see the curly braces there: key-value pairs. My key is going to be flip, so that becomes the name of the column, and the values are numpy.random.choice, heads or tails, 10,000 times. Let's look at the first five rows with the head method: we got heads, heads, tails, tails, heads. As I say, you might get something different, and if I run it again I'll get something different too.

But it's 10,000 flips, and the random choice function gives equal likelihood to both, so let's see what happened in these 10,000. Because this is a DataFrame we can call that column as a series with flips.flip, since remember I made flip the column header, and then .value_counts. That's going to tell me the frequency of each sample-space element: we got 5,097 heads and 4,903 tails. That's really close; theoretically we'd get 50-50, but we'd have to do a lot more than 10,000 flips to get closer to that theoretical split. There's actually a term for how our actual values approach the theoretical values, and we'll get to it a little later.
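A sketch of the 10,000-flip simulation as described. With the seed set to None the generator is re-seeded from the system, so your counts will differ from mine:

```python
import numpy as np
import pandas as pd

np.random.seed(None)   # no fixed seed: fresh pseudo-random numbers every run
flips = pd.DataFrame({"flip": np.random.choice(["Heads", "Tails"], 10_000)})

print(flips.head())                # first five flips
print(flips.flip.value_counts())   # frequency of each sample-space element
```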
So that's the coin flip; now let's roll a die, a fair die. By fair we mean all six values have equal likelihood of landing face up; it's not some loaded die that falls on certain values more frequently than others. We can simulate that as well. I'm going to create another pandas DataFrame called die and assign to it a single column, again set up as a dictionary key-value pair: die is my column header, and my values are a random choice from this Python list I created up here, sides, which is just a Python list of the six values one through six, chosen 10,000 times. So within a fraction of a second we've rolled the die 10,000 times.

And we can create a bar chart. I'm going to do that with the plotly express module, px.bar. For x I use sides; remember, if I hover over there the tooltip comes up: it's one, two, three, four, five, six, the six items. On the y axis I want the count for each, but be very careful there: remember what value_counts does, it puts the most frequent one, the mode, at the top. We want to sort by the index, which is one through six, so that the first bar is the count for one, the second bar the count for two, and so on; that's why we add the .sort_index method. From that we just want the values, converted to a list. I think by now you know how to tease out a line of code like this as it goes along. I'm putting a title there, "Frequency of faces falling, 10,000 rolls", and I'm changing the labels: instead of x on the x axis I want "Face values", and instead of y I want "Frequency".

So let's have a look at what this looks like. This is what we would term a uniform distribution, and we'll talk about distributions a little later, but you can see all the values landed face up an almost equal number of times. As you'll see later, this is an empirical result, the result from our actual data; theoretically each bar should be exactly a sixth of the rolls, because each value has a one-in-six chance of landing face up.
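A sketch of the die simulation and the frequency bar chart as described:

```python
import numpy as np
import pandas as pd
import plotly.express as px

sides = [1, 2, 3, 4, 5, 6]
die = pd.DataFrame({"die": np.random.choice(sides, 10_000)})

# value_counts() sorts by frequency (mode first); sort_index() re-orders by face value
counts = die.die.value_counts().sort_index()

fig = px.bar(
    x=sides,
    y=counts.values.tolist(),
    title="Frequency of faces falling (10,000 rolls)",
    labels={"x": "Face values", "y": "Frequency"},
)
fig.show()
```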
So let's talk a little bit about probability, because it becomes very important in data science as we work, and it's not easy the first time you see it; but if you sit down after having worked through it, it starts making sense.

Probability theory is a branch of mathematics, a very important one, where we investigate random events, and it's all based on set theory. In fact the whole axiomatic system of mathematics is based on set theory, and you can take very long and complex university courses about it. We're just going to view a set as a collection of objects. We're leaning on everyday language and our communal understanding of these terms, but let's use that conceptualization: a set is a collection of elements, and those elements can be numbers or many other things. An element in a set is said to be a member of that set.

There are some terms we use that you've seen before: intersection and union. Think of Venn diagrams. The intersection is where two sets overlap; in other words, we're looking for members common to both sets. If an element occurs in set A and it occurs in set B, it lives in the intersection of those two sets. If I combine the two sets, I have the union. Note that the elements in the intersection occur in both sets, so when we form the union they would be duplicated; we delete those duplicates, because we don't want duplications in the union. The union is just the elements themselves, each counted once. That's going to become important later.

Then we have the concept of a universal set. Imagine you're doing some study, working in a lab or with clients, it doesn't matter, and you're collecting data for a specific variable. There's a range of possible values for that variable, but you're only working with, say, ten samples, so not all the possible values are going to occur in your data set. That largest set of possible values is what we usually refer to as the universal set, and our sample space comes from it, or sometimes is that universal set. In any case, if not all the values in the universal set occur in our little sample, the ones that did not occur form what we call the complement: the values that are in the universal set but did not occur in our sample.

You'll also come across terms like experiment, event, and outcome. Sometimes they're used interchangeably, but they are slightly different things. Start with an experiment: that's the thing you actively do, the thing you set up. Imagine an organism in a laboratory from which you're going to take some fluorescence score; you're actively doing something to get a result that you can jot down in your spreadsheet or database. That's your experiment, and you do it over and over again. The event is the actual value that you collect from that experiment. And then there's also an outcome, built from events, and we'll see that when we flip our fair coin twice.

Independent events mean that if I do this twice, the first one has no bearing on the second; the second one doesn't care what happened to the first. If you flip a coin twice, the coin doesn't care what the previous flip was; every flip is an independent event. But now we change our experiment: my experiment is flipping the coin twice. Each flip is an event, but my outcome is jotting down both events together. So I actually have four possible outcomes: head-head, head-tail, tail-head, or tail-tail. My outcome is now not a single event but a combination of events. If it's a fair coin, each of these four outcomes has an equal likelihood of occurring: head-head occurs 0.25, or 25 percent, of the time; head-tail 25 percent; tail-head 25 percent; and tail-tail 25 percent.

And now I can change my outcome of interest. As a researcher I state: I want to know the probability of at least one head. You see how fluid this idea of an outcome is; it's my research question. Three of the four outcomes have at least one head: the first has two heads, which is certainly at least one head; the second is head-tail; the third is tail-head. Only the last one, tail-tail, has no heads in it. So the probability of at least one head is 75 percent.
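A small enumeration of the two-flip experiment, as a sketch (this is my own illustration, not notebook code): list all four equally likely outcomes and count those containing at least one head.

```python
from itertools import product

outcomes = list(product("HT", repeat=2))        # [('H','H'), ('H','T'), ('T','H'), ('T','T')]
at_least_one_head = [o for o in outcomes if "H" in o]
print(len(at_least_one_head) / len(outcomes))   # 3/4 = 0.75
```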
And you can see the notation starting to develop here: P and then, in parentheses, "at least one head". Later on we won't use full words; we'll use a symbol to mean at least one head. So the probability of at least one head is 75 percent.

Okay, as researchers we can change our outcome. This time I'm going to flip my coin three times: my experiment is flipping a coin three times, my event is recording what came up on each flip, and I get eight possible outcomes: head-head-head, head-head-tail, head-tail-head, head-tail-tail, tail-head-head, tail-head-tail, tail-tail-head, tail-tail-tail. Now my question can again be: what is the probability of at least one head? That's my outcome of interest among all my outcomes; that's my research question. If we quickly look down the list, it's again only the last one, tail-tail-tail, that has no head, so the probability is seven out of eight; seven divided by eight is of course 0.875, so 87.5 percent. That is my theoretical probability.

So let's do an experiment. We're going to run some code here: numpy.random.choice, choosing from this numpy array with zero and one; let's call zero tails and one heads. I want three of those, and I set replace equal to True. We've got to do that because there are only two choices there; I can't select the zero the first time and the one the second time and then have nothing left. If the coin is flipped and it's a zero, we throw that zero back into the bag to be randomly selected the next time; that's sampling with replacement. If I run that, we see zero, one, zero, so that would be tail, head, tail, and if you run it again you'll get something different.

So let's do this five times, with a list comprehension; remember, that goes inside a set of square brackets. I'm saying numpy.random.choice from the array of zero and one, three times, replace equals True, for i in range five, so i goes zero, one, two, three, four, and it happens five times. Let's see at random what our coin flips were, and we can read off the five sets of three flips that came up. Fair enough; we understand how the code works.

Now let's do this many more times, keeping track of how many times there's at least one head. We can of course simply sum over each of those little arrays: if the sum is one or more, that signifies there was at least one head in there. So I'm going to seed the random number generator with the integer 2, I'm going to have a counter called count, and I set it to zero. Then I say for i in range one thousand, so we run through this for loop a thousand times. flip_3 is the sum of the three random values, and if flip_3 is greater than or equal to one, I increase my counter by one, so that I know there was at least one head in that specific set of three flips. Then I print the results to the screen. So let's run that and see how close we get to the theoretical 87.5 percent.
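A sketch of the counting experiment as described (I'm reading "larger than" as "at least one", since the sum is counting heads; with the seed fixed at 2, reruns reproduce the same count):

```python
import numpy as np

np.random.seed(2)
count = 0
for i in range(1000):
    # flip three coins: 0 = tails, 1 = heads, sampled with replacement
    flip_3 = np.sum(np.random.choice(np.array([0, 1]), 3, replace=True))
    if flip_3 >= 1:      # the sum is at least 1 only if at least one head came up
        count += 1

print("Total number of experiments:", 1000)
print("Total with at least one head:", count)
print("Empirical probability:", count / 1000)
```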
And there we go: the total number of experiments was a thousand, and the total number with at least one head, from my counter that incremented every time the sum was at least one, was 877. That gives me a probability of 0.877, so my empirical result, the actual value from my experiment, is very close to the theoretical 87.5 percent, and that's from a thousand repetitions of flipping the coin three times. So you start getting this idea of the theory, what it should look like, while our experiment is not quite there, because it really depends on the sample size, on how many times we do this.

Now, one little thing you absolutely have to know: probabilities range from zero to one. You can't have a probability of something happening minus three percent of the time; that's not how it works. And you can't have a probability of something happening 110 percent of the time. It's from zero percent to 100 percent inclusive, or in fraction terms from 0.0 to 1.0. And if I take all the possible events together, their probabilities must sum to one as well. So the probability of an event occurring plus the probability of that event not occurring must be one, which means the probability of the event not occurring is one minus the probability of it occurring. Such a simple equation, equation 1, but those facts, that probabilities lie between zero and one, that they sum to one, and that the probability of something not happening is one minus the probability of it happening, bring us to this idea of negation, and I want to use equation 1 in a little example.

So we're going to roll our fair die again and ask: what is the probability of rolling a six? Given a single roll, of course, it's one out of six. But what is the probability of at least one six in two consecutive rolls? Only one of them needs to be a six, not both. And what about the probability of at least one six if I roll three times, or four times, or more? What we're going to use here is equation 1, and we look at the event not occurring instead of the event occurring; sometimes that's the useful view. What is the probability of not a six? The symbol you see in equation 2 is mathematical notation for "not a six". The probability of not a six is one minus the probability of a six, so one minus one over six, which is five over six. That's the probability of not rolling a six on a single roll.

What we can do now is raise this to a power: if I roll twice, the probability of at least one six is one minus the probability of not a six squared, so one minus (5/6) squared, which is eleven out of 36. And the probability of at least one six in n rolls is one minus (5/6) to the power n. So let's look at the code: I create a computer variable, one minus five over six to the power n, for n in range one to eleven, so n runs from one to ten. The probability of at least one six in a single roll is one over six; it goes up to about 30 percent if you roll twice, and if you roll ten times in a row, the probability of at least one six in those ten rolls is actually 83.8 percent.

We can do a little scatter plot of that, and you can see the more rolls I have, the higher the likelihood of at least one six; by the time we get to 40 rolls it's 99.9 percent that there will be at least one six in there. So just remember: sometimes we have to use negation.
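A sketch of the negation calculation and scatter plot as described: P(at least one six in n rolls) = 1 − (5/6)^n.

```python
import plotly.express as px

sixes = [1 - (5 / 6) ** n for n in range(1, 11)]
print(sixes[0])    # 1/6 ≈ 0.167 for a single roll
print(sixes[-1])   # ≈ 0.838 for ten rolls

# scatter plot of the probability as the number of rolls grows
fig = px.scatter(
    x=list(range(1, 11)),
    y=sixes,
    labels={"x": "Number of rolls", "y": "P(at least one six)"},
)
fig.show()
```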
Okay, so that's all fun; you can play around with that code and see what you come up with. More importantly, we've got to talk about types of probability, because we're going to work with this kind of thing. The first one is unconditional probability; then we'll talk about joint probability; and then, scrolling down a little, conditional probability.

So let's talk about unconditional probability. I'm going to create a little scenario here. We have two containers, and in container number one there are some red balls, some green balls, and some blue balls, and the same goes for container number two. We're going to look at relative frequencies, and I'm setting up a little DataFrame with hard-coded values. Down the left-hand side I have container one and container two, and the cells hold relative frequencies out of all the balls in both containers: 26 percent are red balls in container one, 36 percent are green balls in container one, 18 percent are blue balls in container one, 9 percent are red balls in container two, 7 percent are green balls in container two, and only 4 percent are blue balls in container two. These are relative frequencies, so they have to sum to one, and indeed those six values sum to exactly one. No problem there.

But now I can also look at the sums for the rows: df.sum with axis equal to one. What that does is add up the column values for each of the two rows; that's what axis one means. So 80 percent of the balls are in container one and only 20 percent are in container two. If I do it with axis equal to zero it works the other way around, and we get the relative frequencies of the colors: 35 percent of the balls are red, 43 percent are green, and 22 percent are blue. So down the right margin we can write container one is 80 percent and container two is 20 percent, and along the bottom margin red is 35 percent, green is 43 percent, and blue is 22 percent. We can sum up these marginal totals because they're relative frequencies of the whole: 0.26 plus 0.09 is 0.35 for the red balls, and 0.18 plus 0.04 is 0.22 for the blue balls, and likewise along the rows. This sort of table is quite common, so it's worth spending some time on.

So now the unconditional probability: the unconditional probability of a green ball is 0.43, because it doesn't matter which container it's in. If all the balls were put together, there's a 43 percent chance it's green, a 35 percent chance it's red, and a 22 percent chance it's blue, independent of the container.
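A sketch of the two-container relative-frequency table and its marginal sums, with the values as given above (the notebook's exact construction may differ):

```python
import pandas as pd

df = pd.DataFrame(
    {"red": [0.26, 0.09], "green": [0.36, 0.07], "blue": [0.18, 0.04]},
    index=["container 1", "container 2"],
)

print(df.sum(axis=1))  # row margins: 0.80 in container 1, 0.20 in container 2
print(df.sum(axis=0))  # column margins: 0.35 red, 0.43 green, 0.22 blue
```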
Now we're going to talk about joint probability. Here we're interested in two outcomes together; they are joined: it's got to be a certain color ball from a certain container. If it's container number one and it's green, that's the intersection of the idea of it being green and it being in container one, and that's just the relative frequency in our table: green and in container one is 0.36. That's the joint probability. Both conditions have to be true: is it in container one, yes; is it green, yes; that's 36 percent.

And now comes the important one: conditional probability. That's the probability of an outcome given that something else has occurred, where that other thing has an influence on this thing occurring. It's not like flipping two coins, where what happened before is of no consequence to the second flip; now there is some consequence, which makes it infinitely more interesting to us.

So the question we're going to ask: what is the probability that a ball is from container one, given that it's a green ball? You've got the two containers, your eyes are closed, someone swaps them around, and you take one ball out at random. You open your eyes and it's a green ball, and you ask yourself: what was the probability that this came from container one versus container two? Remember, the probability that it came from container one is one minus the probability that it came from container two, but that's not quite what we're interested in here. We're interested in the probability that it came from container one given that it was green, and of course the given matters, because the relative frequencies are all over the show.

In equation 6 we see this idea of conditional probability: the probability of A given that B occurred equals the probability of A and B both occurring, divided by the probability of B. Look at the numerator there: the probability of A and B, the joint probability. So the conditional probability is the ratio of a joint probability to an unconditional probability. It's the probability that the ball is in container one and green, which we know is 0.36, divided by the probability that it's green, which depends on nothing else and was 0.43. So in the end it's 0.36 divided by 0.43, as simple as that.

Okay, so let A be the event that it's from container one, and B the event that it's green. The probability of A given B is the joint probability of A and B divided by the probability of the given, B: 0.36 divided by 0.43. If we run that code, you see the result there: 83.7 percent, the probability that the ball comes from container one. Now, container one held 80 percent of the balls, but green had a certain likelihood as well, so given that the ball is green, the probability that it came from container one has been raised a little. Remember, the probability of it being from container two, given that it's green, would then be 100 percent minus 83.7 percent.

So how do we know if something is a conditional probability, or whether two events are independent of each other? We can actually test for that: if the probability of A given B equals the probability of A, and the probability of B given A equals the probability of B, then the events were independent of each other; it did not matter what we were given.
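A sketch of the conditional-probability calculation as described, with A = "from container 1" and B = "ball is green":

```python
# P(A|B) = P(A and B) / P(B)
p_a_and_b = 0.36   # joint: green and in container 1
p_b = 0.43         # unconditional: green
p_a_given_b = p_a_and_b / p_b
print(p_a_given_b)  # ≈ 0.837, i.e. 83.7%
```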
We see in the equations there, writing it out twice: A given B is P(A and B) over P(B), and B given A is P(A and B) over P(A). So we've done this here: the probability of A given B, the joint probability divided by the probability of B, is the probability of being in container one given that the ball is green; and the probability of B given A is the probability of a green ball given that it came from container one. We save all of those as values, and then I ask: is the probability of A given B equal to the probability of A? No, it's not. Is the probability of B given A equal to the probability of B? No, it's not. So I can say these two events are dependent on each other. That's how you would test whether events are independent of each other.

So joint probability, remember, means "and": both have to be true, it's got to be green and it must be from container one. But what if it's "or"? Here we're talking about the union. Think of the union of two sets: an element is in one set, or it's in the other set, or it's in the intersection between the two. That's what union means. So what we're talking about here is the probability of A or B. Think about a Venn diagram, those two circles intersecting somewhere: the union is the probability that it's in the one plus the probability that it's in the other, "or", I mustn't say "and", that might be confusing. But remember what we said before: the things in the intersection get counted twice, so we'd better subtract the intersection, the probability of A and B. So the probability of A or B is the probability of A plus the probability of B, minus the joint probability.

So for the probability of it being red, we sum over all the reds; for the probability of it being in container two, we sum over all of container two; and the joint probability of red and container two is 0.09. Sum the first two and subtract the intersection, and we get 0.46. This time I didn't fix the color and the container together; I asked: what is the probability that the ball is red or comes from container two? That's when we use this idea of the Venn diagram: the union minus the intersection.

Then the second-to-last one: what is the probability that our random ball is both red and in container two? You can see the equations there, but it's just simple algebra on what we had before: P(A and B) equals P(A given B) times P(B), and if I divide that through by P(B), I'm back at equation 6, the probability of A given B as P(A and B) over P(B). And if you run through the code, what you'll see is that it's just the relative frequency, isn't it? It must be both red and in container two, and if we go back to our little table, that's exactly what it was: the joint probability is nothing other than the relative frequency.
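A sketch of the independence check and the union rule as described (values from the table above):

```python
p_a = 0.80              # P(container 1)
p_b = 0.43              # P(green)
p_a_and_b = 0.36        # P(container 1 and green)

p_a_given_b = p_a_and_b / p_b   # ≈ 0.837
p_b_given_a = p_a_and_b / p_a   # = 0.45

print(p_a_given_b == p_a)       # False: the events are dependent
print(p_b_given_a == p_b)       # False

# union: P(red or container 2) = P(red) + P(container 2) - P(red and container 2)
print(0.35 + 0.20 - 0.09)       # ≈ 0.46 (bar floating-point noise)
```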
And again we have this idea of the complement: if something doesn't occur, its probability is one minus the probability of it occurring, given some universal set.

So let's work through a very interesting example here, and I wonder if you'll get this one right. If we were to place some bets... not that I'm a betting person at all, or condone it in any way, but it's fun nonetheless to think about what you would select as a solution. Imagine we have three cards: one is red on both sides, one is white on both sides, and the third is red on one side and white on the other. So: red-red, white-white, and red-white. Now we shuffle them under the table, a card is chosen at random, and you get to see one side, and that side is red. I ask you: what is the probability that the other side is red? Don't read any further; just think about it. Red-red, white-white, red-white; I pick one and show you one side, you don't know what the other side is, and the front side you see is red. What is the chance that the other side is red?

Well, let's think it through with what we know about probabilities. What are all the different outcomes we could have from these cards? For the red-red card, I could draw it and show you red side number one, with red at the back, or I could draw it and show you red side number two, again with red at the back: that's two red-red outcomes, because there are two red sides. For the red-white card, I could pick it up and show you the red side with white at the back, or show you the white side with red at the back. And the same argument for the white-white card: I could show you the one white side with white at the back, or the other white side with white at the back. So there are actually six outcomes.

Now think about it: if I show you a red front, that's one of the first three outcomes, and look at what's at the back of each: red, red, or white. So the probability that the other side is also red is two-thirds, and I think many people would have said it's a half, but it's not; it's two-thirds. That's just a little teaser to mess with your mind, and congratulations if you got two-thirds the first time round. The lesson: always come up with all the possible outcomes. That's what it's all about.

So now I want to talk to you about random variables. A random variable is a function, a mathematical function, that maps the outcome of an experiment to a number; in our spreadsheet we're just capturing a number. And by "number" I'm using a loose term: I can encode a categorical variable as a number, so it doesn't matter what the variable is; let's just call anything I capture a number. It's a very abstract definition: I'm mapping the outcome of my experiment to a number. So imagine that I'm rolling a pair of dice and I sum up the values that land face up.
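A quick Monte Carlo check of the three-card puzzle (my own sketch, not notebook code): among draws where the visible side is red, how often is the hidden side red too? The enumeration above says two-thirds.

```python
import numpy as np

np.random.seed(42)
cards = [("red", "red"), ("white", "white"), ("red", "white")]

front_red = 0
both_red = 0
for _ in range(100_000):
    card = cards[np.random.randint(0, 3)]   # pick a card at random
    side = np.random.randint(0, 2)          # pick which side faces you
    front, back = card[side], card[1 - side]
    if front == "red":
        front_red += 1
        if back == "red":
            both_red += 1

print(both_red / front_red)  # ≈ 0.667, i.e. two-thirds
```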
Let's enumerate the outcomes. Both dice can show a one, and that sums to two. I can have a one and a two, that's three, or a two and a one, that's three. I can roll a two and a two, a one and a three, or a three and a one, all summing to four. For a five I can roll a two and a three, a three and a two, a one and a four, or a four and a one. Have a look at that: there's only one way to get a two, and no way at all to get a one, since you're rolling two dice. And right at the other end of the spectrum, there's only one way to get a twelve: a six and a six. But for a three there are two ways with these two dice: one can be a one and the other a two, or the first can be a two and the second a one. For a four there are three ways: two and two, one and three, three and one. There are four ways to get a five, five ways to get a six, and six ways to get a seven. Then it drops down again: fewer ways to get an eight, fewer for a nine, fewer for a ten, two for an eleven, one for a twelve. So seven is the most common total; there's a pattern to this random variable of ours.

So let's write it down. We would say our random variable is X. It gets a bit confusing: X is just going to be our column header, that becomes our variable, and the values that we jot down are values of the random variable. I try to stick to calling the column header in our spreadsheet the variable name, with data point values underneath, but each one of those values, technically, comes from a random variable, because it maps an outcome to a number.

So the probability that X equals two, that our random variable equals two, is one over 36, because there are six times six, 36, possible outcomes, and only one way to get a two. For three there are two ways, and up it goes to seven with six ways, then down again to one way to get a twelve at the very end. You can run that little list comprehension there and see the probabilities: the highest probability was for a seven, at 16.67 percent.

So let's plot that as a little bar plot, remembering these are discrete values, and you can see this nice distribution: seven had the highest probability and occurred most often. And now we can start thinking about probability in a new way, graphically, because I can ask: what is the probability of my random variable being ten or more? This is how we write it: P(X >= 10). Well, I just sum up the bars: 0.0833 for ten, plus 0.0556 for eleven, plus 0.0278 for twelve. I'm adding the areas of those rectangles, and I want you to start thinking about it that way. So I can now say: my outcome of interest is a ten or more, and if I roll two dice, the probability that the sum of the face values is ten or more is 16.67 percent. And with that I want to return to this idea of a theoretical distribution versus an empirical distribution.
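A sketch of the theoretical distribution of the two-dice sum: enumerate all 36 equally likely outcomes, count the ways to reach each total, and plot (my own enumeration of the counting argument above):

```python
from itertools import product
import plotly.express as px

ways = {}
for a, b in product(range(1, 7), repeat=2):
    ways[a + b] = ways.get(a + b, 0) + 1

probs = [ways[x] / 36 for x in range(2, 13)]
print(probs)  # 1/36 for a 2, up to 6/36 ≈ 0.1667 for a 7, back down to 1/36 for a 12

# P(X >= 10) = (3 + 2 + 1) / 36
print(sum(ways[x] for x in (10, 11, 12)) / 36)  # ≈ 0.1667

fig = px.bar(x=list(range(2, 13)), y=probs,
             labels={"x": "Sum of two dice", "y": "Probability"})
fig.show()
```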
So that was theoretical: we worked it out by pure counting. Now let's roll some actual dice, or at least simulate them, and this is going to give us an empirical distribution to set against the theoretical distribution we just saw. I'm going to seed the pseudo-random number generator with the integer 3. Then I say roll_totals and I create a DataFrame object with one column, set up as a Python dictionary with a key-value pair: my key is the column header, roll total, and I do a list comprehension 10,000 times. I draw from np.random.randint, and randint is almost like choice: I give it the lowest value and the highest value, a low of one and a high of seven. Remember, this is Python, so the high value is not included; it actually chooses from one, two, three, four, five, six. It's just a different way of doing the random choice, using a random integer between one and seven with the seven excluded. I choose two of them and pass that to the sum function, summing the two, so I'm simulating a roll of two dice, and I do that for i in range 10,000, so 10,000 times.

So let's look at those sum totals: the first time I rolled a four, then a six, then a two, then a seven, then a ten; you get the idea. Let's do value_counts to see how close we got, and lo and behold, seven was most likely at almost 17 percent. So, you know, not too bad. But let's sort by the index rather than by the mode-first default, using the .sort_index method, so it runs from two, three, four, five, up to twelve. You can see the idea: 1,699 of my 10,000 rolls came to seven, and it was very unlikely to roll a two or a twelve.

So let's create a bar chart of that. This is my empirical distribution, and it looks very much like my theoretical distribution. And again I can ask: what was the probability of a ten, eleven, or twelve? I just sum them up: there were 854 tens, 559 elevens, and 299 twelves; sum those and divide by 10,000, and that's the empirical probability of a ten or more in the experiment I set up here. Good.
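A sketch of the empirical two-dice simulation as described; note randint's high value is exclusive, so randint(1, 7, 2) draws two faces from 1 through 6:

```python
import numpy as np
import pandas as pd

np.random.seed(3)
roll_totals = pd.DataFrame(
    {"roll total": [np.sum(np.random.randint(1, 7, 2)) for i in range(10_000)]}
)

counts = roll_totals["roll total"].value_counts().sort_index()
print(counts)

# empirical P(total >= 10): sum the counts for 10, 11 and 12 over 10,000
print(counts.loc[10:].sum() / 10_000)
```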
So let's talk about distributions, then: this pattern we get from our data, how many times each value occurred, the frequency, or then the relative frequency. Imagine I set up a normal distribution. Without explaining explicitly what a normal distribution is, I think you have a fair understanding of it already; most people intuitively know this bell-shaped curve. Near the middle, near the mean or average, values are more likely, and as you move away, in this symmetric bell shape, towards the smaller side and the bigger side, those values become less likely. So there's the idea of a mean and a standard deviation.

We have this norm.rvs function in the stats module of scipy: stats.norm.rvs. It works very much like numpy.random, just slightly more powerful; there are a few more things you can do, but it's really nothing other than the equivalent of one of the numpy.random functions. I can set loc, the location, which is the keyword for the mean; I'm setting a mean of 160. scale is the keyword for standard deviation, so a standard deviation of 10. I want 200 values, please. And I'm setting a random_state, which is the same as setting numpy.random.seed; here we have an argument in the rvs function for the random seed. I assign all that to the computer variable height. What I'm imagining here is 200 people whose heights have a mean of 160 and a standard deviation of 10: from that distribution, please draw me 200 random values. And there we have it.

So let's draw a histogram. Height is a continuous numerical variable, so we're not interested in a bar chart anymore; we now want a histogram. And there's our histogram: you can see plotly decided to use bins in steps of five centimeters, 150 to 154.99 and so on, and the count of that first bin made up 0.135 of all the values. Go to the top there, between 160 and 164.999: it made up 22 percent of the total. So this is a nice little histogram, and the reason we see proportions is that I set histnorm equal to probability, so we don't get the raw counts, we get the relative frequency.

I also want to show you this other little plot from figure_factory; I don't think it's going to be around that much longer, as plotly is changing it. We use create_distplot, and it does exactly the same thing, the histogram, but it also draws this nice little curve, a kernel density estimate. It gives us the smooth bell-shaped curve instead of the blockiness of the histogram, of course. It's trying to tell us the same thing, except remember that the histogram is an artificial binning of our values, which are actually a continuous numerical variable. In the end this smooth curve is really what we want to deal with, and we're going to see a lot of these smooth curves and come to understand exactly what they mean.

What we're trying to get to here is what I said right in the beginning: there's an empirical distribution and a theoretical distribution, and there's this law of averages, which states that our empirical distribution will start approximating our theoretical distribution as the sample size gets bigger and bigger. That's something most of us in research have heard a million times, and we sort of understand it: the larger the sample size, the closer we get to the actual values. So, this idea of the law of averages.

So let's look at this at play a little bit. What we're going to do here at the end is something very interesting: I'm going to simulate a population of 10,000 subjects. Imagine working in a lab, or with some economic data, or human data, or astronomical data, it really doesn't matter: I have 10,000 subjects, I measure this random variable in every one of them, and we know its value in each and every one.
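A sketch of the simulated heights as described: 200 draws from a normal distribution with mean (loc) 160 and standard deviation (scale) 10, then the probability-normalized histogram and the figure_factory version with a kernel density estimate:

```python
from scipy import stats
import plotly.express as px
import plotly.figure_factory as ff

height = stats.norm.rvs(loc=160, scale=10, size=200, random_state=42)

# histogram of relative frequencies rather than raw counts
fig = px.histogram(x=height, histnorm="probability", labels={"x": "Height"})
fig.show()

# figure_factory version: histogram plus a kernel density estimate curve
fig2 = ff.create_distplot([height], group_labels=["height"], show_rug=False)
fig2.show()
```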
Just imagine we know the value in each and every one of those 10,000 subjects, and we're going to make as if it comes from what we call a chi-square distribution with three degrees of freedom; don't worry about that at all. So let's set that up: that's my population, a certain variable measured in all 10,000. Hardly ever do we know this in real life, but this is a computer; we can simulate these things. So numpy.random.seed with 42, and then I use the round function, rounding to one decimal place just to make things easy, around numpy.random.chisquare with its parameter of three, the three degrees of freedom, and I want 10,000 of those. As I said, don't worry about what it means; that's not important, but I want to show you what it looks like. So here's a histogram of this variable for all 10,000, and you can see what a chi-square distribution is: it rises to a maximum early on and then has a long tail on the right-hand side. So we know what this looks like for the actual whole population.

But now, as researchers, we can't get 10,000 subjects into our study; we select only 30 of those individuals at random, 30 subjects from our whole population of stars or whatever it is. We take only 30 of them, and let's look at a histogram of our 30. We can sort of imagine it forms a chi-square distribution to some extent, but from a sample of 30 we're not quite sure. So what we do now is increase that sample size: I expand my study and select 100 samples. So let's do that, 100 samples, and now it looks almost like some other distribution, but we can sort of start seeing the chi-square. So let's ramp it up to 10,000, and now we see the chi-square distribution really coming out: my empirical distribution is starting to resemble my theoretical distribution. Now, it wasn't truly theoretical, because my whole population was itself only a sample from something theoretical, but I think you get the point: the larger the sample size, the more it's going to approximate the population from which those values were taken.

And now for something really interesting. This is going to be one of the most important topics in data science: sampling distributions. So let's stick with the population we had, and we're going to take a sample of 30 from our whole population of 10,000 subjects, and we're going to calculate the mean of our tiny little sample. Have a look at this. We went out, we planned our research, we took it to an ethics committee if that's what's required, or we submitted it some way for approval, and we got a grant to do it. And we're going off now, and all we can afford, because it's very expensive, either in time or human resources or finances, is to take 30 subjects from that population of 10,000. We have this idea of mapping our events to values, our random variables; in other words, we collect the value of this variable for our sample of 30, and we get a mean of 3.183. That's what we're going to report on; that's what we have. But is it representative of the actual population? Well, presumably we took this at random; there was no bias in how we selected.
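A sketch of the simulated population and the growing samples as described: 10,000 values from a chi-square distribution with three degrees of freedom, then histograms for ever-larger random samples from it.

```python
import numpy as np
import plotly.express as px

np.random.seed(42)
population = np.round(np.random.chisquare(3, 10_000), 1)

# histograms for ever-larger random samples drawn from that population
for n in (30, 100, 10_000):
    sample = np.random.choice(population, n)
    px.histogram(x=sample, title=f"Sample of size {n}").show()
```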
But, you know, we're not quite sure, and we need to do some statistical analysis to express how certain we are that 3.183 really reflects the mean of our population. Now, I want you to imagine you have infinite riches, so you don't do your study only once; you do it 50 times over. Every week someone gives you lots of time, lots of resources, lots of money, and you do the experiment again, and the next week again. Every time you take a sample of 30 at random; some subjects might be the same ones, some might be different ones. You do this 50 times over, and every time you select your 30, you calculate the mean of those 30. Now you're going to have a bunch of means; you're actually going to have a distribution of means, and that's what we call a sampling distribution.

The sampling distribution is this pattern of a statistic. Remember, a statistic is something we calculate: when we did summary or descriptive statistics, a statistic was a value calculated from a sample. A value calculated from a population is a parameter; so the mean of a population, that's a parameter, but a point estimate or measure of dispersion calculated from a sample is called a statistic. So if you have a bunch of statistics, they form a distribution, and that's exactly what we're going to imagine we have the power to build here.

So I'm going to create this computer variable mean_50 using a list comprehension. Let's see what it does: it calculates the mean of a random choice of 30 from my population, and it does that 50 times, for i in range 50. So I can now imagine, with all the riches in the world, doing my research 50 times over. And let's see: remember, when I did it once my mean was 3.183; now every repeat it's going to be slightly different, because different members of the population end up in the sample. So let's look at what the histogram looks like when I do this 50 times over. Now, that looks a bit strange, because this distribution does not look like a chi-square distribution anymore. The earlier plots were empirical distributions of the actual values; this is a sampling distribution of a statistic, not of the actual values.

So let's ramp that up. Now I'm really rich, with oodles of time and resources, and I'm going to do this a thousand times over; I imagine I can do my exact same research a thousand times, every time choosing 30. And have a look at what happens here: this is starting to look like a normal distribution. A sampling distribution of the means. You can build the sampling distribution of any statistic, the standard deviation, the variance, whatever you like, and there's going to be a pattern to that as well. And that is eventually going to allow us to do some data science, to understand the story that's hidden in our data. Our result, because we can only do our research once, is going to be one of many possible outcomes, and the sampling distribution is going to tell us whether the outcome we found was a rare event or a very common event. And that is what we're building towards.
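A sketch of the sampling-distribution experiment as described, continuing from the population sketch above (it assumes the `population` array already exists): repeat the "study" many times, each time sampling 30 subjects and recording the sample mean.

```python
import numpy as np
import plotly.express as px

# 50 repeats of the study: each one is the mean of a random sample of 30
mean_50 = [np.mean(np.random.choice(population, 30)) for i in range(50)]
px.histogram(x=mean_50, title="50 sample means").show()

# with a thousand repeats the distribution of means starts to look normal,
# even though the underlying population is chi-square (skewed)
mean_1000 = [np.mean(np.random.choice(population, 30)) for i in range(1000)]
px.histogram(x=mean_1000, title="1,000 sample means").show()
```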