Statistics and Excel: bell curve, people's weight data example. Got data? Let's get stuck into it with statistics and Excel. First, a word from our sponsor. Actually, we're sponsoring ourselves on this one, because apparently the merchandisers don't want to be seen with us, but that's okay, because our merchandise is better than their stuff anyway, like our Accounting Rocks product line. If you're not crunching numbers using Excel, you're doing it wrong. A must-have product, because as everyone knows, accounting is one of the highest forms of artistic expression, which means accountants have an obligation, a duty, to share the tools necessary to properly channel the creative muse, and the muse rarely speaks more clearly than through the beautiful symmetry of spreadsheets. So get the shirt, because the creative muse could use a new pair of shoes. If you would like a commercial-free experience, consider subscribing to our website at accountinginstruction.com or accountinginstruction.thinkific.com. You're not required to, but if you have access to OneNote, we're in the icon on the left-hand side, OneNote presentation 1626, bell curve people weight example tab. We're also uploading transcripts to OneNote, so you can go to the View tab, Immersive Reader tool, and change the language if you so choose, so you can either read or listen to the transcripts in multiple languages, with the time stamps tying into the actual video presentations.
Here in the OneNote desktop version: in prior presentations we've been thinking about how we can represent different data sets, both with mathematical representations, such as the average or mean, the median, the mode, and quartiles, and with pictorial representations, like the box and whiskers plot and the histogram, the histogram being the primary tool we visualize when thinking about the spread of the data, using descriptive terms such as "the data is skewed to the left" or "the data is skewed to the right."

We then wanted to think of formulas to graph lines or curves that may, in certain cases, approximate the actual data sets. Whenever we can do that, it's great, because the formulas give us more predictive power over whatever the data set is representing. We've talked about different kinds of data that might conform to different curves in the past, such as a uniform distribution, binomial distribution, Poisson distribution, and exponential distribution. Now we're continuing on with the most famous of them, the normal or bell curve type of distribution, remembering that not all data is going to conform to any of these distributions; we could have data with no pattern at all that can't be represented by a simple curve. But many things in nature, like heights, like weights, or like how close people's estimates come to a particular target, are often well represented by a bell curve. If you don't have any actual data to practice with, you might want to check out Kaggle.com to get some data sets to practice with in Excel, or you can create your own data sets using a tool in Excel that we have talked about in prior presentations.

This time our data is on the left, where we have height data and weight data. One of the primary differences between this data set and the ones we've looked at before is that we have a whole lot more data this time. That means that if we're talking about something that should conform to a bell curve, and I create a histogram, the pictorial representation of the data, it's going to look a lot more bell shaped, because we have a whole lot more data. With less data it might still be something that should conform to a bell-shaped distribution, but with fewer points to represent it, the histogram won't look as much like a bell; it will be clustered more jaggedly into a bell-like shape. We're also going to do some slightly different graphs this time, graphing our bell curve so we can ask questions about the area under the curve at the high end, the low end, or in the middle.

So here's our data, and we start the same way we have in the past: we look at the data and ask, does this data conform to a bell curve type of distribution? If it does, then we can plot the bell-shaped distribution and ask questions about it from there. First we take the mean of the data, which is simply the average: in Excel, that's all of the weights added up, divided by the count of the numbers, giving an average of about 127 pounds. The standard deviation is calculated next, for the population in this case, since we're treating all of the weights as the whole population, and that gives us 11.66, a measure of the spread of the data. The median, the one in the middle if we were to sort the data from bottom to top, is calculated with the MEDIAN formula and comes out to about 127 pounds as well. That's quite close to the mean, which is an indication that this data might conform to a bell curve.
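These descriptive statistics can be sketched outside of Excel too. Here's a minimal Python version using only the standard library; the short list of weights is a hypothetical stand-in for the video's 25,000-row data set, and each call is annotated with the Excel formula it mirrors (the mode is discussed next in the video):

```python
import statistics

# Hypothetical sample of weights in pounds, standing in for the 25,000-row data set.
weights = [112.4, 119.8, 123.5, 126.9, 127.1, 127.3, 130.0, 134.6, 141.2, 127.1]

mean = statistics.fmean(weights)      # Excel: =AVERAGE(range)
stdev_p = statistics.pstdev(weights)  # Excel: =STDEV.P(range), population std dev
median = statistics.median(weights)   # Excel: =MEDIAN(range)
mode = statistics.mode(weights)       # Excel: =MODE.SNGL(range)

print(mean, stdev_p, median, mode)
```

With a real weight data set, a mean, median, and mode that all land close together is the same bell-curve hint the video describes.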
At this point we're saying: hey, look, it's weights, something in nature, so my intuition already says this might conform to a bell curve. Then the mean is about the same as the median, which is another indication. Next I take the mode, the value that comes up multiple times, using the single-mode formula. Note that the mode is a little less reliable here because we have decimal points in the measurements: if we didn't have a whole lot of data, it would be unlikely to get multiple occurrences of the same exact measurement, because the measurements are quite detailed, whereas without the decimal points, repeated values would be much more likely even in a smaller data set. But because we have such a large data set, the mode is still relevant, and it's pretty close to the mean, which is yet another indication that this might conform to a bell-shaped curve.

If we plot this, and I plotted it down here, it looks like this: a histogram with the middle point over here. It doesn't look like a smooth curve, because it's still a histogram, but because we have a lot more data, it looks a lot more bell shaped than some of our other examples. Remember, even if we were looking at weights with a lot less data, it would still be packed up in the middle, but it would look a lot more jagged, because we wouldn't have as much data to be representative. So we're saying, okay, it looks a lot like it's going to conform to a bell curve. Now let's plot the smooth curve based on our formula.

Let's plot this out using NORM.DIST. We want to take the x's, but where do I start them? This is a common question we've run into in prior presentations. We're looking at pounds, so I might say, why don't I start at zero pounds and go up to something like 500? But it's unlikely anyone hits 500 pounds, that would be quite a heavy individual, and it's also unlikely anyone goes down to zero, so we probably don't want that range. We know that going four standard deviations from the mean captures the vast, vast majority, almost one hundred percent, of the data, even though in theory the tails of the bell curve go on forever. So let's do the four standard deviations, which is what we've done in the past. The lower end is the standard deviation of 11.66 times 4, which is 46.64, subtracted from the mean of 127.08, giving us 80.44. So I don't need to go down to zero pounds; I can go down to 80, since a full-grown adult at 80 pounds is not general. Going the other way, 11.66 times 4, or 46.64, plus the middle point of 127.08, gives us 173.72. So our x's will go from 80 up to that high point of about 174, and that should capture the data we need to plot in the graph. Then we can calculate our P(x) values with the NORM.DIST formula: the x, in this case 80, then the mean and standard deviation, which are these two numbers, the 127.08 and the 11.66.
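As a cross-check of the mean-plus-or-minus-four-standard-deviations range and the density values, here's a minimal Python sketch; the `norm_pdf` function below is the standard normal density formula, which is what NORM.DIST(x, mean, sd, FALSE) evaluates:

```python
import math

mean, sd = 127.08, 11.66   # the video's weight data

# The x-axis runs from mean - 4*sd to mean + 4*sd, i.e. roughly 80.44 to 173.72.
lo = mean - 4 * sd
hi = mean + 4 * sd

def norm_pdf(x, mu, sigma):
    """Excel: =NORM.DIST(x, mu, sigma, FALSE) -- the density, not cumulative."""
    return math.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

xs = [lo + i for i in range(int(hi - lo) + 1)]   # one-pound steps across the range
ps = [norm_pdf(x, mean, sd) for x in xs]

print(round(lo, 2), round(hi, 2))
```

Plotting `ps` against `xs` gives the same smooth bell shape the Excel chart shows.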
We do not want it to be cumulative; that's what's represented by the zero. If I copy this all the way down, I can ask, for example: what's the likelihood that we have someone at 93 pounds, given our normal distribution? About 0.05 percent. What's the likelihood of someone at 96 pounds? About 0.1 percent. Notice that the questions we're actually likely to ask are things like: what's the likelihood of someone on the high end, say 148 pounds or above, or someone at 110 pounds and below? You might think you could just sum these values up, but you can't do that exactly, because we're really talking about the area under the curve, although summing will give you an approximation.

Now we can compare this to our actual data set. The actual data over here is counts, actual numbers of people at each weight in pounds, whereas the curve is in percents, so what I'd like to do is count my data and then convert it into percents. So I'll do my frequency distribution. For example: how many people in our sample or population weigh more than 84 pounds, up to and including 85 pounds? Three of them. How many are over 85 pounds, up to and including 86 pounds? Two people. The FREQUENCY formula is up top; it's a fancy array formula that takes the data array, which is all of this data, and the bins, which are the x's, and it spills out as a spill formula. If I add up the total of all of this data, we come to 25,000 data points. I didn't show all those rows here, because I only went down so far, but in Excel it goes on to 25,000.

In other words, if I count this data with a COUNT formula, Excel counts every line: 25,000. That's why we have a lot of data, and that's why, when we graphed it, we got a pretty smooth graph. We can also verify that the data has been properly put into bins, at least to some degree, by the fact that the total down here ties out to the number of data points, 25,000, now allocated to the bins. Then I can compare, for example: what's the expected count at around 90 pounds based on the bell curve, versus my actual data? I can convert the bell curve by taking the 25,000 times 0.009, because it's 0.9 percent, and I get 225, which is pretty close to the actual count. Similarly, 25,000 times 0.0235 gets me pretty close to the actual count of about 558 for that bin. Or I can go the other way and convert my actual data into percents by dividing each count by the total. Going down to where we have some larger numbers: an actual count of 220 divided by the total of 25,000 gives us, moving the decimal two places over, about 0.88 percent, and 207 divided by 25,000 gives about 0.83 percent. If I add up all the percentages, they should come to one hundred percent; I didn't do the total down here, but they should. Then I can look at the differences between the likelihoods based on NORM.DIST and our actual data, and that can further confirm whether or not we're tied into a bell curve type of situation.

Then we have the z score. The z score is another representation, different from the x, measuring the distance from the mean. If I look at a bell curve, the middle point would be in here, and the z score would be zero in the middle, with negative numbers below the mean and positive numbers above it.
The further a z score is from zero, the less "normal" the value; normal is in the middle. So how do I calculate the z score? It's each x, in this case 80, minus the middle point, which is the mean, so minus the 127.08, divided by the spread number, the standard deviation of 11.66. That gives us negative 4.04. Notice that's unusual: it's way far below the middle point. As we move down the column, the z score gets closer to normal, going toward zero; it hits zero around our mean of 127 to 128, and then goes positive, getting less and less normal on the high end. In this case the high end represents more weight and the low end less weight. With these normal distributions, normal is usually treated as the baseline, though obviously with weight you might be away from the norm in a healthy way, by being more muscular, for example, and you could look at separate distributions for people with different body compositions, say the average for a particular type of athlete, and so on.

So now we can ask questions such as this one: P(x greater than or equal to 133). Looking at our graph over here, this graph, by the way, plots the P(x) curve and the actual data as percents, so we can put them on top of each other, and we can see that the actual data lines up pretty closely with the curve, which is an indication that the curve will have good predictive power.
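The z-score conversion is simple enough to sketch directly; this mirrors the Excel calculation =(x - mean)/stdev (equivalently, Excel's STANDARDIZE function):

```python
mean, sd = 127.08, 11.66   # the video's weight data

def z_score(x, mu, sigma):
    """Excel: =(x - mean) / stdev, i.e. =STANDARDIZE(x, mean, stdev)."""
    return (x - mu) / sigma

# The lowest plotted x, 80 pounds, is about 4 standard deviations below the mean.
print(round(z_score(80, mean, sd), 2))   # -> -4.04
```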
Now we're trying another way to graph our curve: an area graph this time, which might let one graph cover several different scenarios. A question about something being above a point would be the blue part; if it were below, I'd look at this other line and below; and if it were between, I'd look at the area in between. So here's our first question: P(x greater than or equal to 133). If 133 is here, we're looking at the blue area above that point. To get it, I have to use one minus, and why? Because the cumulative distribution only accumulates from left to right; I can't go from right to left. So if I'm looking at this blue area on the right side, I take the whole thing, which is one hundred percent, the total area under the curve, represented by 1, minus everything up to this point, which leaves me this bit. That's 1 minus NORM.DIST of the x, which is 133, comma the mean and standard deviation, 127.08 and 11.66, comma cumulative, and yes, this time it needs to be cumulative. That gives us 30.58 percent.

What if I do this in terms of z scores instead of x? Let's convert that x to a z score: 133 minus the middle point, the mean of 127.08, divided by 11.66, gives a z score of about 0.51, a little above the middle point of zero. Then I can use a similar formula, 1 minus NORM.S.DIST; all I need is the z, because the x, the mean, and the standard deviation were in essence combined together when we calculated the z, and once again it's cumulative. We get the same result of 30.58 percent.

Before moving on, let's graph that first question: x greater than or equal to 133, or P(z greater than 0.51). That's an IF formula: the logical test picks up each x and asks whether it's greater than or equal to 133; if so, give us the P(x); if not, put a blank there, which is double quotes, a text field with nothing in the middle. We could have done the same thing with the z score of 0.51. So the column is blank until we get way down to 133, and if I graph this on top of what I had on this side, it gives me that separation.

Now the next question: we want x to be less than or equal to 109. If 109 is right here, we're looking at the blue area on the left side. Because the cumulative goes up to and including that point, we can just use it directly. You might also say, why don't I just sum the P(x) values from 109 on down, because I can do that easily in Excel? But that's an approximation, not exact, because we're looking at the area under the curve; to be exact, you want the formula. So it's NORM.DIST, the x is 109, the mean and standard deviation are 127.08 and 11.66, and do we want it cumulative? We do, therefore a 1, and we get 6.05 percent. We can also take the z score: convert 109 to a z by subtracting the middle point of 127.08 pounds and dividing by the standard deviation of 11.66, which gives negative 1.55. Then I can use NORM.S.DIST to get the same 6.05 percent using just the z, again cumulative.
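Both tail questions come down to the cumulative distribution, which is what NORM.DIST with cumulative = TRUE (or NORM.S.DIST on a z score) returns. Here's a minimal Python sketch using the error function, which is the standard way to write the normal CDF:

```python
import math

mean, sd = 127.08, 11.66   # the video's weight data

def norm_cdf(x, mu, sigma):
    """Excel: =NORM.DIST(x, mu, sigma, TRUE) -- area under the curve up to x."""
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(X >= 133): the upper tail, so 1 minus the cumulative up to 133 (~30.58%).
upper = 1 - norm_cdf(133, mean, sd)

# P(X <= 109): the lower tail, which the cumulative gives directly (~6.05%).
lower = norm_cdf(109, mean, sd)

print(round(upper * 100, 2), round(lower * 100, 2))
```

The same numbers come out of the z-score route, since norm_cdf(x, mean, sd) equals the standard normal CDF evaluated at z = (x - mean)/sd.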
Remember, the z has in essence combined together the x, the mean, and the standard deviation. All right, now let's take the middle question: P(x less than 133 and greater than 109). If I look at that over here, with x less than 133 and greater than 109, that's the orange area. So how do we do that? With the cumulative I can only add up from the left to a certain point, so I take the cumulative up to the higher end, which covers all of the orange plus the lower blue tail, and then subtract out the cumulative of that blue tail to back into the orange area.

First, let's think about how to graph the lower piece: x less than or equal to 109 is another IF calculation up top. The logical test asks whether each x, starting at 80, is less than or equal to 109; if so, pick up the P(x); if not, put a blank. So now you've got data from here down to 109, and then it's all blank, and I can graph this as another layer on top of my graph. (I also have dynamic headers, which we talked about last time; I won't go into that in detail here.) Then the middle bit itself is NORM.DIST of the higher x, 133, with the mean and standard deviation, cumulative, minus NORM.DIST of the lower x, 109, with the same mean and standard deviation, also cumulative.

I can also think about it as one minus the two results we got up top. In other words, looking at my data up top, we did P(x greater than or equal to 133) and P(x less than or equal to 109); those are the two tails, the two blue ends. The whole thing adds up to one hundred percent, so I can subtract the tails out: one hundred percent minus the 30.58 plus the 6.05, and I get the middle part of 63.37 percent. That's another way to envision it. I could do the same with z scores, getting the z's the same way: 133 minus the mean, divided by the standard deviation, gives the z, and then NORM.S.DIST with the higher z score, cumulative, minus NORM.S.DIST with the lower z score, also cumulative, gets to the same 63.37.

Note that when I look at this graph, this last series is P(x less than or equal to 133 and over 109), with our dynamic header, which we'll show how to build in Excel. One way I can build it, instead of redoing the whole thing from the information to the left, is to sum the two tail columns, the upper and lower ends, and wrap an IF around that sum.
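The middle-area calculation, cumulative up to the high x minus cumulative up to the low x, along with the 100-percent-minus-the-two-tails cross-check, can be sketched as:

```python
import math

mean, sd = 127.08, 11.66   # the video's weight data

def norm_cdf(x, mu, sigma):
    # Excel: =NORM.DIST(x, mu, sigma, TRUE)
    return 0.5 * (1 + math.erf((x - mu) / (sigma * math.sqrt(2))))

# P(109 <= X <= 133): cumulative up to the high x minus cumulative up to the low x.
middle = norm_cdf(133, mean, sd) - norm_cdf(109, mean, sd)   # ~63.37%

# Cross-check: 100% minus the two tails gives the same answer.
check = 1 - (1 - norm_cdf(133, mean, sd)) - norm_cdf(109, mean, sd)

print(round(middle * 100, 2))
```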
The IF says: if the sum of those two tail columns is greater than zero, put nothing, double quotes; if not, put the P(x). By doing that, I get this middle part as the series being graphed. The point of all this is that now we have one graph that can be dynamic enough to answer any of these questions: I can use the same graph to look at the tail on the high end, to ask a question on the low end, or to ask a question about the middle, because of the way I'm entering it into the system, where these cells are always the high and low points and this is the middle. We'll do that in Excel if that's something of interest: one graph that's a little bit dynamic once you put it together, so you can visualize multiple of these kinds of questions.

Now, another question we haven't looked at as much: the probability of the lower end. What if we know the probability of the lower end is 45 percent, and we're trying to find the x value or the z value? In other words, I know the area under the curve on that end, and I'm trying to find the corresponding point on the x axis. For a question like that, we use the NORM.INV formula: the input is the probability instead of the x value, followed by the same mean and standard deviation, and it gives us back the x value. Same with the z: if I know the 45 percent, I'm backing into the z value, reversing the algebra but doing it with a formula. So now we solve for the inverse with NORM.S.INV, and all I need is the probability of 45 percent, because the mean and the standard deviation are already included in the z value calculation. That gives a z of about negative 0.13 and an x of about 125. If I look at the graph, the x of 125 and the z of negative 0.13 land at about the same point, allowing for some rounding.

Then we can ask about the probability of the upper end, and notice that these two are complements: 45 and 55 add up to 100. What if I know the upper end, this blue area, is 55 percent? How do I back into the related x or z? That's another inverse, the same NORM.INV, but the probability is now one hundred percent, or 1, minus the probability here: 1 minus the 0.55, then comma the mean and standard deviation, and that gives you the x. For the z value, same concept: NORM.S.INV of 1 minus the 0.55, and you get to the same value. They're the same numbers because we chose the 45 percent and the 55 percent: if the probability of the lower end is 45 percent, the upper end is 55 percent, so you end up at the same x value, which is about 125.61.
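The inverse lookups, NORM.INV and NORM.S.INV, have a direct analogue in Python's standard library via `statistics.NormalDist.inv_cdf`; here's a minimal sketch of the video's 45/55 percent example:

```python
from statistics import NormalDist

dist = NormalDist(mu=127.08, sigma=11.66)   # the video's weight distribution

# Excel: =NORM.INV(0.45, mean, stdev) -- the x with 45% of the area to its left.
x_lower = dist.inv_cdf(0.45)                # ~125.61 lb

# Excel: =NORM.S.INV(0.45) -- the matching z score (~-0.13).
z_lower = NormalDist().inv_cdf(0.45)

# If 55% is in the UPPER tail, feed the inverse 1 - 0.55 instead: same x.
x_upper = dist.inv_cdf(1 - 0.55)

print(round(x_lower, 2), round(z_lower, 2), round(x_upper, 2))
```

The lower-tail 45 percent and upper-tail 55 percent are complements, which is why `x_lower` and `x_upper` land on the identical point.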