Statistics and Excel: bell curve test score data example. Got data? Let's get stuck into it with statistics and Excel, although we'll be using OneNote this time while talking about Excel. If you would like a commercial-free experience, consider subscribing to our website at accountinginstruction.com. If you have access to OneNote, we're in the icon on the left-hand side, OneNote presentations, 1016 bell curve test score example tab. We're also uploading transcripts to OneNote, so you can go into the View tab, take a look at the Immersive Reader tool, change the language if you so choose, and then either read or listen to the transcript in multiple languages, tying into the video presentations with the time stamps. OneNote desktop version here.

In prior presentations, we thought about how we can represent different data sets using both mathematical calculations, such as the average or mean, mode, quartiles, median, and so on, and pictorial representations, such as a box and whiskers plot or a histogram, the histogram being the primary tool we use to envision the spread of the data. We can use descriptive terms about the spread of the data on the histogram, such as "the data is skewed to the left" or "the data is skewed to the right." We then thought about certain curves and lines that we might be able to represent using formulas, which could give us an approximation of the actual data.
Whenever we can do that, it's a great thing, because having a formula generally gives us more predictive power into the future over whatever the data represents. Some of the curves and lines we've talked about in the past are the uniform distribution, the binomial distribution, the Poisson distribution, and the exponential distribution. Now we're looking at the most famous one of them all: the normal distribution, or the bell curve. The bell curve, like all the other distributions, is only going to be useful when we're looking at data that conforms to a bell-curve shape. So we have the same kind of process we might go through: look at the data, test whether it conforms approximately to a bell curve, and if it does, then we might be able to use an actual bell curve to give us that kind of predictive power over the data.

Many things do align to a bell-curve type of shape. One of the original ways the bell curve came about was actually from analyzing errors in approximation. When astronomers tried to predict where a star or a planet would appear, their calculations would all be off, but the errors, averaged together, tended to be distributed normally around the correct answer, which is quite interesting. It also suggests how you might structure certain kinds of problems: if you're trying to find something lost in the ocean, for example, instead of getting in a room and having everybody come to an agreement, you might have everybody do their own calculations independently and then average them, which in some cases may be a better way to find an answer.

A lot of things in nature also roughly conform to a bell curve. Whenever we're talking about heights of humans, animals, or trees, you would expect most values to cluster around the middle with a bell-curve distribution spreading out on either side. You could do that for heights in general, for heights of males and females, for weights, or for things like calorie intake; you would expect values to hover around some midpoint, going above and below it from time to time but not too far. There are many, many things where a bell curve might approximate the data. However, it does not approximate every data set, which is a common misconception; people sometimes think everything should conform to the bell curve, and that's not necessarily the case. We run into the same issue we saw with the prior curves: the data could be doing something chaotic that no curve approximates. So what we want to do is determine whether the data can be approximated, and if it can, then apply the curve.

Test scores are, of course, a classic example of something that might conform to a bell curve: if you have a bunch of people at the same grade or level of education, you would expect their results to conform somewhat to a bell-curve distribution, and grades are one of the most common examples for both instructors and students, who are always analyzing them. So let's imagine we have this information for grades.

Note that in Excel you can generate your own data. In real life, we're imagining the instructor has compiled the data and is now looking at it to see whether it conforms to a bell curve. In practice you might ask: how could I make up data like this? You can't just use a plain random number generator, because the values shouldn't be uniformly random; they should follow a bell-curve distribution. There's a tool in Excel that does this, though: in the bar up top under the Data tab, in the Analysis group, it's Data Analysis. We turn this on when we work this problem in Excel, so if you want to work the Excel problem, you can check it out there. If you don't see the Analysis group, you can go into the File tab, then Options, and turn on the Data Analysis add-in, which again we do in the Excel practice problem. It's a great tool for practicing with bell curves and some of the other distributions, because you can generate your own data to practice with. To generate that data we need to supply a mean and a standard deviation; in real life we might not know those up front, because if we were the instructor, we would just pick up the data and run the analysis on it.

So here are our randomly generated test scores. Note that I purposely made the test scores not decimals or percentages but numbers like 90.97, representing about 90 percent, because that makes it a little easier when we do the NORM.DIST calculation. In other words, if you're working with a data set represented in decimals or percentages, you might want to multiply it by 100 so that you end up with whole-number representations; then you'll effectively have percentages when you do the NORM.DIST. So that's what we did here: the 76.26 represents 76 percent, and so on and so forth.

Based on that data, we're imagining we're an instructor who has collected these scores over multiple years or something like that. We can take the mean or average of the data, which in Excel is of course the trusty AVERAGE function: =AVERAGE over all of this data, which comes out to 74.92. Now you might say: why isn't the average exactly 75, since I used 75 to generate the data in Excel? Because there's randomness involved, the result lands close to that midpoint of 75 but not exactly on it: 74.92. Then the standard deviation, which here is the standard deviation of the population, gives us 10.09, which once again is not exactly the 10 we used to generate the data, because of that same randomness. Then we have the median, the one in the middle: =MEDIAN of this data. Note that the mean and the median are pretty close together; that is one indication this data might conform to a normal distribution or bell curve. You might also calculate the mode, the value that comes up most often, and if that is also similar to the mean or median, it would be another indication that you're close to a bell-curve distribution. In this case the mode might not work exactly, because notice we're not representing the grades as whole numbers like 72 or 90; we've got decimals, which makes it a lot less likely that multiple values will be exactly the same. However, if we rounded to whole numbers, the mode would be quite likely to be a useful tool, and it would probably be similar to the mean, something like 74.
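As a rough cross-check of the Excel steps above (AVERAGE, STDEV.P, MEDIAN), here is a minimal Python sketch that generates normally distributed scores and computes the same summary statistics. The mean of 75, standard deviation of 10, and sample size of 1,000 are assumed from the example's setup, and the exact results will differ from the Excel numbers because the random draws differ.

```python
import random
import statistics

# Assumed generating parameters from the example: mean 75, sd 10, 1,000 scores.
random.seed(42)  # fixed seed so the sketch is repeatable
scores = [random.gauss(75, 10) for _ in range(1000)]

mean = statistics.mean(scores)      # Excel: =AVERAGE(range)
sd = statistics.pstdev(scores)      # Excel: =STDEV.P(range), population sd
median = statistics.median(scores)  # Excel: =MEDIAN(range)

# Randomness means these land near, not exactly on, 75 and 10,
# and mean ~ median is one hint the data may be roughly bell-shaped.
print(round(mean, 2), round(sd, 2), round(median, 2))
```

The point is the same as in the transcript: the sample statistics hover around the generating parameters without hitting them exactly.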
All right. So the next step is to say: it kind of looks like this data conforms to a bell curve, given the mean and the median being about the same, and we can make a curve from it to test that further. If I make a histogram of this data, it looks like this, with the buckets down below; the buckets start around 40 on the low end, and the middle bars, around 71 to 75, sit near the mean of roughly 74. You can see it looks somewhat like a bell curve. Notice it's not going to be exactly a bell curve, because we don't have an unlimited amount of data; I can't remember exactly how many data points we put here, but the more data points, the closer you would expect it to conform to a bell curve, and with fewer data points you would expect it to be more jagged. It will also be affected by the bucket sizes we use down below, which we talked about in prior presentations. But we're saying: look, the mean and the median are pretty close, it looks like it's roughly conforming to a bell curve, and we have an intuition that this kind of data might conform to a bell curve. So we might then want to graph an actual bell curve.

We're going to choose the x's and the p(x)'s, and then plot those points so we can create a chart from that information. So the question is: where do the x's start? I can think about this and say, look, these are grades, so I would expect them to run from 0 up to 100, with 100 representing 100 percent and 0 representing 0 percent, and I could do it that way. But in other examples I might not have that convenience, so where should I start and end the x's that will show up on the x-axis? We know that in a bell curve the vast, vast majority of the data will fall within four standard deviations of the mean. So we can take four standard deviations both above and below the mean, and that should encompass essentially all of the data. For example, the standard deviation is 10.09; times four gives us four standard deviations; starting at the middle point of 74.92 and subtracting that, the lower x should be about 34. I can really just go down to 34 and still pick up all the data; I don't have to go down to zero, because it's unlikely anyone scores all the way down at zero; that would be a very badly performed test, since people could guess and probably do better than that. Doing it again, 10.09 times four added above the mean of 74.92 gets us to about 115. Now you might say 115 doesn't make any sense because it's over 100, though in some cases you might have something like extra credit. But note that the theoretical bell curve goes on forever, so it can be useful, even though in practice scores won't go over 100, to plot out to 115 so I can see the entire bell curve tapering off, as bell curves do, and see the total area adding up to 100 percent.

Let me show you what I mean. We've got the x's, going from 34 on up, and then we do our p(x) calculation using the NORM.DIST function within Excel. This is going to be similar to the Poisson distribution or binomial distribution functions we talked about in the past, but now it's NORM.DIST. The x is going to be 34 in this case, representing that grade: what's the likelihood of a 34, talking about the bell curve with the mean and standard deviation defined over here? The mean is the 74.92, then comma, the standard deviation is the 10.09, and then: is it cumulative or not? This cumulative argument is similar to what we saw with the Poisson distribution, although here we're representing the area under the curve, so it's a little different; there's kind of calculus involved, the area under the curve, but the concept is basically the same: do we want everything accumulated up to that point, or just that point? In this case we want just that point, so it's FALSE, or zero, and we get approximately zero percent. If we copy this down, we can ask: what's the likelihood of a 46? About 0.07 percent. Those are the sorts of questions we can ask with this.

Now, most of the time, if you're a student, you're probably asking something like: what's the likelihood I get a 70 or above? You might say, well, I can go down to 70 here and add up everything from 70 on up. But you can't usually do that exactly; although you might have been able to with a Poisson distribution, here we're talking about the area under a continuous curve, so summing the points only gives an approximation. You'd want to use the cumulative form of the formula, which we'll talk about in a little bit. So there's our p(x), and that gives us our approximating curve. If I look at it, it looks something like this: the blue representing the fitted curve and the orange representing our actual data plotted on top, and you can see it lines up fairly well; I'll show you how we did that in a second. So then, just to note, next we look at the frequency.
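Excel's NORM.DIST(x, mean, sd, FALSE) is just the normal density function, which can be written out directly. A sketch, assuming the mean of 74.92 and standard deviation of 10.09 from the example:

```python
import math

def norm_pdf(x, mean, sd):
    # Same value Excel returns for NORM.DIST(x, mean, sd, FALSE):
    # the height of the bell curve at x, not an accumulated area.
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

mean, sd = 74.92, 10.09  # values from the example data

p34 = norm_pdf(34, mean, sd)  # four sd below the mean: essentially 0%
p46 = norm_pdf(46, mean, sd)  # roughly 0.07% once expressed in percent terms
```

Multiplying the returned density by 100 gives the percent-style numbers the transcript reads off the worksheet.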
Now, I might want to compare my p(x) information to my actual data, and I have a problem: my actual data over here is expressed as counts of scores, while my p(x) is a likelihood, like the likelihood of a 34. So I could either convert my p(x) into a count by multiplying by the number of samples we have, or take the actual data and turn it into percentages, which is probably the more common approach. To compare, I first want to count how many of my actual data points fall into each bucket. Note how the bucket edges work: for the 35 bucket, I'm counting how many values in the actual data are above 34 up to and including 35; then how many are above 35 up to and including 36; and so on. That's our frequency calculation, which uses the FREQUENCY function: the data array is just all of our data, saying that's what I want you to count, and the bin array is all of this bucket information, the bins we want the counts placed into. So: how many values were above 34 up to and including 35? Zero. Scrolling down: how many above 40 up to and including 41? Two. How many above 50 up to and including 51? Three. And so on and so forth. If I go all the way down and add up all of those counts, it adds up to a thousand, which should be the actual number of grades in our sample data; that's a nice check number to make sure we picked all of them up and put each one in an appropriate bucket.

Now, the problem is that I can't directly compare a count like 10 to what I got from NORM.DIST, because NORM.DIST gives a likelihood, something like 0.79 percent. I could multiply that by the sample of 1,000: 1,000 times 0.0079 would be 7.9, and I could do that all the way down to compare against the actual counts. Or I can take my actual counts and divide each by the total, which is what we'll do here. So if I take this 14 and divide it by a thousand, moving the decimal two places, I get 1.4 percent; now that's in percentage format, so I can compare it to the NORM.DIST percentage. I just do that all the way down: this 19 is 19 divided by the count of a thousand, so 1.9 percent, which is pretty close to the 1.97 we got with the NORM.DIST function. Then I can compare these and look at the differences: this is what I got with the function, this is my actual percent-of-total data, and you can see we're pretty close, which gives us another indication of how far from a normal distribution we are.

Then we've got the z-score. The z-score is another way to represent our data: it expresses each value in terms of how far it is from the middle point, the mean of our normal distribution. Looking at my information over here, the middle point on our graph is 74.92, with a standard deviation of 10.09. So I can start to measure how far above or below that middle point each value is; how far away you are from the middle is the z-score.
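The FREQUENCY-plus-percent-of-total step can be sketched the same way. The one-unit bin edges and the 1,000-score sample are assumptions for illustration; the "above the previous edge, up to and including this edge" rule matches how Excel's FREQUENCY assigns values to bins.

```python
import random

# Assumed sample: 1,000 scores around mean 75, sd 10, as in the example.
random.seed(1)
scores = [random.gauss(75, 10) for _ in range(1000)]

# One-unit buckets: bin b counts values above b-1 up to and including b,
# mirroring Excel's FREQUENCY behavior. Edges chosen wide enough to
# catch every draw.
bins = list(range(0, 151))
counts = {b: sum(1 for s in scores if b - 1 < s <= b) for b in bins}

# Check number: every score should land in exactly one bucket.
total = sum(counts.values())

# Dividing by the total turns raw counts into percent-of-total,
# which is directly comparable to the NORM.DIST percentages.
percents = {b: 100 * counts[b] / total for b in bins}
```

The `total == 1000` check plays the same role as summing the FREQUENCY column in the worksheet.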
The closer the z-score is to zero, the closer you are to the mean; positive means you're above the middle, negative means you're below. How do we calculate it? We take each point, in this case 34, minus the mean of 74.92, and divide by the standard deviation of 10.09, which gives us, in this case, negative 4.06. That's the z-score, and I can compute it for each of these data points. Remember, when we're comparing things like job performance, school performance, or sports performance, we often have to represent things as percentages rather than raw counts of how many we got correct or incorrect, taking a percent of the total. That's even more useful when you're comparing uneven things like batting averages or job performance, where everyone had a different number of tries; the percentage is a useful tool in itself for comparability. And then the z-score gives us a further comparison: how close we are to the middle point of a normal or bell-curve distribution, which represents what's "normal," the average.

Notice the bell curve is kind of interesting historically, too, because before it, you have to think people might not have had as tuned a sense of what "normal" is; the bell curve almost seems to define normal. People who were very tall used to be thought of as kind of weird; now they're often idolized for being very tall, though to some degree that's always been the case. But "too tall" meant abnormal, and abnormal was bad; and how do you know it's abnormal? Because we have a bell curve, and being too tall or too small falls outside the range. So you have to wonder whether, before the bell curve, people really thought in terms of how abnormal something is. In any case, the middle point here is around 74.92, so you can see the z-scores near it are around zero, with positive z's going above that point. Okay, so that's the z-score.

Now we can ask questions such as: what if the test score is 90 and the operator is less than or equal to; that is, what is P(x ≤ 90)? How would we get that calculation? If I look over here and go down to 90, I can see the likelihood of exactly a 90 is about 1.29 percent, but that's not usually what I want; I usually want greater than or equal to 90, or possibly less than or equal to 90. You might think: I could add up all these p(x) values, but again, that only gives an approximation, because we're talking about the area under a continuous curve. To get an exact number for less than or equal to 90, we use =NORM.DIST with the x of 90, the mean of 74.92, the standard deviation of 10.09, and then the cumulative argument set to TRUE, which means it adds everything up, giving us 93.24 percent. This isn't a test score; it's the likelihood of a score at 90 or below based on the bell curve. In other words, we would expect 93.24 percent of participants to score 90 or less on the test.

And if I look at the z-score, where does 90 line up? We've got 90 minus the mean of 74.92, which is about 15, divided by the standard deviation of 10.09, giving a z-score of 1.49. Remember that zero is "normal"; zero would be at the 74.92. You might think the average on a test should be around 70, but what matters is the average on this particular test; in this case the middle point is about 74 to 75, so if you get a 90, you're clearly well above that, at 1.49 in terms of z-scores. We can also calculate the same probability, P(x ≤ 90), not with x this time but with the z-score, because the z-score is just another way to represent the data. With a z-score, the formula is =NORM.S.DIST, and there are only two arguments: the z and cumulative. We only need the z because the z-score calculation already folded in the x, the mean, and the standard deviation; so all we need is the z, plus TRUE for cumulative, to get to the same answer.

Then, if we want a question like greater than or equal to 90, it's a similar thing: looking over here at the 90, with its roughly 1.29 percent point likelihood, you could sum everything from there on up, but that will not be exact; it's only an approximation. To be more exact, we can use a formula.
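The cumulative branch, NORM.DIST(..., TRUE), is the normal CDF, which can be expressed through `math.erf`. A sketch using the example's mean and standard deviation, with NORM.S.DIST mimicked by feeding the z-score into a standard normal (mean 0, sd 1):

```python
import math

def norm_cdf(x, mean, sd):
    # Excel's NORM.DIST(x, mean, sd, TRUE): area under the curve up to x.
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mean, sd = 74.92, 10.09  # values from the example data

p_at_most_90 = norm_cdf(90, mean, sd)  # ~0.9324, i.e. ~93.24% score 90 or less

# NORM.S.DIST route: the z-score already folds in x, mean, and sd,
# so only the z and the cumulative flag are needed.
z_90 = (90 - mean) / sd                # ~1.49
p_from_z = norm_cdf(z_90, 0, 1)        # same ~93.24%
```

Both routes agree exactly, which is the transcript's point about the z-score compacting the x, mean, and standard deviation into one number.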
One way: if less than or equal to 90 is 93.24 percent, and the two sides have to add up to 100, then 100 percent minus 93.24 gives us 6.76 percent as the likelihood of the other side, greater than or equal to 90. Or we can do it in one cell: 1, representing 100 percent, minus NORM.DIST of the x of 90, the same mean we saw up top, the same standard deviation, with cumulative set to TRUE. That's doing the same thing we did above, one minus the answer we got before, just in a single cell. And we can do the z-score version the same way: the same z, but now asking for over the 90, the same concept, one minus the NORM.S.DIST result. So we can do the same thing with either x's or z's, whatever the case may be.

Then we could ask about an in-between range. The next question you would ask is: what's my likelihood of getting somewhere between a 75 and a 90? I can live with that. You could sum the individual p(x) values, but again it won't be exact, because it's the area under the curve, so we can be more exact with a formula. You can write this a couple of different ways: P(75 ≤ x ≤ 90), with the x in the middle of the inequality, or, as I kind of like writing it even though it's probably less standard, with the x on the left: x ≥ 75 and x ≤ 90. In any case, the formula looks like this. It's a bit uglier, because we have to use the cumulative argument, as we saw with the Poisson distributions, taking the cumulative on the high side. If I look at my picture over here (and the pictures are nice to use; we'll show how to make these in Excel): if I'm trying to find the blue area, I can take the whole thing, 100 percent, minus the cumulative orange area, and that leaves the blue area. If I'm trying to find something in the middle, I can take the cumulative up to the top point, subtract the cumulative up to the bottom point, and be left with the middle. And that's what this formula does: NORM.DIST of the higher x, the 90, with the mean and standard deviation, cumulative TRUE (that's what the 1 is), minus NORM.DIST of the lower x, the 75, gives us the bit in the middle: 42.93 percent. Remember, this does not mean you're going to get a 43 percent on the test; it means that of all the people taking the test, we expect around 43 percent to have scores between, and including, 75 and 90.

We can do that with the z-scores as well. If I calculate the 75 z-score: 75 minus 74.92, the distance from the middle point, divided by the standard deviation of 10.09, gives us about 0.008. And the 90 z-score is 90 minus 74.92 divided by 10.09, which gives us 1.494. So, same concept, but using z's instead of x's with NORM.S.DIST: all I need is the z, because the x, the mean, and the standard deviation are already compacted into it, plus the cumulative flag.
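The complement and the in-between range follow the same pattern; a sketch of both, again assuming the example's mean of 74.92 and standard deviation of 10.09:

```python
import math

def norm_cdf(x, mean, sd):
    # Excel's NORM.DIST(x, mean, sd, TRUE)
    return 0.5 * (1 + math.erf((x - mean) / (sd * math.sqrt(2))))

mean, sd = 74.92, 10.09

# P(X >= 90): one (100%) minus the cumulative area up to 90.
p_at_least_90 = 1 - norm_cdf(90, mean, sd)                    # ~6.76%

# P(75 <= X <= 90): cumulative to the high bound minus cumulative
# to the low bound leaves the slice in the middle.
p_between = norm_cdf(90, mean, sd) - norm_cdf(75, mean, sd)   # ~42.93%
```

This is the "whole minus orange leaves blue" and "high cumulative minus low cumulative leaves the middle" picture from the transcript, written as arithmetic.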
The cumulative flag applies to each piece: NORM.S.DIST of the higher z, the 1.494, minus NORM.S.DIST of the lower z, gives us the same 42.93 percent, the part in the middle. So there are our z's.

We can actually plot this over here, too. If we want to make our graph, notice that I made those earlier charts as bar graphs, and this is an actual area graph now, so we did this one with areas. Then the question is: how can I draw these graphs? Which is quite useful, especially if you're like me and really not good at drawing them; that kind of held me back learning math in school, because I would get all mixed up on how many numbers should be on the x and the y. If you can make these in Excel, it's great, but it's still a little complicated to do, so we work through it in Excel so you can plot it out. We do the shading with an IF logical test: we take each x and test it against the 90 we're picking up in our formula over here (I highly recommend taking a look at it in Excel if you're interested in this). If the condition is true, the formula returns the result, which is the p(x); if not, it returns a blank cell, represented by the double quotes, because text always goes in quotes. So you can see the column has values up to the 90, and blanks after that. If I graph this series on top of my other chart, that's where we get this shaded region, which gives us that nice representation, so we can understand a little more pictorially what is happening.
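The IF-based masking trick for shading one region of the chart can be sketched outside Excel too: keep the density where the score is at or below the cutoff and emit a blank (here `None`, standing in for Excel's `""`) everywhere else, then plot the masked series on top of the full curve. The 90 cutoff, mean, and standard deviation are the example's values.

```python
import math

def norm_pdf(x, mean, sd):
    # Excel's NORM.DIST(x, mean, sd, FALSE)
    return math.exp(-((x - mean) ** 2) / (2 * sd ** 2)) / (sd * math.sqrt(2 * math.pi))

mean, sd, cutoff = 74.92, 10.09, 90
xs = list(range(34, 116))  # mean +/- roughly four standard deviations

# Excel: =IF(x <= 90, p_of_x, "") -- values up to the cutoff, blanks after,
# so a chart of this series shades only that region of the curve.
shaded = [norm_pdf(x, mean, sd) if x <= cutoff else None for x in xs]
```

Flipping the comparison to `x >= cutoff` shades the other tail instead, which is the same one-chart trick the transcript describes.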
Notice that this graph is pretty neat as well, in that we also put the x's here and the z-scores down here, so we can look at two x-axes. Remember, we talked about the idea that we can represent a lot of this in terms of x's, which are the grades, or in terms of z-scores, which are distances from that middle point. We can actually put both on the same chart, which is a little fancy, tricky thing to do in Excel, so you'll want to check that out in Excel so you can utilize that tool as well. Also note that you might be able to get away with just one of these graphs, because there are basically three common questions we ask (oftentimes more, but these are the common ones). First: what is everything above a certain point, like the 90? That would be the blue area. Second: what is everything below a certain point? Notice that one graph can answer both questions, because if the blue side is everything above a certain point, then, since it all adds up to 100 percent, the orange represents everything summed up to and including that point. The third common question is the in-between one, and for that you might have to do a slightly fancier graph. So we have a whole bunch of different ways to make these graphs, which we'll demonstrate in Excel and try to explain a little here in the OneNotes. Also note that with this graph, we plotted the fitted bell curve together with the actual data in a bar chart, graphing the curve on top of the percent-of-total column. So that's the general idea. Again, I highly recommend checking this stuff out in Excel.