 Statistics and Excel. Histograms and scatter plots with population data. Got data? Let's get stuck into it with statistics and Excel. You're not required to but if you have access to one note icon left hand side one note presentations 1065 histogram and scatter plots with population data tab we're also uploading transcripts to one note so you can use the immersive reader tool change the language if you so choose either read or listen to the transcripts and multiple languages using the timestamps to tie them out to the video presentations one note desktop version here continuing on with our theme of taking data making pictorial representations of that data so we can get a different angle a different view at the data hopefully that view allowing us to have a better understanding extract more meaning from the data than we otherwise would be able to do our major tools that we're going to be using will of course be our histogram that we'll be creating and we'll be creating the scatter plots remembering that a histogram will typically be useful with one data set so we're trying to know things about that data set such as the center point of the data and the spread of the data we get a sense of that with the histogram the scatter plot will typically have two sets of data now remember the histogram you could have two histograms like overlapping on each other and in one area and that's how you can you can have two sets of data on one graph but usually we're thinking about one histogram representing one set of data whereas the scatter plot is having two sets of data on an x and y axis and you're usually trying to see if there's a movement together and once we see if there's a statistical movement together then we can try to determine if there's a cause and effect relationship and which one is causing which one for example so we're going to look at this with the sketch with population data here so we have locations we've got the population total population and then we've got age less than 18 age 18 to 34 age 35 to 65 age 65 plus number of households families and income levels so this is you can imagine a census of data here and we've got this data and now we could start to plot this data so here's the information for it related to it here's the numbers related to it and then we're going to start just plotting the the items out now first i'm just going to go line by line and take a look at different histograms if we were to to make a histogram from each set of data so the first one is simply this set of data which is the population so in the population we have a histogram here now these are basically i'm using kind of the defaults of the histograms created from excel so i just select the item i let excel come up with a number of buckets here and this is the histogram that we have so these are the bucket sizes that they put them into the ranges that we have and then and then the the heights of the histogram how many fall into these different ranges now if you were to describe these types of histograms remember most people when they start thinking histograms they they we start to think that every everything is going to fall into like a bell curve because the bell curve is so it's such a widely used thing and we'll get we'll talk about bell curves they're very useful but note that that not all data sets are obviously going to fall into like a bell curve you could have any kind of of spread of data and so then you have to then we want to think about what's how can we kind of describe the data right we could we could basically say you know this one is skewed to the left or the right this one skewed to to the right meaning the tail end is to the right the fat part of of the data set that is on the left side so that's one way that we can kind of describe it we can look at the the focal point of the data set so if you were to put it on a scale where would the kind of focal point be and so it gives you a sense of where like the midpoint would be in the data set from an intuitive kind of standpoint which is another key thing that we would be looking for so in this case we've got populations around here the population sizes is of course where the most people are at and then you've got the population sizes going up a lot and you have less locations that have those large populations you would expect those to be of course the big central points and so you wouldn't think that there would be a lot of them that that's how usually the population for many places will be right there's a central place a big city uh type of place and then and then and that's going to have you know a much higher population possibly than other areas around it so then let's take a look at the next one this one i'm just taking the age now so now i'm just saying well let's just look at people that are uh less than 18 so we're going to say uh these would be basically children for the most part or not at a age of adulthood possibly so you would expect then if people are less than 18 you might have more families in that case right because well you've got more possibly kids or young people in that time so again you could try to draw conclusions as to whether there'd be more or less uh people of a younger age depending on the circumstances of a particular you know location is there some kind of boom or something with children or more families or less families or whatever you know so uh you know so so we have the the numbers here once again it's skewed to the right which means that we have this kind of tail that's happening uh towards the right side so again you got your buckets over here and then and then you have the highest point actually at the far uh left side and then you've got these kind of outliers uh towards the right side which have a a whole bunch of uh 18 year olds in it now again if you if you start to kind of think about these two together you might start to think well your population if the population goes up in general you might have you know more 18 year olds you know in general and so on if you've tried to kind of combine these things together uh let's take a look at the next one the next data is just 18 to 34 so we just took this data then and and made a histogram of that one data set and you have a you have a kind of a similar look to it and that and that it's skewed to the right and you might think of a in the general sense any age bracket as the population goes up you would think that any age bracket may go up at the same time although the the relationships could differ of course depending on you know where people where people choose to live so you have once again the this bucket being the largest number of areas and then it's going down and then with the most aged people or the most people between the age of 18 and 34 uh over here and so this one's a little interesting that this one you know this bit is lower than that last one it's kind of interesting but in any case then you've got the age 65 plus and now you've got the older uh older folk here and so again you have a similar histogram but they're not all you know the same looking right compared to the total population so we had the total population here and then 18 under 18 18 to 34 and then age uh 65 plus but you have a similar kind of trend where you have the the the lower number of people that are age 65 plus uh more of them more of those locations here and then and then more people that are 65 plus you've got a you know the tail going out to the right and then you've got the number of households so this would also be a similar trend you would think because if you had a number of populations a higher populations you would expect more households but you could start to dig into numbers in terms of you know are there single households or or family units here's families so so again as population size goes up you would expect there to be more families but then you might expect expect there not to be an exact you know difference or change between you know a city and and less populated areas pop possibly and then you've got the income number of household 100 000 plus so so this is a a little bit different of a distribution again you would expect kind of that the more people in an in an area you might be more likely to have people that make more money within it so on the lower bucket you have you know 120 to 750 so less people making over the 100k and then of course if if you move on to the bucket for more people in a particular place are making that amount of money there there's there's less locations that have more people that are making over that amount and then if you have a whole lot 2600 and 40 to 3270 that are making you know a lot of of money then there's few places that have that many within it right so I guess so you would think that population size would have some kind of influence possibly on on these statistics right when you start to compare some of them so let's say let's do some comparisons now let's do our our x and y so remember when we make the scatter plot then usually we like we want to put the independent variable on the x so what we think is going to be the independent now remember when we're making the scatter plot we're really just trying to kind of graph uh where you know you might be thinking of it as a hypothesis when you're making the scatter plot right you're trying to see if there's a relationship that is statistical are they happening towards the same time and then once you've made that relationship you can kind of then you know try to determine what is is there a causal relationship and what's the chicken and egg what who's what's the cause and the effect right so for example so this one if I say population I assume the population on the x is the independent variable and what we might assume as our hypothesis is that as high as population goes up we would think that the the the people age less than 18 in essence children many most of them a lot maybe 18 or but our would go up as well possibly right you think the total population goes up their be more uh children and you see that there's a scatter that's pretty close now if I was to take this data you know and and excel from this data it's a little bit more difficult because I didn't put I didn't take them and put them next to each other so I can select these two columns of the same chart uh uh in this case these two well they are right next to each other in this case but they won't be all the time so I could select these two uh these two items here and and that'll give us the x and y if the column on the left uh is is going to be your independent variable excel will basically create the graph kind of automatically so you would think that as the population goes up you you have more people that end up that are that are also you also have more people that are age less than 18 you have more children as populations go up you have you tend to have more uh children as well and you would think that would be uh kind of expected but not always the case this one seems to be a fairly extreme outlier you got a fairly high population and fairly and very low people under 18 now you could again now you could start to analyze those types of things those types of outliers for example say well what is going on here why why maybe it's a retirement place it's a place where people go to retire maybe there there was a you know what what happened there kind of thing and then we can make the regression line within the data so now we're going to try to draw a line that approximates the data obviously the line is not exact the more the data is around the line then the more predictive power the line has and so so that and if we can get a formula for the line then we might be able to use the formula in some cases to make some predictions about the relationship we could say well if the population is going up we would expect you know so many people age uh less than 18 right what would be the general idea but it's not a perfect relationship we're just making a guess with that kind of line we could make other lines remember what you're trying to do is is draw a line or a curve a curve hopefully that you can represent with a mathematical formula which the a line would be the easiest to do uh so that you can then you know then you have a mathematical formula for for it for the relationship now notice this one we have the population and and the income level now this one we don't seem to have the population is going up and then we've got so so if I look at the two data points we've got the total population and then number of households over I think that's 100k it is the data point that we're looking at here so now we're saying you know as the population goes up you would expect that you might have more people over that threshold but you have a much weaker uh correlation here because the dots are basically all over the place so maybe you have you know uh more you know spots of people where there's you know well a lot of well-off people in different spots or something like that so you can still notice you can still make a regression line within here but that regression line has a lot far less predictive power than if if the data looks something like this it's it's quite likely that if you use this formula to make a prediction about population size and the number of households over 100k that you that it's not going to be quite right because this formula is not exactly representative of a data the data set that we have here so then if I take a look at another one this is going to be the population size and families so now we're if I look at my data set over here we've got the population size and we're once again saying that that is the uh independent and then the number of families so if I was to do this in excel I would have to highlight this hold control and select the non-adjacent cells at the same time because population is on the left it would automatically basically create what I would want meaning the independent on the x population the families on the y so you would expect there to be somewhat of a relationship here as well where population if population goes up then you would expect there to be more families right you expect if population goes up you'd have uh more families now it might not you know you might have sometimes where population goes up and there's less families you know per person per capita or something because you know more people in cities might might be single versus other locations or something like that but you expect in general there be an increased line like notice this one for example an outlier uh you'd say you know this one's fairly decent size uh population and it's lower on the families so you might say well what what exactly is going on there why exactly might that be the case but you have a pretty good uh pretty good relationship which means that the regression line formula might give you some predictive power you could say well if there's a population of so much then we would expect maybe there's going to be so many families involved based on this line and it has fairly decent predictive power given the fact that the dots are all kind of uh groups together here now note that you could also like what if you're thinking well I don't I think you have a chicken and the egg problem things are backwards I think the families are the thing that drive the population right because the families are the ones that have the kids and whatnot so maybe you know that might that might be a wrong assumption because you know you might have population from immigration and other kind of things as well but that might be we might say hey look these this is backwards uh so you could read this you could just say well I why don't I just read this as the independent and I'd come to the kind of the same conclusion as the families go up population increases you can read it that way uh because it's just it's showing the relationship between these two there seems to be a relationship between them but uh but traditionally we put the independent on the egg so now we're basically indicating well it's the families that are so you can see these are kind of mirror mirroring each other here right you could say you could do it this way and you could say well the families are the thing that are are the driving are the independent variable and as families go up then you know the population uh then the population goes up and so you can you can graph it this way and and and then again you can have this argument over what I see there's a correlation I hear I see that they're going up at the same time then the question is is one causing the other or are they just both going up because like a third factor is causing both of them and which one is causing the other one to go up is it is it the family is causing the population or the population causing the family and you can debate that but the statistics is showing the relationship and then you can kind of draw your conclusions or make your hypotheses from that but simply the fact that there is a relationship would mean that if we're if we use these the formula we might be able to make some predictive power either way right if we know the families if we know the number of families we might be able to to get an idea of the population and if we do it this way and if we know the population we might be able to get an idea of the number of families because there's there seem to be going in the same direction with fairly decent predictive power of the formula but then you get and then you get into the question of which is causing the other one is it a third thing that's causing or you know which came first and whatnot and those are different questions right so it depends what you're trying to do with the data now notice from a logistical side here if you were to make this graph now we've got the total population and the families if you were to highlight this data in excel highlighting this column and then holding control and highlighting this column and then inserting simply just inserting the graph it's not going to come out right because it's going to by default put the one that was on the left it'll do this population and then families so then you'll have to go into this data and you'll have to click on the data and reverse the axes making and reverse it right so so the easiest way to kind of build these charts is to make is to organize your data so the x axis the independent variables on the left and the y axis the dependent is on the right but if you don't want to do that and you're pulling it from data where the data is reversed then you're going to have to go in there and do a little bit more work so that you can so that you reverse the x and y axes in excel now obviously we could we could do a lot more we could compare and contrast you know you can see all the combinations we can basically do in here we can we can make a scatter plot between any of these and we can make any of them on the x and y axis and look at the relationships between them so however many combos there are maybe some of them are more useful than others of them but but you could see we could you know plot any any two columns together and a scatter plot and see if there's see if they seem to line up together