Hello everyone. Today we're going to be talking about descriptive statistics. Descriptive statistics is a numerical and graphical way to describe and display your data. Once we've collected the information we're interested in, we have all of these data points, and we want to make sense of them in some way. We want to describe the data that we have without actually doing in-depth analysis of that data. The easiest way to describe the data is to show it in some sort of graph that brings out some feature of the data, and we can pull out quite a bit of information this way. Descriptive statistics is used for many different things, and I'll give you some examples of that.

One of the easiest forms of descriptive statistics is the stem-and-leaf graph. This is a very good choice whenever the datasets are small, and it can give you insights into the data that you have. You can also do some comparative studies using stem-and-leaf graphs. The leaf consists of the final significant digit, and here is an example.
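As a rough illustration of how stems and leaves are formed, here is a minimal Python sketch. It uses only the first six scores read out in this talk (33, 42, 49, 49, 53, 55); the rest of the slide's data is not in the transcript. The function names `stem_and_leaf` and `format_stem_and_leaf` are hypothetical, chosen just for this example, and the sketch assumes non-negative whole-number scores.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Split each whole number into a stem (tens) and a leaf (final digit)."""
    groups = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)  # e.g. 49 -> stem 4, leaf 9
        groups[stem].append(leaf)
    return dict(groups)


def format_stem_and_leaf(values):
    """Return the display as text, one row per stem, leaves in ascending order."""
    groups = stem_and_leaf(values)
    rows = []
    # Walk every stem between the smallest and largest so gaps stay visible.
    for stem in range(min(groups), max(groups) + 1):
        leaves = " ".join(str(leaf) for leaf in groups.get(stem, []))
        rows.append(f"{stem} | {leaves}".rstrip())
    return "\n".join(rows)


# Only the six scores quoted in the talk; not the full dataset from the slide.
scores = [33, 42, 49, 49, 53, 55]
print(format_stem_and_leaf(scores))
```

Each printed row is one stem, so the length of a row already hints at how many scores fall in that range of ten.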
So in this case we have scores, probably on an exam or something like that. The stem is the most significant digit, so the stems here, 3, 4, 5, actually represent 30, 40, 50. Then we have a leaf, and that is the least significant digit. Taking the stem and leaf together, the first sample is 33, the next is 42, then 49, 49, 53, 55, and so on.

So what does this stem-and-leaf graph actually tell us? How does it help us? Well, think about if we had just written all of these numbers down in a row, or put them in a table with no particular ordering. It would be very difficult to pull out patterns in the data, and statistics is basically all about patterns, right? We want to identify patterns within our data. With the stem-and-leaf graph we can already see some. One pattern that I see is that a large number of people actually did very well on the exam: they got 90 or above. Normally people are around the 60 or 70 range, but here it seems like a lot of people got essentially A's, then quite a few people got B's, fewer people got C's, a lot of people got, I don't know, D's, and then a few people got lower than that. So I already know that the distribution is not exactly normal. We'll talk about normal distributions later, but what I want you to see here is that if we just had all of those numbers written out, we wouldn't be able to pull out any of this information. Even without looking at each sample or the value of each variable, I can already see that more people than average are doing well on the exam, that we have something like a normal distribution in the middle, and that we have a lot more people getting around 40 for some
reason, so it's not a normal curve for grades. I can tell all of that without actually looking at the individual data points. As soon as I see them all together and ordered, I can see a pattern in the data. So that's a good way to get a summary of the data that you have, and once you have a summary, you can start to ask questions: why are so many people getting 90 or above, or why are people getting 40 to 49? What's happening here?

Another example is stem-and-leaf graphs side by side. In this case we've measured ages: we have the ages at inauguration for American presidents and the ages at death for American presidents. The unlabeled column is again our most significant digit, so the 4, 5, 6, 7 and so on would be 40, 50, 60, 70, 80, 90 years old. What we can see is that the majority of presidents were inaugurated in their 50s, actually a very large number, and a very large number of presidents died in their 60s and 70s. So we can already start to see some patterns here. One thing that I notice is that the ages at death are, on average, lower than the national average; I believe the national average in the US is around 80 years old. So why are presidents dying so much earlier, at a disproportionate rate? Then you can start to make hypotheses, like: being a president carries a lot of stress, and because they're already in their 50s, that stress reduces their lifespan. We can start to ask questions about why these patterns exist. So stem-and-leaf graphs side by side are a very easy way to compare two datasets. In this case we're measuring the same thing, so we can make comparisons between two groups or, let's say, two attributes of the same person.

Okay, next is a line graph, and I'm sure you've seen line graphs before.
In line graphs, the x-axis, or horizontal axis, consists of data values, and the y-axis, the vertical axis, consists of frequency points. So the x-axis has some value and the y-axis has some sort of frequency point, usually some sort of measurement. We'll talk more about this in a second.

Bar graphs consist of bars that are separated from each other. They're very useful for showing different groups or different measurements; you can use them to measure almost anything, really. Here we have age ranges, 13 to 25, 26 to 44, and 45 to 64, and then we have a proportion. We're not really sure what this proportion means because we don't have the key, but let's say it's the proportion of people that we sampled in a survey. In that case, 13 to 25 is 45 percent of the people we sampled, 26 to 44 is a little bit above 35 percent, and 45 to 64 is a little bit below 20 percent. Imagine that this could be hundreds of samples or hundreds of data points, but using a bar graph I can very clearly see that the 13-to-25-year-olds are a disproportionate share of this sample. They're probably overrepresented, depending on what the survey is about, and older people, 45 to 64, are not equally represented. So this can tell us something about whether we sampled correctly, who we are actually surveying, and whether we expected to find certain information about certain groups. So again, bar graphs are a very simple, very quick way to summarize the data that you have and to say something about that data.

Histograms are very similar to bar graphs but tell us something a little bit different. They consist of contiguous, adjoining boxes and have both a horizontal axis and a vertical axis. Like before, the horizontal axis is labeled with what the data
represents, for example distance from your home to school, and the vertical axis is labeled with either frequency or relative frequency. So whenever we're talking about histograms, we are always talking about frequency or relative frequency. To give you an idea of what relative frequency is: f is the frequency of the data value that we're measuring, and n is the total number of data values, or the sum of the individual frequencies. We can then calculate relative frequency as the frequency divided by the total number of data values, so relative frequency equals f divided by n.

On the vertical axis we place frequencies and label this axis as frequency, and on the horizontal axis we place the lower value of each interval. We draw a bar extending from the lower value of each interval to the lower value of the next interval, so we're essentially connecting lower values to each other, and we get something like this. In the bar graph there were spaces in between, and the bars were specific measurements of some specific thing. Here we are counting the frequencies between two points. In this case it's the number of books, let's say the number of books that students read per semester, and the frequency is the number of samples that we took. So we can say that about 11 people read between 0.5 and 1.5 books last semester, about 10 people, or 10 samples, read 1.5 to 2.5 books, and about 16 samples read 2.5 to 3.5 books. Now, this doesn't show us the exact measurement. Between 2.5 and 3.5, maybe the majority of people, or even everyone in that range, was reading 2.5 books last semester, but because we have this range, all we're saying is that they fall within 2.5 to 3.5. So it doesn't really tell us about specific values; it just tells us about groups of values, or rather the
frequencies of groups. What this data immediately summarizes is that, as we would expect, some people read more books than others. In this case the middle group, between 2.5 and 3.5, is where the largest number of people were, but we also see a lot of people not really reading very much. So again, if we just lined all of the data points up, which could be quite a few data points, it would be very difficult to pull out any type of pattern. But because we're showing it as a histogram, I can very quickly say that, first off, very few people read more than 5.5 books per semester, only two in our sample, and quite a few people, or quite a few samples, read more or less one book per semester. So maybe we want to ask how we can make people read more, or we can use this data to plan our strategy: this is our starting point, and we want more people to read up to three books per semester. Then we can measure again next year and see if this distribution, this histogram, has actually changed.

Okay, so now I want you to try to use this data to make a histogram. We have all of this data in a chart, and this is how we normally have our data: the number of hours my classmates spent playing video games on weekends. The data as it is right now doesn't tell me too much; it's very difficult to pull any type of pattern out of it easily. So what I want you to do now is pause the video, make a histogram of this data, and see if you can pull out any interesting information. Remember, we have frequencies, and you can define your own ranges, so it could be one hour, five hours, up to 10 hours, or whatever you want. Just try to make a histogram from this.

Okay, so now that I hope you've
attempted to make a histogram, here's an example of the histogram in five-hour intervals. So what does this histogram tell us? It tells us, first off, that people play a lot of video games, and that the majority of people spend a lot of time, basically above 15 hours, on video games on the weekends. We don't necessarily know where the sample came from, but we can see that, at least in this sample, people play a lot of video games on the weekends, and very few of the people sampled don't play video games at all. So we went from, let me go back, this table, where it's actually very difficult to pull out any type of pattern, to a very clear pattern based on five-hour intervals: the number of people who play, in five-hour increments. So it's a very quick, very easy way to find patterns or gain information about the data that you have.

Okay, so next is frequency polygons. They are very much like line graphs, but they make frequency easier to interpret. We first examine the data and decide on the number of intervals, or class intervals, to use on the x-axis and y-axis. We begin plotting the data points, and after all the points are plotted, we draw line segments to connect them. So this is very much like a normal line graph, except we are specifically focusing on frequency, just like histograms. The difference is that histograms draw a bar from the lower bound of each segment, whereas frequency polygons just plot a point at each segment. In this case, for 44.5 we had zero in this sample. The next point is 54.5, and these are scores, so we have a 10-point difference. At 54.5 we have a frequency of five samples; it could be the scores for five students, something like that. So in the histogram we're actually making a bar chart and connecting the lower bounds of each group, while for frequency polygons we are just plotting a single point for each group and then
connecting them with line segments. This allows us to see how the frequency changes across the distribution, and it can be used for a couple of different things. Personally, I tend to use histograms much more than frequency polygons, but just know that frequency polygons exist and are quite useful for certain types of data.

Okay, so now we have two different charts, and imagine that we want to compare them. For both sets we have the lower bound and the upper bound, the frequency, and the cumulative frequency: one is the frequency distribution for calculus final test scores, and the other is the frequency distribution for calculus final grades. We could compare them with histograms, but that doesn't really make sense here; there are better ways, and in this case frequency polygons are a better choice. Imagine that the light blue is the final test grade and the darker bluish purple is the final grade. I might say that people actually did worse on the final test than on their overall final grade. That completely makes sense, because you have things like homework: if they did very well on that, they'll get a better overall grade, whereas if they did worse on the exam, they just have that single exam score. So I can see that for the final test grade we had a lower number of A's, a higher number of what looks like B's or C's, and a higher number of, well, it actually looks like D's or F's. I can compare those two datasets very quickly with this and see the difference between the test grade and the final grade quite easily, whereas if I was just looking at these charts, it would be very difficult to compare them. I mean, if I look at the frequency, I have basically 5 versus 10, 10 versus 10, 30 versus 30, 40 versus 45, and 5 versus 15. Now, I know that there's a
difference, but whenever we see this visually, whenever we describe it in some sort of chart or graph using descriptive statistics, I can say something very quickly about the patterns that emerge.

And finally, I believe this is the last one, time series graphs are very good for showing change over time. We use them for a lot of different things: for example, the number of people with a certain type of cancer over the years, or the amount of crime that happens over a certain time period. Overall, globally, crime rates are actually falling year by year. So what does this actually tell us? Well, it can tell us about patterns, and it can tell us about trends. In this case we have the annual consumer price index per year. We want to see if something is increasing, if something is decreasing, or if there were any big bumps in the middle. What do those bumps tell us? What can we learn from them? Time series graphs by themselves can be useful, depending on what you're measuring or what you want to know, but combining them with other types of data might also be interesting. For example, here the annual consumer price index is increasing, and maybe we want to compare that with the number of sales from our coffee shop. If the price index is increasing, how is that affecting our sales? We could measure our overall sales from 2003 to 2012, see how they compare to the annual consumer price index, and maybe see if there's some relation there. Just comparing the graphs by themselves does not tell us if there's a relation, but it's a good place to start to understand whether there might be.

Okay, so that's it for descriptive statistics. Basically, it's just trying to get your data organized in a way that tells us something about the data itself. We're not doing in-depth analysis, and we're not trying to infer any information. We just want to know directly what the data says, and we try to put that in terms that people can
use to understand patterns in the data very quickly. Now, these patterns by themselves may be interesting for some types of decision-making, but basically they're very superficial. So what I tend to use descriptive statistics for is, first off, to describe the data that I have: what does this data say about the people that I questioned? Then I use the data to ask deeper questions, which we have to use inferential statistics to actually answer. So I tend to generate some hypotheses with descriptive statistics and then use other methods to go deeper into the data. But descriptive statistics is a very easy, very quick way to gain some more information about the study that you've done. So that's it for today. Thank you very much.
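For anyone who wants to try the histogram exercise from this talk on a computer, the interval counting and the relative frequency calculation (frequency divided by the total number of data values) can be sketched in Python as follows. This is only a sketch under stated assumptions: the `hours` list is hypothetical data made up for illustration, not the table shown in the video, and the function names `bin_counts` and `relative_frequencies` are also hypothetical.

```python
def bin_counts(values, start, width, n_bins):
    """Count how many values fall in each interval.

    Interval i runs from its lower value, start + i * width, up to (but not
    including) the lower value of the next interval. Values outside the
    covered range are ignored.
    """
    counts = [0] * n_bins
    for value in values:
        index = int((value - start) // width)
        if 0 <= index < n_bins:
            counts[index] += 1
    return counts


def relative_frequencies(frequencies):
    """Divide each frequency f by n, the sum of all the frequencies."""
    total = sum(frequencies)
    return [f / total for f in frequencies]


# Hypothetical hours spent on video games per weekend (NOT the data from the video).
hours = [0, 3, 7, 12, 12, 16, 18, 21, 22, 24]

# Five-hour intervals: 0-5, 5-10, 10-15, 15-20, 20-25.
counts = bin_counts(hours, start=0, width=5, n_bins=5)
shares = relative_frequencies(counts)

for i, (count, share) in enumerate(zip(counts, shares)):
    lower = i * 5
    print(f"{lower:>2} to {lower + 5:<2} hours: frequency {count}, relative frequency {share:.2f}")
```

Changing `width` and `n_bins` is the same choice as defining your own ranges in the exercise, and the relative frequencies always add up to 1, give or take floating-point rounding.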