Are you able to see the PowerPoint, the presentation? Yeah, OK. OK, so welcome to your third session of STA 1610. Today we're going to do descriptive statistics. Before I begin with today's session, let's remind ourselves what we've learned so far. We've learned that statistics is a method of analyzing and describing the data and transforming the data into useful information in order for us to make decisions. That's statistics. We also learned about the key concepts within statistics, like your population and your sample, and we're going to continue learning about those concepts. We learned that the measures we calculate from a population, we call them parameters, and today we're going to calculate those parameters. We also learned that when the population is big, we select a portion of it, which is the sample. So we take a subset of that population and we calculate some measures. We call that portion a sample because it's just a subset of the population, and the measures that we calculate from a sample, we call those statistics.
Then we also learned that there are characteristics that describe either a population or a sample, and those characteristics we call variables. Within a variable, there are values that you get, and those values we call data. So today, with descriptive statistics, we're going to use the data to calculate parameters or to calculate statistics, okay? We're going to learn about that later on. Then we also learned that there are two kinds of variables that you can get. You can get a categorical variable, which is also called a qualitative variable. Those are variables that you can put into categories. We also learned that we have numerical variables, which are also called quantitative variables, and those variables you can either measure or count.
And that is the type of variable we're going to be using today, numerical variables. We said that those variables that we can count are called quantitative discrete variables, and those variables that we measure are called quantitative continuous variables. Then we also learned that the variables themselves have levels of measurement, and that there is an order to those levels. Categorical variables have the lowest levels of measurement, because there's not a lot that we can do with variables that are categorical. And we learned that quantitative variables, which are the numerical variables, have the highest levels of measurement. For categorical variables the levels are nominal, where there's no logical or natural order, and ordinal, where the variables have a logical or natural order. Then for quantitative variables we have interval and we also have ratio. We learned that with interval there is no absolute zero; zero is just another value on the scale. And with ratio there is an absolute zero, and absolute zero means nothing, or that the thing does not exist.
So, continuing with today's session, which we might not finish completely: we're going to look at how we summarize numerical values by applying the measures of central location and the measures of variation. By the end of the session today, you should be able to describe the properties of central tendency, or central location, which tells you where your data is located. You also need to be able to describe the properties of variation, where we look at how far apart your data is from the mean, or how spread out your data is around the mean. Then we're also going to learn the properties of the shape of your data.
We did touch on the shape of the data when we looked at the histogram. So now we're going to look at when we use the measures of central location, which are the mean, the median and the mode, and when we use the measures of variation as well as the quartiles. Then we're going to learn how to find the quartile position and the quartile value. Using the quartile values, we can define your five-number summary. And with that five-number summary, which tells you the smallest value, your quartile one, your quartile two, your quartile three and your highest value, we are able to construct what we call a box plot. The reason why we didn't mention the box plot in the previous section, where we were summarizing the data, is because we're going to introduce it in this section. It's another plot that we use to summarize numerical data, apart from the histogram, the polygon, the ogive and the frequency distribution table. We also use the box plot and the scatter plot. And remember, with the scatter plot, we said we're going to deal with it when we do the logistic regression study unit.
Okay, so let's get to it. Measures of central tendency tell me where your data is located, the location of your data. We have three measures of central location: the mean, the median and the mode. The mean is the most used measure of central location. We also use it in our day-to-day lives when we talk about the average, which is the mean. So how do we calculate the mean? The mean is the sum of all the values that you have, divided by how many there are. So if I have my data as, oh, sorry, I need to enable the pen. If I have two, three, five and six, those are my data; I have four data values. To calculate the mean, what the mean means is the sum of all of them divided by how many there are. So I must add all of them, and then count them: one, two, three, four, so I divide by four.
So, to calculate the mean, I will say two plus three plus five plus six equals 16. The sum of all the values is 16, divided by how many there are. There are four, so my mean will be equal to four. And that's how you find the mean. The mean is affected by extreme outliers. Let's take the exercise that I just did. Say in my company I have four employees: one employee earns 2,000 rand, another earns 3,000 rand, another earns 5,000 rand and the other one earns 6,000 rand. When I calculate the mean, I find that it is now 4,000 rand. That tells me that on average in my company, employees are getting paid 4,000 rand, which is more or less right, because we've got two employees at the high end and two employees at the low end. But on average, we pay them 4,000 rand. So everything is normal. What happens if I have an employee that I pay 20,000 rand in addition to those four employees? Now I no longer have four, I have five employees, and the total will be 16 plus 20, which gives me 36 thousand, divided by five, which tells me that on average in this company of mine I pay my employees 7,200 rand. 20,000 is my outlier, and it affects the average of the entire company. So I will be telling lies if I use that and say that on average this company is paying its employees well, when we know for sure that only one employee gets paid far more than the others. That extreme outlier skews the data totally. And that is why in statistics we ignore the extreme outliers: they skew the data and we don't see the picture correctly. And that is the mean. We spoke about the measures that we calculate from a sample. If this was a sample, then it means I've just calculated a statistic.
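The salary example above can be checked with a minimal sketch. It assumes Python and its standard `statistics` module, neither of which is used in the session itself; the variable names are illustrative only.

```python
from statistics import mean

# Salaries (in rand) of the four employees from the example.
salaries = [2000, 3000, 5000, 6000]
mean_before = mean(salaries)              # 16000 / 4

# Add the fifth employee earning 20,000 rand, the extreme outlier.
salaries_with_outlier = salaries + [20000]
mean_after = mean(salaries_with_outlier)  # 36000 / 5

print(mean_before)  # 4000
print(mean_after)   # 7200
```

A single extreme value moves the mean from 4,000 to 7,200, even though four of the five employees earn 6,000 or less.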
And when we calculate the sample mean, we use the formula for our sample statistic, X bar. X bar is our statistic: it is the sum of your observations, from i starting from one up until n, divided by n. This we call a statistic, just a statistic, because it's only one measure. And to calculate the population mean, which is the parameter, we use mu. You need to take this into consideration, because sometimes in the exam I might ask you what the sample statistic is and give you all these symbols, like the X bar. For the sample, we're going to use the normal letters that we understand. And for the population, because it's big, we're going to use the Greek letters. So for this parameter we use mu, which is the sum of your observations divided by capital N. You can see that these two formulas look almost exactly the same, because the mean is the average, which is the sum of the values divided by how many there are. The only difference is that with the sample we divide by small n, and with the population we divide by capital N, but it means one and the same thing. If you use your scientific calculator to calculate the mean, the mean for the population and the mean for the sample will always be the same. We use the X bar on your calculator, and later on, when we do the calculations on the calculator, I will show you, but I don't think it will be today. Hopefully it will be Saturday; we will look at that. Okay, I'm not going to do another example of the mean because we already did that: the mean is the sum of all values divided by how many there are. The next measure of central location is the median. The median is the middle value.
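Before going further with the median, the two mean formulas described above can be written out as follows. The index $i$ and the symbol $x_i$ for the $i$-th observation are the usual textbook notation and are assumed here, since the slide itself is not part of the transcript.

```latex
% Sample mean (a statistic): n observations in the sample
\bar{x} = \frac{\sum_{i=1}^{n} x_i}{n}

% Population mean (a parameter): N observations in the population
\mu = \frac{\sum_{i=1}^{N} x_i}{N}
```

For the four-value example, $\bar{x} = (2 + 3 + 5 + 6)/4 = 16/4 = 4$.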
So when the data has got a lot of outliers, in the practical use of statistics we don't discard the mean altogether, but instead of using the mean, we use the median, because the median tells me the middle value regardless of whether I have outliers or not. When you calculate the median, you always need to sort your data from lowest to highest. When you calculate the mean, you don't have to sort your data, you just calculate the mean. But for the median, your data needs to be ordered; it needs to be sorted in ascending order. So you need to start from the smallest value and go to the highest value. And your median is easy to find if you have an odd number of values, because the median will be the middle value. Let's say I have two, three, eight, 12 and 14. My data is already sorted from lowest to highest. To find the median, because it's the middle number, I can use a process of elimination, counting in from each side. When I eliminate one on this side and one on that side, and then one and one again, the number left in the middle is eight. So my middle number is eight. So when we have an odd number of values, like five, seven or 13, it's easy to find the middle value. The challenge comes when you have an even number of values. When we have an even number of values, we need to take the average of the two values where the position... Yes? Yes? Did anyone ask a question?
You seem to be breaking up. I don't know if it's just me. Can you hear me? No. Can you hear me now? Yes, I can hear you now. No, I'm just saying, I don't know if it's just me, but you seem to be breaking up a bit. So I'm not too sure if it's just my connection or if it's you in general. No, I think it's my connection. Yeah, I'm fine. Mine's very clear on my side.
Okay, then it's the connection from your side. Okay, so when we have an even number of values, we take the average of the two values where the position is located. We can use the position to find the median, and the position that we use is n plus one divided by two. You can use this even when you have an odd number of values. Let's say you have 27 values. You can find the middle value by saying 27 plus one divided by two. That will be 28 divided by two, which gives you 14. Then you can count one, two, three from the beginning, because your data will be sorted, starting from the smallest value. When you get to number 14, you stop; that is your middle value. So it's very advisable to use the position to find your median. Okay, let's look at another example. Let's say I have two, three, seven, eight, 10, 14, 17, 18. One, two, three, four, five, six, seven, eight: I have eight values, which is an even number. Since it's even, I'm just going to calculate the position. My position is eight plus one divided by two, which gives me nine divided by two, which gives me 4.5. So I'm going to count one, two, three, four, and 4.5 is somewhere there, between eight and 10. Since my middle position falls between two values, I need to take the average of the two values. So I must take eight plus 10 divided by two, which gives me 18 divided by two, which gives me nine. So my middle value, or my median, will be equal to nine. And that is the median. The median is not sensitive to, or does not get affected by, the outliers or extreme values. So you can use the median when you have outliers. The other measure of central tendency is the mode.
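The median-by-position procedure worked through above can be sketched in Python. This is an illustration under stated assumptions, not the method shown on the slides: the function name `median_by_position` is hypothetical, and it simply applies the (n + 1) / 2 rule to sorted data.

```python
def median_by_position(values):
    """Return (position, median) using the (n + 1) / 2 position rule."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)           # the data must be in ascending order
    n = len(ordered)
    position = (n + 1) / 2             # 1-based position of the median
    lower = ordered[int(position) - 1] # value at the whole-number part of the position
    if position.is_integer():
        # Odd number of values: the position lands exactly on one value.
        return position, lower
    # Even number of values: average the two values around the position.
    upper = ordered[int(position)]
    return position, (lower + upper) / 2


print(median_by_position([2, 3, 8, 12, 14]))              # (3.0, 8)
print(median_by_position([2, 3, 7, 8, 10, 14, 17, 18]))   # (4.5, 9.0)
```

Both calls reproduce the two worked examples: position 3 gives the value eight, and position 4.5 falls between eight and 10, giving nine.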
The mode is the most frequently appearing number, the number that appears more often than the other numbers in your data set. The mode is also not affected by outliers, because we're not looking at the highest values or the smallest values; we're looking at the number that is repeating. What do I mean by the number that is repeating? If I have one, two, three, four, four and five, four appears more than the other numbers. And that is the mode. And that's what I mean by... Please remember to mute yourselves. When we did the measures of summarizing the data in terms of charts, we looked at the bar chart and we looked at the histogram and all that. And I said that with categorical data as well you are able to find the mode, because if you look at the bar chart, the mode will be the category that has the most values. And when we look at the histogram, the modal class will be the one whose bar is higher than the rest of the other classes. And that's how you find the mode. In terms of numerical values, if you do not put them into a histogram and you're looking at the pure data, sorted or not, the mode is the number that appears more than the other numbers. Okay. So with the mode, you can have no mode. If I have my data as one, two, three, four, no number appears more than the others. All the numbers appear once, so for this data set we say there is no mode. Sometimes a data set can have several modes. If a data set has two data points that repeat, then we call this data set a bimodal data set. Oh, sorry, not this one. Sorry, my bad. Let's go back. We can have several modes or we can have no mode.
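The different cases for the mode can be sketched in Python. The function name `find_modes` is hypothetical, and the rule that a data set has no mode when every value appears equally often follows the convention used in this session; other textbooks and libraries may treat that case differently.

```python
from collections import Counter


def find_modes(values):
    """Return the sorted list of modes; an empty list means 'no mode'."""
    counts = Counter(values)
    if not counts:
        return []
    highest = max(counts.values())
    # Session convention: if several distinct values all appear equally
    # often, no value appears more than the others, so there is no mode.
    if len(counts) > 1 and all(c == highest for c in counts.values()):
        return []
    return sorted(v for v, c in counts.items() if c == highest)


print(find_modes([1, 2, 3, 4]))              # []        no mode
print(find_modes([1, 2, 3, 4, 4, 5]))        # [4]       one mode
print(find_modes([1, 1, 2, 3, 4, 4]))        # [1, 4]    bimodal
print(find_modes([6, 6, 1, 1, 2, 3, 4, 4]))  # [1, 4, 6] multimodal
```

The length of the returned list tells you which case you are in: zero for no mode, one for a single mode, two for bimodal, and more than two for multimodal.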
So now this one has a mode. We can say this data set has a mode of four. Sometimes you can have two values that appear more than the others. So here I have one and one, and I have four and four. This data set is now bimodal, because my modes are one and four. If the data set has more than two modes, like in this instance, if I add, oh, my six looks the other way around, six and six, so now I have six, six, one, one, two, three, four and four, it means I have more than two modes. I have what we call a multimodal data set, because my modes for this data set are six, one and four. So you need to know that you can have no mode, you can have one mode when there is one number, you can have a bimodal data set when there are two modes, and you can have a multimodal data set when there are many numbers that appear more than the others. So what happens if I have two, two, three and three? There are no modes; all of them appear twice. There is no number that appears more than the others. It's almost like they all appear once, one, one, one, one; here it's two, two, two, two. So in this instance, there is no mode. But if my data set looks like one, one, two, three, three, four, four, five and five, then my modes in this data set are one, three, four and five, because all of them appear more often than two, which only appears once. So I have a multimodal data set. And that is the mode.
And with that, here is your exercise. You have five minutes. Those who are able to type in the chat, you can type your answer for the mean, and you can type your answer for the position; remember to find the position. Let's start with the mean. This is your sample data for the final results for this module. The students scored 25, 51, 27, 48, 35, 41, 51, 57, 62 and 31.
You need to calculate the mean, which is the sum of your observations divided by n. That is your mean. These are your x observations, and your n is how many there are. Find the median position by using n plus one divided by two. Then you can go and find the value. But before you find the value, you need to sort your data in ascending order. You also need to find the mode, which is the most frequent number. When you have your answers, you can post them in the chat, and after five minutes I will see how far you are and then we will discuss the answers. Remember, if there is an answer that you agree with, there are emojis. You can use the emojis. In the meantime, Ian, please double check your answer. There are 10 observations. Am I already wrong? I'm just giving you a hint. There are 10 observations. So far, I see multiple answers. You don't have to post your own if somebody has already posted something that you agree with, so that I can see the differences. Okay. There are different answers. That's good. It shows me that you guys are working in the background and not just listening to me talk. Okay, so let's answer the question. Let's help those who didn't find their answers, because there are no options to compare with. I'll start with the mean. You did the calculations, so you're going to give me the answers. The mean is the sum of all of them, so if you add 25 plus 51 plus all of them, until 31, you get 428. Oh, I shouldn't even have said it. And how many are there? 10. There are 10. So 428 divided by 10, please? 42.8. 42.8. When you are still practicing and going through the exercises and understanding the concepts, don't be scared to redo the work.
Just double check that you didn't miss anything. You see, during the exercises here I'm going to give you limited data, so that we can go through the content as quickly as possible. But in the assignment you will be given a lot more data, like maybe 26 values. So it means you have to spend time calculating and adding and adding. When you get to the answer, even before you look at the options to see if your answer is there, calculate again. There's no harm, because it's still an assignment and you're still learning. Only in the exam might you want to move quickly because of time. Time moves quicker when you are writing an online exam as well. So for now, just to get familiar with how you do the calculation, repeat your calculations to double check your work. Okay, done talking about that.
Now let's move on to the median. With the median, we need to find the median position. Oh, sorry, I want to use blue. So let's find the median position, which is n plus one divided by two. We know that our n is 10, so it's 10 plus one divided by two. 11 divided by two, what is our median position? 5.5. It will be 5.5. You need to sort your data from lowest to highest. Mine is not sorted, so let's quickly sort it, because you already have the data. The sorted values are 25, 27, 31, 35, 41, 48, 51, 51, 57 and 62. One, two, three, four, five, six, seven, eight, nine, 10, so we have 10 of them. We know that our position is 5.5, so we count one, two, three, four, five, and 5.5 will be located between 41 and 48. So we need to say 41 plus 48 divided by two. 41 plus 48 is 89, and 89 divided by two is 44.5. That is our median. The mode is the number that appears more than the other numbers. Our mode for this data set is 51. And I also had those answers for you. Okay. And that is measures of central tendency, which tells us the location.
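The exercise answers can be double checked with a short sketch, using the ten scores as sorted in the walkthrough above. It assumes Python 3.8 or later and the standard `statistics` module, which the session itself does not use; the variable names are illustrative.

```python
from statistics import mean, median, multimode

# The ten exam scores, in the sorted order used in the walkthrough.
scores = [25, 27, 31, 35, 41, 48, 51, 51, 57, 62]

n = len(scores)                  # how many observations there are
total = sum(scores)              # sum of all observations
sample_mean = mean(scores)       # total / n
median_position = (n + 1) / 2    # 1-based position of the median
sample_median = median(scores)   # average of the 5th and 6th values
modes = multimode(scores)        # most frequent value(s)

print(n, total)          # 10 428
print(sample_mean)       # 42.8
print(median_position)   # 5.5
print(sample_median)     # 44.5
print(modes)             # [51]
```

Redoing the calculation a second way, as recommended above, is a quick check that no value was skipped while adding.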
Now I'm going to show you that, from the measures of central location, you can also tell the distribution of the data that you're working with by looking at the mean and the median. Sometimes we also use the mode, but for now we're going to use the mean and the median. If the value of your mean is less than your median, then your data is left skewed. If the value of your mean is equal to your median, then we say it's symmetric. If the value of your median is less than the mean, or the mean is greater than the median, then we say it is right skewed. Left skewed is also called negatively skewed, and right skewed is positively skewed. So this one we can say is negatively skewed, because the tail goes to the left, and this one is positively skewed, because the tail goes to the right. So if I look at the answers that we got, what type of distribution is this? Anyone, from the data that we just calculated, what type of distribution is this? The mean is less than the median, so it is left skewed. It will be left skewed because left skewed means the mean is smaller than your median. Okay. Now let's move on to measures of variation, which tell us how dispersed your data is, how far apart your data is from the mean, which is your measure of central location. It gives you the variability of your data set that you have, that it means data is gathered