today's session. Today's session runs from 12 o'clock until 2 o'clock, and I should just welcome everybody to your second session. Today we are covering study unit 3, which is your descriptive analysis. It builds on what we did yesterday.

So, to recap on yesterday: we covered the basic concepts of statistics. We looked at what we mean by statistics and at the branches of statistics, which are descriptive and inferential statistics. Today we are doing descriptive statistics. We also looked at the terms we will be using: the population, where the measures that come from the population are parameters; and, if the population is too big, we select a sample, and the measures that come from a sample are your statistics.

Then we said that when we collect data and ask questions, we look at variables. What are variables? Variables are just characteristics that describe the population or the sample, and within a variable we get data, which is what we use: a measure or a value that corresponds to your variable. We also said that variables can be qualitative or quantitative. If they are qualitative we can put them into categories, so they are categorical variables. They can also be numerical variables, which we call quantitative variables, and those can be either discrete or continuous. If they are discrete, it's something that we count. If they are continuous, it's something that we measure. That is what we're going to use for the analysis today: discrete and continuous quantitative variables. Then we also went on and looked at visualization.
I'm not going to go deeper into the visualization, because I just wanted to remind you about the variables we use and the types of variables, to introduce today's session.

So today, by the end of the two-hour session, we're going to use your calculator. We're going to start with the manual calculations as well. Then I'm going to bring up some calculations where I show you the easy way of doing the stats calculations using your scientific calculator, which makes life easier. It saves you a whole lot of time, and you will see during the session what I mean by that. But you need to understand the basics, because sometimes they don't give you a calculation; they ask you content. They ask you questions about how you do things: how do you calculate them? You cannot rely only on your calculator. You also need to understand your content.

Okay, so by the end of the session you should be able to describe the properties of the measures of central tendency, which are your measures of central location. We will look at variation: how variable, or how far apart, your data is from your mean. And we will look at the distribution of your data. Remember, we covered this yesterday as a snapshot when we looked at the histogram: we looked at whether the distribution is symmetric, left skewed or right skewed. Today we're also going to cover how, when you use measures of central tendency or measures of variation, you check whether your data is skewed or symmetric. We're not going to do the box plot or construct a box plot; that will be on Friday next week. So for today we're only going to look at the measures of central tendency, the measures of variation, and the distribution of your data.
And to start off with, when we look at the measures of central tendency, there are three measures of central tendency. What do measures of central tendency measure? Like I said, it is the measure of location: it tells you where your data is located. What is the average? Which numbers repeat more than the others? What is the middle number? All those things get covered under the measures of central tendency, and we're only going to cover three of them.

The first one is the mean, which is the most commonly used measure of central tendency. It takes all the values that you have, adds them, and divides them by how many they are. That is the mean. So if I have one, two, three, four, five, I will add one, two, three, four, five and then divide by how many they are. There are only five values, so I will just divide by five. That is what the mean is about.

The mean is affected by extreme outliers. What do I mean by that? For example, let's say I work in HR, and in HR we are giving people salaries. Let's say one person earns 10 rand, another one earns 20, another one earns 30, and another one earns 100 rand. If I calculate the mean, which gives me the average value of the salaries, I have to add all of them; remember, it is the sum of all of them. So I will say 10 plus 20 plus 30 plus 100, and that gives me 160. Then I divide by one, two, three, four, so divide by four, which is what I am saying there: the sum of all values divided by the number of values, and the number of values is how many they are. So it is 160 divided by four, which equals 40 rand. So when I do the analysis and have to report back to the executive, and I tell
them, "Oh, in this company we are paying our employees good money; on average we pay them 40 rand," the data tells us otherwise: three of the four people earn 30 rand or less. That is why we say the mean is affected by extreme outliers: it doesn't give the correct picture because of this outlier. This is what we call an outlier or an extreme value: a value that drags everything out of proportion.

Going forward, we can calculate the mean. Remember, we are able to calculate values from a population and from a sample. To calculate the mean using the sample information, we use the formula for x bar, which is the sample mean: the sum of all the observed values in the sample, divided by the sample size. So if we select a sample and analyze the test score or the exam score, say the average exam score of students who registered in 15, in 1610, and who only study in the Western Cape, then it's just summing the scores of all the students in the Western Cape region and dividing by the number of students who are registered for STA in the Western Cape region. That is the mean.

For the population it is the same thing. We use mu, which is the population parameter. Remember, this is a statistic and this is a parameter, so let's go back to that. X bar is what we call a statistic, because it's a measure that comes from a sample, and when the measure comes from the population we call it a parameter. Mu is the sum of all observations in the population divided by the population size. You can see that the two formulas look exactly the same. The only difference is that for the sample size we use a small letter n and for the population size we use a capital letter N. But in a nutshell the mean calculation is the same: the sum of all the observations divided by how many there are.

I did the example
using the extreme value, the outlier. So if I have the values 11, 12, 13, 14, 15 and I want to know the mean, which is the average, then if these are my salaries, it means on average we pay our staff 13 rand. Or if this is in thousands of rand it's 13,000, and if it's in ten thousands then it will be 130,000 rand. To get that 13, we just add all of them and divide by how many they are. There are only five observations, so it's 65 divided by five, and that's 13. That's how I calculated the average, the mean.

The next measure of central tendency is the median. The median is the middle value. Don't confuse the mean and the median. Sometimes, when your data has a lot of outliers, we prefer to use the median as the average, because it tells you the middle value of your data set. So the median is your middle number, and it's not affected by extreme outliers.

However... actually, let me not confuse you. Let's say, for example, we have one, two, four, six, seven. When we work with the median, our data needs to be sorted from lowest to highest. When we work with the mean, it doesn't matter how your data looks, because we don't care about the order; but when you work with the median, the order is important. If I have a small data set like this, it's easy to identify my middle number, because I can see that if I move in from the left and also from the right, I end up at four as my middle number.

What happens if I have a huge data set? Let's say we have 20 values. It's going to happen, even in the exam: they might give you 15 or 20 values for which you need to calculate the median. To do that you need to use a position locator, or position number, and we use this formula, n plus one divided by two, to find
the position of your median. Now, when you use this position locator, or position number, for the median, you will end up with one of two scenarios. Here we have one, two, three, four, five values, so it's an odd number. If your count is an odd number like this, then the middle value will be your median; it's easy to identify. But if you have an even number of values, for example six values, then we calculate the median by taking the average of the values in the two middle positions.

So what do I mean by that? Let's use the same example, but now there are six values. So n equals six, because they are one, two, three, four, five, six, and the position is six plus one, divided by two. Now my data is also not sorted any more, so I have to come back and re-sort it: I have a six and I have a seven, so that it makes sense, because your data needs to be sorted from lowest to highest. When we calculate the position we get six plus one equals seven, divided by two equals 3.5. And 3.5 means I count from one up to 3.5, which is in the middle of four and six. When the position is 3.5, I need to take the average of these two values. So I will say four plus six, divided by two, which gives me 10 divided by two, which equals five, and that is my median. So, going back: 3.5 is my position and five is my median. You use the median position to find the median value.

Let's change it again. Let's say we add eight as a number. We still have to use the median position to find the median. Now there are one, two, three, four, five, six, seven values, so it is seven plus one, divided by two, which is eight divided by two, which equals four. Now I need to go and find my median value at position four: one, two, three, four. Therefore my median value is six. And that's where the two
points come in. If the count is odd, then the middle value is the median. If the count is odd, oh sorry, even, then we take the average of the two middle values. That is the median and how we find it. Any question before we move to the mode? No questions?

Okay. So like I said, the median is not affected by extreme outliers. Now, the other measure of central tendency is what we call the mode. The mode is the most frequently appearing number: not the biggest number, but the number that appears more often than the other numbers. It's also not affected by extreme outliers, because we are only interested in the number that appears more often than the rest of the values. With categorical data we can also use the mode, because the mode of categorical data just tells us which category has the highest frequency, count or percentage. That is where you can use the mode for categorical data: you don't calculate anything with the categories, but you can use the mode to describe them.

Now, when it comes to the mode, you must understand the following: there can be no mode, there can be one mode, or there can be several modes. What do I mean by that? If I have two, three, four, five, six as my data values, I can see that there is no number that appears more than the others, so in this instance there is no mode. What if I have two, three, three, four, five? Here three appears more than the other numbers, so we have a mode, and the mode is three. We can also have two, three, three, four, five, five. If I look at this data set I can see that we have two modes, three and five, and in this instance we call the data bimodal, because it has two modes. What if I have one, oh sorry, one three, not
three three. I must make a clear distinction between the values: two, two, three, three, four, four and five. Now I have more than two modes. Here I have three modes, and the modes are two, three and four. This we call multimodal data. And that concludes the measures of central tendency. Any questions? If there are no questions, you have the exercise to do. "No questions from my side."

Okay, here is your exercise. Remember, here is your data set, and this is a sample data set; we can just call it a sample of students who are doing 3601, with their final-year exam results. You can calculate the mean: remember, if this is a sample, the mean is the sum of all observations divided by how many there are, so you just add all the observations and divide by how many they are. Sorry, my notification keeps popping up on the screen. That's the mean. For the median, the first step is to sort your data in ascending order, which is from lowest to highest. Then you find the position, remember, n plus one divided by two, and then locate your value. For the mode, you are looking for the most frequent value. That's as far as I can give you a hint on how to answer this question. You have 10 minutes.

Okay, how far are we? All done? I actually have been thinking on the answer, so I gave you the answers, because I was trying to find the picture so I could go to the other side. Anyway, that will be the last time I give you answers. So, since I gave you the answers: the mean is the sum of all values divided by how many there are. I hope everybody got the same. Do you agree with the answer? "Yes." For the median, you found the position by using n plus one divided by two. There are one, two, three, four, five, six, seven, eight, nine, ten values, so it is ten plus one, divided by two, which is eleven divided by two, which is 5.5. And if you have sorted your data in ascending order you will find
that the median is 44.5. Agree? "Yes." And the mode is the number that appears more than the others: only 51 appears more than the rest of the numbers. Agree? "Yes." Any question? Anybody who didn't get it?

"I'm sorry, the only one for me will be the median. I actually came up with 46. Can you please take me through how you got to 44.5?" Okay, so did you sort your values? Let's try and sort these values. Talk to me, give me your values. "My values will be 25, 27, 31..." Slowly, slowly, I'm writing them. 25, 27, yes. "31." Yes. "35." Yes. "41, 51." Yes... 48? Okay, so you skipped 48, and then you go to 51 and so forth. So, anybody else who corrected their answer? If the position is 5.5, one, two, three, four, five point five, it falls between those two values, and that is why we get 44.5: the average of 41 and 48. 41 plus 48 equals 89, divided by two, which is 44.5. Anyone else?

Okay, if there are no questions: when we talk about the mean, the median and the mode, we can also describe the distribution by comparing the values. They describe how the data is distributed, and they tell us the shape, in terms of whether the data is skewed or symmetric. When the mean and the median are equal, we call that symmetric, as with a normal distribution; it's symmetric because the mean and the median are the same. When the mean is less than the median, there is a tail to the left, so the data is left skewed, or we can say negatively skewed. When the median is smaller than the mean, the tail goes to the right, and we say the data is right skewed, or positively skewed. That is the distribution when we look at the mean and the median. We can also include the mode, because for this data set the mode will be the same as the mean and the median; they will be equal. For this data set the mode will be here, will be
your mode, and on this data set this will be your mode, because the mode is the highest peak: it is the one that has the highest frequency. So if we include the mode in the picture, for this one the mean, the median and the mode will be the same. Okay, and that is the shape and the distribution when we look at the measures of central location.

Next we're going to look at the measures of variation: the range, the variance, the standard deviation and the coefficient of variation. We will do the standard deviation and then take a five-minute break, because I don't want to sit too long in front of the computer and talk too long. We'll take a five-minute break so that people can go get coffee or tea or water, and then we will come back and use our calculators to calculate the standard deviation, the variance and the coefficient of variation. Then, in the next hour, once that is finished, we do exercises and so forth.

Okay, so let's look at the measures of variation. What does that mean? Measures of variation give us the spread of your data: they tell us how far apart your data is from the mean. If you look at this picture, which shows the distributions of two data sets, you will see that one curve has a peaked shape and the other has a flatter bell shape, and those are the results of the variability within the data. If this is our mean, we can see that for the curve with the peak the data is closer to the mean, and for the flatter one most of the data is far away from the mean. So the standard deviation of the one with the peak is smaller, and for the bell curve that is flatter the standard deviation is bigger. It might be 5, 8, 9, 10, 20 and so forth, and it creates that kind of shape. But when the standard deviation is smaller, like 0.1, 0.3, 0.5, 0.8, 1, 2, 3, then the curve looks tall. Okay, let's understand how do we
calculate all these measures. The first one is the range. The range is the simplest measure of variation, and we have been doing the range since yesterday, so we know that the range is your highest value minus your lowest value, or your largest value minus your smallest value. When you calculate the range, like the median, you have to order your data: it needs to be sorted from lowest to highest.

When you do these kinds of exercises, make sure that you recount your values and double check. Like we did with the exercise just now, where someone missed only one number: that can give you a wrong answer which also appears as an option on the sheet. You will choose that option because it's there, and it will be the wrong one. So make sure that you double check your work: recount the values that they gave you and see whether your total count is the same. If I can take you back one more time to that exercise, what I like to do when I work with this kind of data is this: every time I write down a number, like 25, I go and scratch it out. That way, when I'm done with the list of values, I will know which value I haven't included, and then I can go and correct my data set. So please pay attention when you work with this kind of information.

Okay, so we did the range yesterday. In this data set, one is our lowest value and 18 is our highest, so 18 minus one gives us the range, which is 17. That is the spread of the data: it's only the difference between the highest and the lowest.

Next are the variance and the standard deviation, which are also measures of variation. The variance is the average of the squared deviations of the values from the mean. We usually do not interpret the variance itself. When we calculate the variance, we can calculate it for the population and we can also calculate it for the sample. For the sample variance, the statistic is s squared. This
is where you also need to pay attention. When we talk about the measures that come from a sample, which are the statistics, we always use ordinary letters. For the mean we use x bar; you can always remember x bar because x with a bar on top means the mean. S squared is your variance, and s, anybody can remember s, is the standard deviation. You just need to know that those are the sample statistics. To calculate the sample variance, which is s squared, you don't have to do anything extra with the square. The formula is: take each observation minus your mean, the mean we calculated previously as the measure of central location; square the difference; sum the squared differences; and divide by your sample size minus one. So small n is your sample size, and you divide by n minus one. That calculates the sample variance.

The population variance is denoted by sigma squared. For the population parameters we use Greek letters: for the mean we use mu, and for the variance we use sigma squared. It is also the sum of each observation minus the population mean, with the difference squared, divided by the population size. Now, if you look at the two equations, there are differences: for the population we divide by N, whereas for the sample we divide by n minus one. That means the answers will be different. So you need to pay attention when you look at the question, in the example or in your assignment, and check the statement to see whether they gave you population data or sample data, because the formulas are different.

Next we calculate the standard deviation, which we can describe as the most commonly used measure of central tendency, of variation, sorry. You will notice that we will use the standard deviation across the chapters from now on: when we do chapter five, when we do stats, when we do chapter seven, eight, nine and two, except when we do chapter 12. So in chapter four and chapter 12 we
don't use standard deviation, and you will notice that we use it most often when we do the other calculations. It shows us the variation about the mean: it tells us how far apart the data is from the mean, like on the first slide when I introduced the section. It tells you whether the data is closer to the mean or far apart from the mean.

The standard deviation is just the square root of your variance. If you look at the equation of the sample variance, you will see that for the standard deviation we just put the square root over that equation. The standard deviation has the same units as the original data, and that is why we can safely use the standard deviation to interpret the data by just looking at that measure. In the same way you can calculate the population standard deviation, which is also the square root of the population variance, and you can see that the equations have the same form. So when we do the example now, I'm only going to calculate the standard deviation, but I will also show you how you get the variance, and later on we can calculate the standard deviation using your calculator.

Okay, if I have this data set and I'm told that this is sample data with eight values in the same unit, I first calculate the mean, because the mean is easy to calculate: it is the sum of all of them divided by how many they are. I add all of them, divide by eight, and I get 16 as my mean. Then I need to calculate the standard deviation. Remember the standard deviation formula: s is equal to the square root of the sum of each observation minus the mean, and since we're using the sample we use x bar, squared, divided by n minus one. Now, when it comes to the summation: summation means summing the values, so we have to sum everything that is in the
bracket every time. So we have to sum the first observation minus the mean, squared; the second observation minus the mean, squared; the third observation minus the mean, squared; and so on. That is what we are doing with the summation at the top, over n minus one. We substitute the value of our mean, which we know is 16, and our n is eight, and we use the calculator to do 10 minus 16 squared, plus 12 minus 16 squared, plus 14 minus 16 squared, until you have all the values. When you do all that and add them up, because it's a summation, you get 130 divided by seven. Now, remember, everything underneath the square root is your variance. So if we take the square root of 130 divided by seven, we get 4.3095, and this is the measure that tells us how far apart the values typically are from the mean. And like I said, everything underneath the root sign is your sample variance, so if I take 130 divided by seven I get a variance of 18.571.
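
The hand calculation above can be checked with a short script. This is only a sketch, not the method used in the session, which works by hand and on a scientific calculator. The transcript reads out only the first three values (10, 12, 14), so the other five values below are an assumption, chosen so that n is 8, the mean is 16 and the squared deviations sum to 130, as quoted in the session. The variable names are illustrative.

```python
import math
import statistics

# ASSUMPTION: only 10, 12 and 14 are read out in the session. The remaining
# five values are hypothetical, picked so that n = 8, the mean is 16 and the
# squared deviations sum to 130, matching the figures quoted above.
sample = [10, 12, 14, 15, 17, 18, 18, 24]

n = len(sample)
mean = sum(sample) / n  # x bar: sum of the values divided by how many there are

# Sum everything "in the bracket": (observation - mean) squared, for each value.
sum_sq = sum((x - mean) ** 2 for x in sample)

# Sample formulas divide by n - 1.
sample_variance = sum_sq / (n - 1)
sample_sd = math.sqrt(sample_variance)

# Population formulas divide by N; shown only to illustrate that the answers differ.
population_variance = sum_sq / n
population_sd = math.sqrt(population_variance)

print(f"mean = {mean}")                                  # 16.0
print(f"sum of squared deviations = {sum_sq}")           # 130.0
print(f"sample variance = {sample_variance:.4f}")        # 18.5714
print(f"sample standard deviation = {sample_sd:.4f}")    # 4.3095
print(f"population variance = {population_variance}")    # 16.25
print(f"population standard deviation = {population_sd:.4f}")  # 4.0311

# Cross-check against Python's standard library, which uses the same n - 1 rule
# for variance()/stdev() and the divide-by-N rule for pvariance()/pstdev().
assert math.isclose(statistics.variance(sample), sample_variance)
assert math.isclose(statistics.stdev(sample), sample_sd)
assert math.isclose(statistics.pvariance(sample), population_variance)
```

Treating the same eight values as a population instead of a sample gives a smaller variance (16.25 instead of about 18.57), which is why it matters whether the question says the data is a sample or a population.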