Please remember, as soon as Adele has posted the register in the chat, make sure that you complete the register. Halfway through the session, I will also remind those who joined late, or who haven't yet had the opportunity to put their details on the register, to do so. Today we're going to learn the basics of how we work with numerical data. In the first session we discussed the types of data, so we also looked at categorical data, but in today's session we're only going to concentrate on numerical data, meaning how to visualize numerical data and how to apply some of the measures to that data. Remember, we're also going to use the Newman error prompts to answer the questions. Every time you see a question, ask yourself: Do I understand the question that I have been asked? Do I know what is required of me? Are there facts that I need to take into consideration? Is there a formula that I need to know about? Then, after you have collected all the information, you can start calculating, and then we're going to do feedback: we're going to check our answers across everyone who is in the session. So I expect today to be as engaging as always. Please feel free to participate and type your answers in the chat, and if you can't type in the chat, you can also unmute and let us know how you solved the problem. In the next session, which will be on the 30th, we're going to look at the measures of variation. And in that week, I will let you know what the session plan looks like. Okay. So let's enjoy today's session. Do you have any question, comment, or query before I start the session today? Any question?
Lindsey, I do have a question that I posted in the chat, with regard to the first session, which I was unable to attend. I requested a copy of that recording, because I don't see it on the ISM of the UNISA site.
Okay.
So for any query regarding the recordings, if you can't find them, please send an email to CTNTAT at unisa.ac.za. They should be able to respond to your email or let you know when the recording will be made available. Adele, am I right? Can they?
Yes, I can answer that. The technical team is still busy with the recordings, so they should be ready this week.
Okay. Thanks, Adele. Thank you.
Okay. Pleasure.
Remember guys, you don't have to wait until the session to ask all those technical questions and queries. I've shared the email address, and you probably also received the emails from CTNTAT at unisa.ac.za. Please send an email there. They will respond to your query. Okay. So let's…
Yes, just this. You say we send the email to CM?
CTNTAT. I will share the email address again.
Lizzie, I'm posting it in the chat.
Okay. Thank you, Adele. Okay. So let's look at today's session, where we look at how we visualize numerical data. For this part of the session, you just need your calculator, because there will be some calculations that we need to do. There might be a few formulas that we use, but nothing too intense. They are just formulas to help you understand how some of the calculations work. You don't have to memorize them, because they are straightforward formulas. At the end of the session, you should know how to visualize your numerical data: by putting it in order from lowest to highest, by using a stem and leaf plot, by placing it in a frequency distribution table, which is a summary table, and by drawing a histogram, a frequency or percentage polygon, and a cumulative polygon. We're also going to look at how we put two numerical variables on a scale, that is, how to visualize two numerical variables. Okay.
So in order for you to be able to visualize the data, you can put it in an ordered array, which you can use to display a stem and leaf plot, and with the frequency distribution table that you're going to create, which will also have the cumulative distributions, you can... Please make sure that you are muted, otherwise we get those echoes. You can create a histogram, a polygon and an ogive. We're going to look at each and every one of those visualizations.
An ordered array is when you put your data in order from the lowest value to the highest value, so in ascending order. Here I have the ages of surveyed college students. We have those who attend day classes and those who attend night classes, and the ages are in order already. So this data is an ordered array, because you can see that the day students start with a 16 year old, then 17, 17, 18, 18, up until you get to a 42 year old, and all the data is sorted, including the night students. The ordered array also helps us to visualize the range of the data, because you are able to see your minimum value and your maximum value, and you can calculate the range. I'm not going to tell you about the range today. When we discuss measures of variation, we're going to go into that.
An ordered array also helps you to see if your data has any extreme values, or what we call outliers. The extreme values are those values that are far apart from the rest of the data. For example, if I'm looking at this data, there are no extreme values. But if in this data set we had a day scholar who is 70 years old, then that would be an outlier, because 70 is far from 42. If there were a 50 year old and a 60 year old in between, then we would say that there is no outlier. But on its own, a 70 year old would be an outlier in this data set. So what about if it was the night students and we have a 12 year old?
Say this learner is very bright, and they accepted the learner to come and do some of the courses at this college. A 12 year old will be an outlier from the rest of this group, because between 12 years and 18 years there is a huge difference. So that 12 year old will be an outlier. That is how you identify an outlier: it's a value that is far apart from the rest of the values. Okay.
Once you have ordered your data in ascending order, you can then do some visualization of that data. You can draw up a stem and leaf plot. A stem and leaf plot splits your data into two parts, the stem and the leaf. There will always be one stem, but there will be many leaves. To visualize this, look at a tree. A tree always has one stem, with so many leaves hanging on it. Even when you look at a single leaf, you have this one thing that is holding all the small bits of leaf that are coming out. Let me see if I can draw a tree. Here is my tree with the roots, and this is my stem. And there are all these leaves on this tree. I'm going to assume that I'm drawing a palm tree here. With a leaf as well, let's say this is a leaf that I'm taking out: in this leaf there is something like this, which is a stem, and off it you find those small leaves that look like this. Those are the stem and leaf that we refer to. You will see how that looks when we talk about the stem and leaf plot. Okay, so that is in a nutshell what the stem and leaf plot looks like. When you think about a stem and leaf plot, think about the tree: one stem, many leaves.
There are different types of stem and leaf plots. Before I go further: you can get a tens stem and leaf plot. With a tens stem and leaf plot, you only have two digits.
11 is a digit, 12 is one, two, sorry. 11 is two digits, 12 is two digits, 23 is two digits. That is a tens stem and leaf plot. We can also have a decimal stem and leaf plot. Let's say here I have 1.1, 1.2, 1.3. So I have two digits separated by a decimal point. Let's start with the tens stem and leaf plot. The first digit of each number, so all these first digits here, we're going to call stems. And the second digits we're going to call leaves. I'm going to explain that in a short while. It's the same thing with the decimals: the number before the decimal point will be our stem, and the number after it will be our leaf. But when we have a hundreds stem and leaf plot, it means I have 110, 112, 121. With a hundreds stem and leaf plot, the first two digits become the stem and the last digit becomes the leaf. The leaf will always be only the last digit. So these will be my stems, and those last digits will be our leaves. Okay, and you can have a thousands stem and leaf plot, and so forth. So just remember that.
What other properties do we have on the stem and leaf plot? Remember our day and night scholars. We can represent this data on a stem and leaf plot by looking at the stem. We always have one stem, many leaves. You look at the first digit. For all these values the first digit is one, so one is going to be my stem. Two will be my next stem, three will be my next stem, and four will be my next stem. Then I draw a line. If values repeat themselves, you need to record those repeated values too. So 16 will be a one and a six. You don't read it as six, you read it as 16. You read it as you see it in your data.
17, so you must also read it with the stem. 17, 17, 18, 18, 18, 19, 19. I have covered all the values with a stem of one. Then when I go to the twos, it will be 20, 20, 21, 22, 22, 25, 27. Then 32 and 38, and 42. And that is my stem and leaf plot. With this plot, you are able to see the shape of your data, and we're going to get to that. So let's see how we visualize the night students as well. There is the stem and leaf plot that we just did now, and this is how we visualize the night students on a stem and leaf plot. If you have any question, remember to stop me anytime when I'm explaining and you don't understand something. Don't let me go far without you understanding. Like I said previously...
Yes, just this. Are you asking whether we have any questions?
Yes, if you do have a question.
Yes. When you started this, there where you are pointing, you talked about a 12 which was outside these blocks, and a 70 which was outside on the other side. You said they are the outlaw, I don't remember. Can you just explain that one, because I missed you there?
Oh, when I was referring to the 70 and the 12, those are what we call outliers.
Oh, sorry.
They are outliers, or what we call extreme values. If I need to represent them here, I can say the 70 will be there. And here we will have the 12, a two here. But they are called extreme values. On the stem and leaf plot, you won't be able to see clearly that they are extreme values, but in the ordered data you are able to identify them immediately.
OK, I was explaining how we draw up the stem and leaf plot. Always remember that if they give you a stem and leaf plot, you can take it and turn it back into a data set. So a three and a four is 34, because it's split into three as the stem and four as the leaf.
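As a rough sketch of the reading-back step just described, the snippet below rebuilds a data set from a tens stem and leaf plot. The function name `stem_leaf_to_values` and the dictionary layout are illustrative assumptions, not something from the session; the ages are the day-student values read out above.

```python
def stem_leaf_to_values(plot):
    """Rebuild the data set from a tens stem and leaf plot.

    `plot` maps each stem to its list of single-digit leaves
    (an assumed layout, chosen only for this illustration).
    """
    values = []
    for stem in sorted(plot):
        for leaf in plot[stem]:
            # Read the leaf together with its stem: stem 3, leaf 4 is 34.
            values.append(stem * 10 + leaf)
    return values


# Day-student ages as read out in the session.
day_students = {
    1: [6, 7, 7, 8, 8, 8, 9, 9],
    2: [0, 0, 1, 2, 2, 5, 7],
    3: [2, 8],
    4: [2],
}

ages = stem_leaf_to_values(day_students)
print(ages)       # the ordered array, starting at 16 and ending at 42
print(len(ages))  # counting the leaves gives the number of data values
```

Note that a leaf of 0 under stem 2 comes back as 20, which is why a zero leaf still counts as a data value.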
And when you are asked questions about how many of the values fall in a certain part, or when you need to calculate a percentage... Let me put it this way: how many of the values fall within 50% of this data set? Then you should be able to count how many there are. You just count the leaves. To get your frequency, you count only the leaves. One, two, three: there are three of them. One, two, three: there are three of them. One, two, three, four, five: there are five of them. And you use the frequencies to calculate the percentages. Remember, to calculate a percentage you add all the frequencies to get the total, and then use the total to calculate your percentages. Then you will know how many values fall within that 50%. We will do some activities and exercises to show you how to answer those questions if they are asked in the exam or in your assignment.
OK, for now, I just want to see if you were listening. As the manager of an insurance claims division, you have to set up a performance level. You have asked 20 of your experienced claims processing personnel to record the number of claims that they process during a specific week. The following data set was collected. The claims processed by the 20 claims processors in a week are as follows: 46, 44, 33, 42, 41, 35, 37, 37, 27, 36, 38, 39, 31, 30, 40, 41, 48, 45, 30, and 28. The question here is: can you create an ordered array, meaning arrange this data from the lowest value to the highest value? And once you have done that, can you set up a stem and leaf diagram? I know that it's not going to be easy for you to share a picture. But if you are able to share a picture of how you developed your stem and leaf plot, you can do that in the chat so that everybody can also see. You have five minutes.
Oh, let me make it three minutes, because it's not too difficult. You have three minutes. Are we done? Are we winning? I think so. Yes, Manas, Manisa, you are on the right track. You're missing... oh, you're not missing your stem. OK, I see what you did, because you typed it. You must remember to draw the line in between your stem and leaf so that it's clear. At the moment, I'm seeing three-digit numbers. But you are on the right track, except you forgot the other 41. When you count all the leaves, they should give you 20. Oh, yes, Somsa, correct. So it seems as if you guys do get the grip. You understand what's happening.
So you need to order your data. I'm just going to write how I ordered mine. It's 27, 28, 30, 30, 31, 33. How many 33s? One. 34, 35, 36, 37, 38, 39. Then we go to the forties: 40, 41. There are two 41s. 42, 44, 45, 46, 48. And I think I've counted all of them. One, two, three, four, five, six, seven, eight, nine, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20. There are 20. I have twenties, thirties, and forties. So I'm going to have my stem and my leaf. You don't have to write "stem" and "leaf" like that. You can just draw the line, as you have done, and write two, three, and four. Then we'll know that this is the stem and that is the leaf. Two, three, and four. And placing my leaves: seven and eight, for 27 and 28. Every time you write the stem and leaf plot, keep in mind that you're not writing seven and eight. Say 27, 28, so that you don't forget this. 30, 30, 31, 33, 34, 35, 36, 37, 38, 39. And we have 40, 41, 41, 42, 44, 45, 46, and 48. And that's how you draw a stem and leaf diagram. How was the stem and leaf plot? Questions?
Excuse me. I have followed you up to now, but I'm not in a position to write. I am in a car. When I stop, I will be able to try this exercise.
No, it's fine. As long as you can follow what we are doing and you understand what's happening.
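The exercise above can be sketched in a few lines of code. This is only an illustration: the function name `stem_and_leaf` is an assumption, and the values used are the ordered array as it was written out during the feedback (20 values, with leaves 7 8, then 0 0 1 3 4 5 6 7 8 9, then 0 1 1 2 4 5 6 8).

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group whole numbers into {stem: [leaves]}; the leaf is the last digit."""
    plot = defaultdict(list)
    for value in sorted(values):       # step 1: the ordered array
        stem, leaf = divmod(value, 10)  # step 2: split off the last digit
        plot[stem].append(leaf)
    return dict(plot)


# Ordered array as written out in the session's feedback.
claims = [27, 28, 30, 30, 31, 33, 34, 35, 36, 37,
          38, 39, 40, 41, 41, 42, 44, 45, 46, 48]

plot = stem_and_leaf(claims)
for stem, leaves in plot.items():
    # The vertical bar plays the role of the line drawn between stem and leaf.
    print(f"{stem} | {' '.join(str(leaf) for leaf in leaves)}")

# Counting all the leaves should give back the number of data values.
print(sum(len(leaves) for leaves in plot.values()))
```

Because the stem is everything except the last digit, the same function also handles three-digit values, where the first two digits form the stem.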
Thank you, thank you.
Sorry, it's just Liz.
Yes.
If, for example, in an exam they ask us in this example how many leaves there are, would we count the zeros as part of the leaves?
Yes.
OK.
Remember, they will not ask you how many leaves there are, but they will ask you how many data values you have. The zeros are not zeros. Remember, that is a 30. It's a number.
Yes, yes. OK, yeah. So it's a thirty.
Yes. OK. That is why I keep on saying, always remember in your mind that that is not a zero. Say 30, 30, 40, 41. If you keep on saying zero, zero, in your mind you will think that it is nothing, nothing, nothing, and zero plus zero will be zero, and you're going to add things incorrectly. So that is not a zero. It's a data value.
Right. Thank you.
These are some of the typical questions, because you write multiple choice questions. With multiple choice questions, they will not ask you to draw up a stem and leaf plot. They will give you a stem and leaf plot, or they will give you information from which you need to show that you understand the stem and leaf plot. For example: a stem and leaf display describes two-digit integers between 20 and 80. So here we have 20 and 80, which means there are twenties, thirties, forties, fifties, sixties, seventies and 80 in the stem and leaf plot. For one of the classes displayed, the row appears as such. So here they give you just part of the stem and leaf plot. If you look at that, remember, that is your stem, and that is the line that splits the stem and leaf. So you should be able to identify that as a row of a stem and leaf plot. It appears with five as my stem and two, four and six as my leaves. So 52, 54... oh, I'm giving you the answer right now. Oh, gosh. Why am I doing this to myself? Okay, so anyway, it means I'm going to answer this question. The row appears like this, with the five as the stem and the two, four and six as the leaves.
What numerical values are being described here? 52, 54, 56. Yes, that is option three. They just want to see if you know how to take a stem and leaf plot and convert it back to the data set, right? That was easy, right?
Let me ask a question quickly. If they ask a question like this, but referring to hundreds or thousands, will they separate the leaves?
Like I said, and you can look at my response in the chat as well, let's say for example we have a data set that looks like this. I'm not going to write so many of them. Say we have 90, 95, 100, 101, 110. If we have a data set like this, you can see that here you have a mixture of a tens and a hundreds stem and leaf plot. Where there are two digits, the first digit is your stem. On those with three digits, the first two digits will be your stem. So you will have nine, 10, and 11 as your stems. So that is a 10 and an 11, and you're going to read 90, 95, 100, 101, and 110. Okay?
Okay, I follow. Thank you.
Okay. No problem. So now let's look at how we do a frequency distribution. A frequency distribution is a summary table. Remember, when we were doing categorical data, we used a frequency table, which is a summary table of categorical data. This is a summary table for numerical data. So a manufacturer of insulation randomly selects 20 winter days and records the daily high temperatures in degrees. Here we are given that data, and the data is already sorted from lowest to highest. Our lowest temperature is 12 degrees and our highest temperature is 58. So we can take this data set and build a frequency distribution. Now, because here we don't have categories, we need to create those categories, and the categories are what we call classes. How do we do that? If I have QMI students in here, who doesn't have to learn how to develop this? Nobody. Everyone needs to know how to do this.
So the first step is to find the range of your data. The range is your highest value minus your lowest value. My highest value is 58. My lowest value is 12. 58 minus 12 gives us 46. So my data range is 46. For the next step, I need to select how many classes I want to create, meaning how many categories or groupings of data I want to create from this. Usually it's between five and 15. When you work as a statistician, you're going to have to make that decision when you are doing reports at work; you can determine how many classes you want to use. But for the purpose of today's class, because our data set is very small, we're just going to create five classes. Okay. So we're going to group this data set into five groups. How big each group should be is what we call the class interval. So the size of the group is determined by the class interval, or class width.
Now, my math here is not exact. Don't say, but how? You're saying 46 divided by five is 10. I'm not saying that. What I'm saying is that it will be nine point something. It's 9.2, which is not even that close to 10. I'm going to explain this. I need to find a way to create a clean interval. I take my range divided by my number of classes, 46 divided by five, and that tells me that my class width needs to be 9.2. If my class width is going to be 9.2, then it depends on where I'm going to start my data. Let's say I'm going to start at nine. I must add 9.2 to that nine. My lowest boundary, because I'm creating classes, will be nine. To create my upper boundary, I must add the class width, which is 9.2, to the nine that I'm starting with.
So my upper boundary will be 18.2. Then the next class will start from 18.2, let me write it there, and I must add 9.2. So this is class number one, and this will be class number two. 18.2 plus 9.2 will be 27.4. And that is how I'm going to go on until I create five classes. So I'm going to say from nine to 18.2, and then the next one will be 18.2 to 27.4. As you can see, my class boundaries are not clean enough. It's not easy to work with data like that. That is the reason why I rounded up and said that, in order to create clean boundaries, I need to make the width 10.
So, looking at my data set, if my first point is 10, I'm going to do the same thing that I just did here. If my class width is 10, I start with a lower boundary of 10. It will be 10 plus 10, because I'm going to add 10 to my lower boundary, and that will give me my upper boundary of 20. The next one will start at 20. So my lower boundary will be 20, and 20 plus 10 will be 30. So look at this: this is the before and this is the after, and you will see the difference. Here my class boundaries will be 10 to 20 and 20 to 30. Can you see the difference? It's clean; there are no challenges with this. Sometimes we use 10.5, 20.5; it will depend on your module and how they define your class boundaries. But with clean class boundaries, the first class means any value from 10 up to but below 20; it does not include 20. Here, the next class will include 20 but does not include 30. So it will be a value from 20 but below 30. And that's how you will do this. Okay, let me erase all my ink on this slide. So now we agree that I came up with the 10 in order to create clean boundaries.
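The width calculation above can be checked with a short sketch. The variable names are assumptions made for illustration; the numbers (lowest 12, highest 58, five classes, width rounded up to 10, first class starting at 10) are the ones used in the session.

```python
lowest, highest = 12, 58
n_classes = 5

data_range = highest - lowest        # 58 - 12 = 46
raw_width = data_range / n_classes   # 46 / 5 = 9.2, an awkward width

# Rounded up to a convenient value by judgment, as done in the session.
width = 10
# Starting point chosen so that the lowest value, 12, falls in the first class.
start = 10

# Each class runs from its lower boundary up to, but not including,
# its upper boundary.
classes = [(start + i * width, start + (i + 1) * width)
           for i in range(n_classes)]

print(data_range, raw_width)
for lower, upper in classes:
    print(f"{lower} to less than {upper}")
```

Rounding the width up rather than down matters here: the last class has to end above the highest value, 58, so that no observation is left out.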
Now, since I have set my class interval to be 10, I can define those classes by using the class width to create my class boundaries and limits. The first class is going to be 10 to less than 20, like I explained. If I look at my data, it starts at 12. So the smallest value that I can use will be 10, because I need to make sure that 12 is included. So it will be 10 to less than 20, which means any count that I do will include 10 and any value bigger than 10, but it should be a value less than 20. The next class will include 20, but less than 30. And so on for the next one and the next one. So I have my five classes that I have created based on the data. My last value should also be counted in the last class, because if it's 58 and I end on 50, it will not be included. The class ending at 60 includes 58, because it says less than 60. So I've created my five classes, or my five categories.
Now it's time to create a frequency distribution, by taking these classes and counting how many of the values that we have fall within each class. For our class 10 but less than 20, we count how many of the values fall within it. One, two, three: there are three values that fall in the first class. The second class says any value from 20 but less than 30. We don't count the ones that we already counted. So any value up to but not including 30: those ones. One, two, three, four, five, six: there are six. The next one is less than 40, so I go up to where 40 is. It's only those ones. One, two, three, four, five: there are five. Then I write the five there, and I complete the entire frequency distribution. For the percentage, I need the total. There were 20 days.
If you add all the frequencies, they should give you 20. If they don't give you that total, it means you did something wrong when you were counting. So you need to make sure that the sum of the frequencies is the same as the number of values that they gave you in the beginning. To calculate the percentage, remember, it's three divided by 20, which gives you 15%. Six divided by 20 gives you 30%. Five divided by 20 gives you 25%.
Because this is numerical data, we can also calculate these two columns, which are called cumulative frequencies and cumulative percentages. For the cumulative frequency, in the beginning we keep the very same number, because there is no other number before it. So the cumulative frequency of your first class equals its frequency. The second value will be the first value plus the second value: three plus six gives you nine. Put differently, three from the previous cumulative frequency plus the second class frequency, six, gives nine. We continue the same way. Since we have a nine, nine plus the third class frequency, five, gives us 14, and you continue like that. The cumulative percentage works the same way. Three divided by 20 is 15%. Nine divided by 20 is 45%. And that's how you complete your frequency distribution table.
How then do we interpret this frequency distribution? Let's start with the frequency. Three means that, in those 20 days, there were only three days where the temperature was between 10 and less than 20. Five means there were five days where the temperature was between 30 and less than 40. Or we can say that on 25% of the days the temperature was between 30 and 40.
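The table-building steps above can be sketched as follows. The first three frequencies (3, 6, 5) and the total of 20 come from the session. The last two frequencies (4 and 2) are not read out at this point; they are inferred here from the total of 20 days and from the figure of 90% of days below 50 degrees quoted for this table, so treat them as an assumption.

```python
from itertools import accumulate

classes = ["10 to <20", "20 to <30", "30 to <40", "40 to <50", "50 to <60"]
# 3, 6 and 5 are counted in the session; 4 and 2 are inferred (assumption).
frequencies = [3, 6, 5, 4, 2]

total = sum(frequencies)  # must equal the number of days, 20

# Percentage: class frequency divided by the total, times 100.
percentages = [100 * f / total for f in frequencies]

# Cumulative frequency: running total of the frequencies (3, 3+6, 3+6+5, ...).
cumulative = list(accumulate(frequencies))

# Cumulative percentage: cumulative frequency divided by the total, times 100.
cumulative_pct = [100 * c / total for c in cumulative]

print(f"{'Class':<10} {'Freq':>4} {'%':>6} {'Cum':>4} {'Cum %':>6}")
for row in zip(classes, frequencies, percentages, cumulative, cumulative_pct):
    print(f"{row[0]:<10} {row[1]:>4} {row[2]:>6.1f} {row[3]:>4} {row[4]:>6.1f}")
```

The last cumulative frequency should always equal the total, and the last cumulative percentage should be 100, which is a quick check that nothing was miscounted.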
How about the cumulative frequency? The cumulative frequency helps us to identify the answer clearly, without having to add, add, add. If I need to know on how many days the temperature was less than 40 degrees (sorry, not 40%, 40 degrees), it means I'm talking about all of these classes. So instead of adding three plus six plus five on the frequency distribution, I can clearly see from the cumulative frequency that there were 14 days where the temperature was less than 40 degrees. If they ask me on how many days the temperature was less than 30 degrees? That's a question. Anyone?
Nine days.
There were nine days. If they say, what was the percentage of days where the temperature was less than 50 degrees? It's 90%. So you should be able to look at the cumulative frequency table and give the answer without scratching your head and trying to figure out what they are asking. The cumulative frequency gives you all of the classes up to the point where they need you to be: less than 50 will mean all of those, and so forth.
So let's look at some exercises as well. Oh no, before we look at the exercises: we can take a frequency distribution table, and I'm not going to stay long on this one, and answer questions like this. Here they give you a frequency distribution table. These are your frequencies (remember, counts are frequencies) for 30 shoppers. They told us that there were 30 shoppers here. The question is: what percentage of shoppers spent 1,600 or more? So they're saying, look at that block, because it's 1,600 or more. In order to answer this question, you count the classes in that block: one, two. And there are three here.
So you can calculate the percentage. We know that there are 30 shoppers in total, and since I've got two classes, it's three plus one divided by 30, which will be four divided by 30. So the answer is?
13.3%.
That will be 13.3%, because you also need to make sure that you multiply by a hundred. Let me not write the hundred percent: multiply the answer by a hundred and that will give you that answer.
These are some of the questions that you will get in the exam. They look like this. Here they have given you the cumulative frequencies. You still remember how you got the cumulative frequency: you added the frequencies. Now they're asking you to answer this question by working back to the frequency. They're asking: the frequency for the class 10 and under 15 is...? How did we find the cumulative frequencies? Remember, it means there are frequencies behind them. Let's go back to our exercise. How did we get nine? We said nine is three plus six. So if I have the nine, how do I find this value if they didn't give it to me? Let's assume that here they didn't give you this value. You don't know what it is, and they put a question mark there. That's exactly what they're asking you. How do you get that? You take the cumulative frequency and subtract the one above it. So let's see what you are given. We know that the frequency for the first class is 10, because it's always equal to the first cumulative frequency. To find the frequency of this next class: 10 plus the frequency of this class should give us that 15. So it will be 15 minus 10, because it's 15 minus the previous cumulative frequency, which is equal to five, which is option number two. Easy stuff, right?
Excuse me, Ms.
Yes.
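The subtraction rule described here, each cumulative frequency minus the one before it, can be sketched in code. The function name `frequencies_from_cumulative` is an illustrative assumption; the cumulative values 10, 15 and 21 are the ones quoted in the session for this exam-style question, and 3, 9, 14 are from the temperature table.

```python
def frequencies_from_cumulative(cumulative):
    """Recover class frequencies from a list of cumulative frequencies."""
    frequencies = []
    previous = 0  # nothing comes before the first class
    for value in cumulative:
        # A class frequency is its cumulative frequency minus the previous one.
        frequencies.append(value - previous)
        previous = value
    return frequencies


# Cumulative frequencies quoted for the exam-style question.
print(frequencies_from_cumulative([10, 15, 21]))

# The same rule applied to the temperature table.
print(frequencies_from_cumulative([3, 9, 14]))
```

Working through the whole list at once, rather than one class at a time, makes it easy to pick out the frequency for whichever class the question actually names.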
Aren't they asking for class 10 to under 15?
Oh, sorry. Oh, yes, you are right. I think this one is five to 10. Sorry, my bad. But it's still fine. So 15 minus 10 gives us the frequency for that class. For this one, it will be the same: 21 minus 15. That's six. And that will be number five, which is six, because we know that six plus 15 will give us 21. And so forth. So you are able to calculate the frequencies of the whole table as well.
Let's look at how we visualize the data from the frequency distribution. A histogram is a bar chart of numerical data. With the histogram, we take the classes, and they are going to be our categories. The values that you see here are our classes. The values you see at the bottom are my midpoints. The midpoint between 10 and 20 is 15, because 10 plus 20 is 30, divided by two is 15. So these are midpoints: midpoint, midpoint, midpoint. Where the bar starts and where the bar ends are our class boundaries. The height of each bar will be the frequency. You can see that between 10 and 20 it is three, so that bar should be at three. Between 20 and 30, which is between that value and that value there, we said there were six. As you can see, there are six. With the histogram there are no gaps, because where the first class boundary ends, the next class boundary starts, and so on. Therefore there will be no gaps on the histogram. Like I explained, the height represents the frequency, or sometimes it can represent the relative frequency or the percentage. With a histogram you are able to see something more, and this is for both QMI and stats, for both of you.
A histogram can tell you the distribution of your data, whether it's symmetrical or normal, whether it's uniform, skewed, or bimodal, and so on. Those are just the shapes of a histogram. You don't have to learn all of them, but if your module does tell you about the skewness of the data, then you can use the histogram to see the shape of that data. Later on, next week, when we deal with the variances, we can talk about kurtosis: platykurtic, leptokurtic, and mesokurtic, which is your normal distribution. Okay, the other way of summarizing the data from the frequency distribution is by using the polygon. With the polygon, we take the class boundaries and find the midpoint, like the midpoints I had previously, and I use the frequencies to plot the data. It's just plotting the data using the midpoints of the class intervals and the frequencies. Then there is the ogive, or what we call a cumulative percentage polygon. Here we use the lower class boundary. So you go and look at the lower class boundary. Because it's cumulative, it will start at zero, because none of the values lie below 10. So there it will be zero, and then we move on, look at the cumulative percentage at each class boundary, and plot those percentages. And that is just your ogive, or cumulative frequency polygon. The last type of visualization that you can do on numerical data is a scatterplot. The scatterplot is for those whose module covers it. We're going to look at the scatterplot when we do the regression line, but it's just a graphical demonstration of a relationship between two numerical variables. In the book, sometimes they call it the measure of relationship. It just shows you the relationship between two numerical variables, one on the x-axis and one on the y-axis. I don't have to explain more about it now. We'll explain it later on when we look at regression. Any questions?
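The two small calculations used above, working back from cumulative frequencies to class frequencies and finding a class midpoint, can be sketched in Python. This is only an illustration: the function names are made up for this note, and the cumulative values 10, 15 and 21 are the ones read out in the exam-style question.

```python
from itertools import accumulate


def frequencies_from_cumulative(cumulative):
    """Recover class frequencies: each cumulative value minus the previous one."""
    previous = 0  # nothing has been counted before the first class
    freqs = []
    for c in cumulative:
        freqs.append(c - previous)
        previous = c
    return freqs


def midpoint(lower, upper):
    """Class midpoint: (lower boundary + upper boundary) / 2."""
    return (lower + upper) / 2


cumulative = [10, 15, 21]            # cumulative frequencies from the question
freqs = frequencies_from_cumulative(cumulative)
print(freqs)                         # [10, 5, 6]

# Adding the frequencies back up must reproduce the cumulative column.
assert list(accumulate(freqs)) == cumulative

print(midpoint(10, 20))              # 15.0
```

The first class keeps its cumulative value as its frequency, which is why the running `previous` starts at zero.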
Because we need to move to how we use measures of location to understand numerical data. Are there any questions? If there are no questions, then let's move on. So for those doing statistics, you need to learn the properties of all the visualizations that we spoke about. You need to learn the properties of the frequency distribution, the histogram, the stem-and-leaf plot, the frequency polygon, and the cumulative frequency polygon or percentage polygon. By properties, I mean you need to know what an ogive is and what is represented on an ogive. For example, with an ogive, we use it to compare two or more groups, things like that. You need to know that. When we talk about the frequency polygon, you need to know that we use the class midpoints and the frequencies, and so forth. That is very important, because they can ask you those types of questions. Those in QMI, you just need to know how to calculate and how to visualize them, because the questions might be based on the values you see. They will not ask you to explain what a percentage polygon or an ogive means, but they will ask you questions relating to the graph, like what is the highest value on this frequency polygon? Then you need to be able to identify that value. On this one, the highest frequency is six. You should be able to answer questions like that. So there are different ways of asking questions in your exams. Okay, so now let's look at how we use measures of central location. Not the variation; we'll deal with the variation later on, next week. Measures of central location are sometimes called measures of central tendency. Sometimes they are called measures of locality. There are many different names for one thing, central location.
By the end of the session, you should be able to describe the properties of central tendency or central location, and describe not the variation, but the shape of the numerical data; the others we will deal with later on. So there are three measures of central location. They tell us about the distribution or the location of your data, or let's put it this way: they help us to describe the data. The most commonly used measure of central location is the mean, which is also known as the average. This is what we use most of the time when we say "on average". We always say that, and we are talking about the mean. It is calculated by adding all the values and dividing by how many there are. It's the sum of the values divided by the number of values. And it is always affected by outliers. So when you are a statistician doing some calculation and your data has outliers, you need to make sure that you handle them, because they can skew the results, give you a wrong picture, and lead people to take wrong decisions. So you need to know how to handle outliers, but that's not the purpose of this session. Those who are doing statistics, you need to know that with the mean there are two formulas. There is one for the sample, which is the statistic. For the sample mean, we use X bar. You also need to know how to pronounce some of these things, because in the exam they can ask you what an X bar is. The sample mean, represented by X bar, is equal to the sum of the observations divided by the sample size, which is the small n. Then there is the population mean, which is the population parameter. Remember the measures coming from the population are called the parameters.
The population mean, which is the population parameter, is denoted by mu, the Greek letter spelled M U, which looks like a u. Mu is the sum of your observations divided by the population size, which is the capital letter N. You need to know how to identify both of these in terms of the sample statistic or the population parameter. What do we mean by the mean as a measure of central tendency? This is what we mean. Say I have two, three, five, eight, and nine. These are my numerical values, and it depends on whether it's a sample or a population: if they say this is a sample, then we use the sample mean. If they say it's the population, then we use the population mean. They are both calculated the same way. It's just that how you represent them is different. How you write the formula is different. One is with the small letter n and the other with the capital letter N. One is represented by an X with a bar on top, which is X bar, and the other one is represented by mu. Always remember, for the sample statistics, we always use the normal letters that we know. You will notice when we do measures of variation that we will also introduce some Greek letters and normal letters. So for the sample, we always use normal letters that you know, like X and S. For the population, we always use Greek letters like mu. Okay, so let's calculate the mean. Let's say this is for the sample. X bar will be the sum of your x-i divided by how many there are. The sum is two plus three plus five plus eight plus nine, divided by how many there are. You need to count them: one, two, three, four, five. There are five. Two plus three plus five plus eight plus nine is 27, over five, which is equal to 5.4. Five point four, correct. That is our mean. So let's assume for a minute that I work in HR and these are the monthly salaries of employees in the company that I work for.
One employee earns 2,000. It's not even a living wage, a minimum living wage. So they earn 2,000 rand, 3,000 rand, 5,000 rand, 8,000 rand, and 9,000 rand. I work in HR and I want to tell the managers or the executives of my company that, on average, we are good, we are paying our staff really well, because on average we pay our staff about 5,400 per month. That is a fair picture, because it's in between; it sits between the two lowest and the two highest. So on average, that is how much our staff salary is. But what if in this very same company there is one executive employee who we just hired, and this executive takes 20,000? Maybe it's the owner of the company, who knows? They take home 20,000. I said the sum was 27, so now I have 27 plus 20, which equals 47, divided by six, because there are six of them now, and I will be saying the mean of this company is now about 7.8. Then I go and tell my executives, ooh, in this company we are paying our employees very well. On average, we pay a salary of 7,800. Will that be a true reflection of the company? Nope, it will not be, because it skews the whole data. As you can clearly see, it is the top earners plus the executive who are earning huge amounts of money, which gives the wrong picture that on average we are paying our employees very well, whereas we are not. Because the poorest of the poor will always remain poor: if we don't increase their salaries, we will always be assuming that they are earning more, even though they are not. So as you can see, outliers skew your data. With the first one, where there was no outlier, the mean was almost there; even though it was still a bit high, at least it balanced out. Whereas with this one, it doesn't. So that's how outliers affect the mean. I'm not going to talk about this again.
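As a quick check of the salary example, here is a minimal Python sketch of the mean formula (sum of the values divided by how many there are), run first without and then with the 20,000 executive salary. The function name `mean` and the choice to work in thousands of rand are just conveniences for this illustration.

```python
def mean(values):
    """Mean: sum of the observations divided by the number of observations."""
    return sum(values) / len(values)


salaries = [2, 3, 5, 8, 9]            # monthly salaries, in thousands of rand
with_executive = salaries + [20]      # same staff plus the one executive

print(mean(salaries))                 # 5.4
print(round(mean(with_executive), 2)) # 7.83
```

One extra value of 20 pulls the mean from 5.4 up to roughly 7.8, above what four of the five original employees earn, which is the distortion described above.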
Now let's talk about the second one, which is the median. The median is our middle number, the value in the middle. If we have an odd number of values, the value in the middle, which breaks the data set into two halves, will be our median. But if we have an even number of values, we need to take the average of the two values in the middle to find the median. And the median is not affected by extreme outliers, because we don't care about the outliers; we're looking for the middle value. So let's look at this. I'm going to use the same two, three, five, eight, nine. You need to order your data from lowest to highest when you work with the median. When you work with the mean, it's fine. You can just calculate it and be done with it. But for the median, your data needs to be ordered. It needs to be in an ordered array. So we said the median is the value in the middle. With two values on each side, the value in the middle is five. That is our median. Five is our median. Sometimes it's not as easy to see as that. So what do we do in that instance? To find the median, we first need to find the median position. To find the position, we use the formula n plus one divided by two. This helps if you have, say, 30 values, or like the 20 values of our initial exercise, when we were doing the stem-and-leaf plot. Here you can count them and there are five, so it's five plus one divided by two, which gives me six divided by two, which is three. Therefore, the median is in the third position. So I can count one, two, three, and the value in position number three is my median. Now, let's assume we have two, three, five, eight, nine, and 20. We do the same. There are six.
The median position is n plus one divided by two, which is six plus one divided by two, because one, two, three, four, five, six, there are six now. That is equal to seven divided by two, which is 3.5. So I need to go and count one, two, three, and the 0.5 means it's between those two values there. It's somewhere in between the third and the fourth value. So in order to find the median, I need to take five plus eight divided by two, which is 13 divided by two, which is equal to 6.5. Therefore, if I go back to the HR example, it's at least closer, because it's not far away from most of the salaries. It's still high, but it says that the typical employee is paid 6.5. The median is still high, but it's at least better than the 7.8 of our mean. And that's how you find the median. So you must always remember that. The other measure of central tendency is what we call the mode. The mode is the number that appears more often than the other numbers. It's the number that is repeated the most, not the highest number. It is also not affected by extreme outliers. And we can use the mode on categorical data as well, because the category that has the highest frequency will be your modal category. The same applies on the histogram: the class with the highest peak will be your modal class. If there are two highest peaks, then your data is bimodal, and the same applies here. You can have no mode, one mode, or several modes: bimodal, meaning two modes, multimodal, and so forth. What do we mean? If I have two, three, five, eight, and nine, there is no number that appears more often than the others. So therefore, there is no mode in this data set. But if I have two, three, three, five, eight, and nine, I have one mode, which is three. So three is my mode. There is one mode. If I have, and I'm going to write this on the right, two, three, three, five, eight, nine, and nine, I have two modes.
Therefore, three and nine are my modes. And this data set, we call it bimodal. You can write it as one word, bimodal. If I have two, three, three, five, five, eight, nine, nine, I have three and three, five and five, and nine and nine. So I have three, five and nine as my modes. Therefore, I have a multimodal data set. As you can see, you can have no mode, one mode, a bimodal or a multimodal data set. And those are the measures of central tendency. Now, for the next 30 minutes, we'll do exercises. Any questions? Any question? Those who haven't completed the register, I'm going to put the register in the chat. Make sure that before you leave today's session, you have completed the register. I have pasted the link in the chat. So in the absence of questions, let's look at the distribution of the data. Those who are doing statistics, you need to always remember the following: you can describe the distribution of your data by using the histogram or the measures of central tendency. Later on we can use the measures of variation, and later on the interquartile range. But for today, you need to look at how your data is shaped. If your mean is less than your median, then we say your data is left skewed. If your mean is equal to your median, then we say your data is symmetric, or we say it's normal, or symmetrical. If your median is less than your mean, which means your mean is greater than the median, then we say it's right skewed. This refers to the tail. Let's start there. When the tail is to the left, it's left skewed. If you have the tail to the right, it's right skewed. Tail to the left, left skewed. Tail to the right, right skewed. Always remember that. Otherwise, if you're using the measures of central location, if the mean is less than the median, it is left skewed.
If the median is less than the mean, it's right skewed. Okay, consider the following data from a sample. As you can see there, they say the sample. So it means the formula you're going to use is X bar. It is a sample of 12 monthly sales of bicycles sold by a bicycle dealer, and that is the data set. Calculate the mean, find the median position, find the median, and find the mode. Remember, for the median position we use n plus one, divided by two. You have five minutes, and you can also post your answers in the chat, or when we come back, someone needs to tell us how they calculated. So we can have three people giving us answers. I'm going to give you five minutes, and your five minutes start right now. Are we winning? Let's see. No answers in the chat. Remember you can also post in the chat. Okay, are we done? Anyone who wants to show us how they calculated everything? The mean is the sum of all observations divided by how many there are, right? That is the formula, okay? 92 divided by 12. 92 divided by? By 12. By 12. Seven comma six seven. Seven comma six seven. And the median. We need to first find the position. Okay, it's 12 plus one. 12 plus one. Divided by two. Okay. Six comma five. Six comma five. But we first need to sort the data. So how did you sort the data? Five, then the sixes. So there are four of them. And seven, eight. Seven, eight. Double nine. Double nine. Triple 10. Triple 10. One, two, three, four, five, six, seven, eight, nine, 10, 11, 12. There are 12 of them. So now let's go find the median. We count up to position six point five. One, two, three, four, five, six. Point five will be between seven and eight. So the median will be seven plus eight divided by two. And that will be equal to? Seven comma five. Seven comma five. And the mode, which is the number that appears more often than the other numbers? Six. Six. Let me see in the chat.
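The bicycle exercise can be checked with a short Python sketch that follows the same steps: sort, find the median position with (n + 1) / 2, then read off or average the middle values, and count repeats for the mode. The helper names are invented for this note, and the `sales` list is reconstructed from the sorted values read out in the session (it sums to 92), so treat it as an assumption rather than the original slide.

```python
from collections import Counter


def median_position(n):
    """Position of the median in the ordered data: (n + 1) / 2."""
    return (n + 1) / 2


def median(values):
    """Median of a non-empty list via the position formula (positions are 1-based)."""
    ordered = sorted(values)
    pos = median_position(len(ordered))
    lower = ordered[int(pos) - 1]       # value at the whole-number position
    if pos == int(pos):                 # odd count: the position is exact
        return lower
    upper = ordered[int(pos)]           # even count: average the two middle values
    return (lower + upper) / 2


def modes(values):
    """All values with the highest count; an empty list if nothing repeats."""
    counts = Counter(values)
    top = max(counts.values())
    if top == 1:
        return []                       # no value appears more than once: no mode
    return sorted(v for v, c in counts.items() if c == top)


# Assumed data, rebuilt from the sorted values spoken in the session.
sales = [5, 6, 6, 6, 6, 7, 8, 9, 9, 10, 10, 10]

print(round(sum(sales) / len(sales), 2))   # 7.67
print(median_position(len(sales)))         # 6.5
print(median(sales))                       # 7.5
print(modes(sales))                        # [6]

# The small mode examples from earlier in the session.
print(modes([2, 3, 5, 8, 9]))              # []
print(modes([2, 3, 3, 5, 8, 9, 9]))        # [3, 9]
```

Returning a list from `modes` is a design choice made here so that no mode, one mode, bimodal and multimodal data all come back in the same form.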
From one of you I have eight point five. So let's see if we counted right. One, two, three, four, five, six, so seven point five. It's between those two. Yeah. I did it between eight and nine instead of seven and eight. So there's my mistake. Any questions? Any questions? If there are no questions. Okay, sorry. Let me also check the chat. Manesa, yeah. For the mean, I see there you have 992. So probably you didn't divide by how many there are. Remember, it's the sum of all of them divided by how many there are. And I see another answer says it is seven point six seven. What did we write there? Oh, seven point six seven. If they say two decimals, then we need to round it off to two decimals. If they say one decimal, you just need to round it off to one decimal. Let's look at the next one. Use the information below to answer the question. We are given the data set, which is 15, seven, 10, 17, 30, 67, 111 and 102. The score in the data set that occurs with the greatest frequency is known as what? We don't want the number; we want to know what we call the value that appears more often than the other numbers, the one with the greatest frequency. Is that option one, two, three, or four? Number one. Number one will be the answer, the mode. You guys, you are rocking. You know stats. Using the same data set, what is the median of this data set? Remember, you need to sort the data first, find the median position, and then answer the question. Is it 23.5? The median. Median. Yes, it's 23.5. Okay, how did you get that? We don't want just the answer. Remember, you need to tell me what the median position was. First, we need to sort the data. Let's sort the data. Seven, 10, 15, 17, 30, 67, 111. I'm sorry, 102 and 111. One, two, three, four, five, six, seven, eight. One, two, three, four, five, six, seven, eight. You always need to double check your information before you do anything.
Because they asked you about the median, let's go find the position first. The median position is n plus one, divided by two. Like we said, we're using the Newman error prompts. Remember, first collect the information that is relevant to the question, and then start calculating. So we identified what the formula is: n plus one, divided by two. And this median formula will not be given to you in the exam. Oh, anyway, probably you are writing an open-book exam or something like that, I don't know. But you need to know these formulas, because from now on we're going to introduce a whole lot of formulas. For some of these things, you need to have a formula sheet where you highlight and identify them so that they are easy to find and easy to use. So there were how many? Eight plus one, divided by two. What was the median position? 4.5. It's 4.5, so we're going to count. One, two, three, four, point five. It's between 17 and 30. So our median is 17 plus 30, divided by two. Our median is? So it's 47 over two, which is 23.5. Which is 23.5. And then you can come and check your answer. Remember, do not skip steps. Only in the exam can you use shortcuts, but while you're still practicing and learning the concepts, try to write down everything, all the steps. Okay, let's look at number four. I think there are two last questions and then we'll be done. Calculate the mean, the median, and the mode of the following data: 190, 104, 135, 314, 179, and so on until 131. Remember the mean? We're going to assume that this is a sample. The mean is the sum of all observations divided by n. You also need to remember that for the median you will need to find the position first, and the formula will give you the position. But first, when you calculate the median, you need to sort your data. I'm going to sort the data for you. You calculate the mean. Skip to one number three. Are we done? Yes.
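For the eight-value data set worked through above (15, 7, 10, 17, 30, 67, 111, 102), the hand steps can be cross-checked against Python's built-in `statistics.median`. This is only a sketch: the variable names are arbitrary, and the by-hand part assumes an even number of values, as in this exercise.

```python
import statistics

scores = [15, 7, 10, 17, 30, 67, 111, 102]

# Step 1: sort the data from lowest to highest.
ordered = sorted(scores)
print(ordered)              # [7, 10, 15, 17, 30, 67, 102, 111]

# Step 2: median position = (n + 1) / 2.
n = len(ordered)
position = (n + 1) / 2
print(position)             # 4.5

# Step 3: position 4.5 lies between the 4th and 5th values (even n assumed),
# so average them. Positions are 1-based, list indexes are 0-based.
below = ordered[int(position) - 1]
above = ordered[int(position)]
median_by_hand = (below + above) / 2
print(below, above, median_by_hand)   # 17 30 23.5

# Cross-check against the standard library.
assert median_by_hand == statistics.median(scores)
```

The library call hides the sorting and the position step, so it is useful for checking an answer but not a substitute for writing the steps down while practicing.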
Okay, I see the answers in the chat; you guys are on a roll. So many answers, I see. Okay, so let's get to that. The mean is the sum of all of them divided by n. If you add all of them, how much do you get? 1,671. 1,671, divided by how many there are? 10. The answer is 167.1. Am I writing it right, Happiness? Yes, yes ma'am. For our median position, I've already sorted the data because I wanted to save time. Our position is n plus one divided by two. There are 10 of them, plus one, divided by two. 11 divided by two. 5.5. Which is 5.5. And our median, one, two, three, four, five. Five. It lies between two values. So our median will be 146 plus 170, which is 316. 316 divided by two. 158. The last question is, what is the mode? No mode, ma'am. No mode. No mode. Yes. I can hear someone even giggling. So it means there's a light at the end of the tunnel. So you now understand the measures of central tendency. Okay, I do have two more, but the last one I'm just going to talk you through. We can't answer it now. So let's look at the last question. The following data represent the number of children in a sample of 11 families. In stats, there are key words that you always need to watch for. Here they tell you it's a sample, so you know that you're going to use the sample statistics somewhere, and 11 is your small n. The families are from a certain community, and they give you the data set, where the families have two children, zero children, four children, one child, no children, five children, one child, one child, four children, zero children, and two children. Okay. You are writing multiple choice questions, and with multiple choice questions, sometimes things are very tricky. Here they're asking you which one of the following statements is correct. So we need to find the correct answer. The first statement is: the distribution is positively skewed.
Do you still remember how we find out whether the distribution is positively skewed? We can use the measures of central tendency to find that. I didn't talk about positive skew earlier. When it's positive, the tail is to the right, towards the positive values. So this is the same as asking, is it right skewed? And I hope you did write all those things down. When it's right skewed, it means the median is less than the mean. So it means we need to calculate the mean and find the median. Statement two is about the median. This one says the mode is equal to the median. This one is about the mean being equal to the median, and this one about the mode. So it means we need to find the measures of central tendency. So let's firstly quickly sort the data. The zeros: one, two, three zeros. The ones: one, two, three ones. Then one, two, two twos. Then one, two, two fours, and one five. So I've sorted the data. Let's go. Let's first find the mode. It's easy. What is the mode here? Which value appears more often than the others? Zero and one. We have three zeros and three ones, so it's bimodal. The modes are zero and one. Okay, let's calculate the mean, which is also an easy one to calculate. The mean is the sum of all of them divided by how many there are. So add all of them. It's 1.82. I want the sum of all of them. Which is 20. Divided by one, two, three, four, five, six, seven, eight, 11. 20 divided by 11, which is equal to? 1.82. 1.82. Let's go find the median position, n plus one, divided by two. We know that there are 11, plus one, divided by two. 12 divided by two is six, right? Six. So for the median, we need to go and count. We use the sorted data, right? We're going to use the sorted data. One, two, three, four, five, six. So the median is one.
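The numbers for the children question can be verified with Python's standard `statistics` module, followed by the mean-versus-median rule from this session for the direction of skew. The `shape` helper is a hypothetical name written for this note, and the comparison it makes is the rule of thumb taught above, not a formal skewness statistic.

```python
import statistics

# Number of children in the sample of 11 families, as read out above.
children = [2, 0, 4, 1, 0, 5, 1, 1, 4, 0, 2]

mean = statistics.mean(children)              # 20 / 11
median = statistics.median(children)          # 6th value of the sorted data
modes = sorted(statistics.multimode(children))

print(sorted(children))    # [0, 0, 0, 1, 1, 1, 2, 2, 4, 4, 5]
print(round(mean, 2))      # 1.82
print(median)              # 1
print(modes)               # [0, 1]


def shape(mean, median):
    """Rule of thumb from the session: compare the mean with the median."""
    if mean > median:
        return "right skewed (positive)"
    if mean < median:
        return "left skewed (negative)"
    return "symmetric"


print(shape(mean, median))  # right skewed (positive)
```

With two modes (0 and 1), a median of 1 and a mean of about 1.82, only the statement about positive skew holds for this data.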
So now I've got my values. I can come to the question, check every statement, and see which one is correct. Okay, so let's start with number one. Let me change my pen color, since I've been using red. I've been shouting the whole time. So now let's use a color that is not shouting. Let's use blue. Okay, so the first statement says the distribution is positively skewed. So what is the median? The median is one. What is the mean? The mean is 1.82. So the median is less than the mean, which means the statement is correct. In the exam, you don't have to go through all of them. But for the sake of practicing, you need to go through all of them to see if you understand. The median is five: we know that the median is one, so that is incorrect. Only the mode is equal to the median: the modes are zero and one, and the median is one. That is not true, because there are two modes. If they had said only one of the modes is equal to the median, fine, but they say only the mode, and we know that the data is bimodal. So it cannot be. Only the mean and the median are equal: the mean and the median are not equal, so it's not correct. The mode of this data set is equal to one: this data set has two modes, so it is not correct. And that is how you will answer the question. In the exam, if the first one you see is correct and you have done everything right, that is your answer and you move to the next question. But while you are practicing, please go through all the statements so that you understand how the questions are asked. Even when you are writing your assignment now, go through all the options to make sure that you understand, because they teach you something. So we say they help you to learn as you are doing. Okay, we have run out of time, and this is one of the other typical questions that they give in the exam. They will give you a stem-and-leaf plot and ask you questions like find the median, find the mode, find all that.
I'm not going to go through this at the moment, but you can go and look at some of these types of questions. It can be your exercise, or let's call it the homework. If you have any questions, remember you can find me on WhatsApp. When you post questions on WhatsApp, please don't give just the answers. As you can see, we have gone through the activities and exercises by showing how we got the answers, so please do that when you post your question, so that we can see where you went wrong and then correct you. Okay, so take a screenshot of this and then you can do it in your own time. Otherwise, the notes are posted on the links that I gave you the other time. Unisa still has to post the notes on their website as well. Let's quickly recap. We've learnt to organize the data. Can I have yours? I will give you just now. Let me just do this and then I will stop the recording. We have learnt how to organize numerical data by using the ordered array, frequency distributions, the stem-and-leaf plot, the histogram, the polygon and the cumulative polygon. We also learnt how to calculate the measures of central tendency. And that is what you have learnt today. Any question, any comment? I will post the links to the WhatsApp groups and all that just now. Any comments or questions? Yes. I just have a question. Can you go back to the homework? I have a question regarding the stem and leaf. With the stem and leaf, I see that when we start our stem, we can skip the earlier stems. So you can skip one and two. But as you go down and there's no data for eight, you still have to write eight. So you can't write six, seven and nine. You have to put eight there. No, you don't have to. It's just that on this one they included it. But you don't have to write it, because if you look at this as well, there is no 90, 91, 92, 93, 94. They just want to confuse you. So you can skip eight, and this five can also move and be closer to the line.
It doesn't have to be this far. Yeah, the leaf is fine, it was just the stem. I wasn't sure if you have to follow the sequence or if you can skip numbers when there's no data for that specific stem. Yeah, so you can skip that. Thank you. All right, any other question? Lisi, can I have your WhatsApp number, please? I will, but I first need to deal with the content that we just went through, and then I'm going to stop the recording and stay on. Any question relating to what we have just gone through? If there are none, just give me a sec to close off the session. Thank you guys for coming. Remember, if you have any question or query regarding administrative issues, where are the notes, why are they not there, where is the recording, you need to send an email to CTNTAT. I'm going to post it there as well, CTNTAT at unisa.ac.za. That is the email address you need to write to. If you want to send me something privately, not on WhatsApp, you can use my email address. It's eboyem at unisa.ac.za. I will also add it to your email. But if it's a long thing dealing with content, please copy CTNTAT at unisa.ac.za. They also want to know if I am assisting students outside of this environment. So you need to copy them as well. And enjoy the rest of your weekend. Please hold on. Don't go, I'm closing off the session. Excuse me, sir. Is justice? There are many M.S.