I apologize for the 16 minutes of technical issues that we experienced. When you join, switch off your video and mute yourself. In today's session we're only going to do chapter one and chapter two, and those are the topics that we're going to be discussing today. We'll just run through the basic concepts, it will be quick, and then we'll look at the types of variables. There will be some exercises as well that you can engage with, and they will prepare you to do your assignments. I don't think we will take a five-minute break now, because I'll waste 20 more minutes. So we will continue straight through, but we might have a five-minute break later, because long stretches of screen time are not useful and we need to step away from the screens now and then. Then we will end with the visualization. If we don't finish, it's fine. We can start again on Saturday where we left off and then do the rest of session three on Saturday. Okay. So we're going to start with the basic concepts of statistics. Before I realized my mic was muted, I asked the question: can anyone tell me what they think statistics is about? We know which areas use statistics in everyday life; almost every day we hear about them. So can you give an example as well when you explain what statistics is about? Just one or two people, just to gauge your understanding. What do you think statistics is about? You can unmute yourself and say it out loud. Nobody? Someone says the measurement of data. Sorry? Measurement of data. Okay, that's putting it very simply. Anything else? Nothing. Okay. So since there is nobody who wants to say, let me unpack statistics for you. By the end of 30 minutes, now it's going to be by the end of the last 15 minutes that we have on this, you should know what statistics is. 
You should know the concepts that we use in statistics, and you should understand the types of variables and the different levels of measurement. Later on, we will look at the visualization of these variables. Okay. What is statistics? Statistics is a way in which we take data that lies in multiple sources, in an Excel spreadsheet, in databases, or in our CRM systems, enrich it by doing some calculations, such as calculating averages and so forth, and then visualize it in tables and charts and present it so that people who make decisions can make an informed decision based on the data that they see. So in a nutshell, statistics is a method by which we transform data into useful information for decision making. That's all that statistics is about. Why do we study statistics? Let's say we want to develop an appreciation of variability and how some products are affected by price changes and so forth. Or if we want to improve processes, or invest in more systems, or improve the systems that we have, then we can use statistics to help us with that. We can also use it to estimate the present and predict the future, which means we can forecast. You can see that now with the coronavirus, where people are talking about the data for today, but they can also tell you that they predict or forecast that the numbers will rise or drop. That's part of why we study statistics. The areas of application are very different. Some people do statistics in econometrics, some in biostatistics, and some in many other areas. For example, I work in the business intelligence department at the university, so I use statistics on a daily basis to help the executive make decisions. We use the present data to forecast or predict the future. 
We want to know how many students we want, or by how many students we want the university to grow, and we use statistics to create those forecasting models. We also use statistics to understand some basic ideas of statistical reliability and stochastic processes, which include calculating probabilities and chances to see what's going to happen in the future. For example, I'm going to use the coronavirus and the lockdown: what are the chances that when we open up alcohol and tobacco sales the number of coronavirus cases will rise? All those things are put through tests where they check the probabilities. Statistics is very important in every aspect of society, including government, businesses, and even our daily lives. If you know statistics, you can apply it in your everyday life. You can calculate the average amount of money that you spend on groceries over the years, so that you can cut back on some of the things that you don't need and buy the things that you really have to buy. We can use statistics to do a lot of things, and mostly we use it to help solve problems. But statistics itself doesn't solve problems. It highlights the areas of concern so that the decision makers can make changes and solve those problems. We define statistics as taking data and turning it into information in order to build new knowledge, so you can build totally new knowledge out of the data that you collected. Where do we use statistics in everyday life, and why is it important? We use it to help cure diseases. For example, with the coronavirus, we can predict where the virus is going and where it comes from. 
We can check and test the data, calculate confidence levels, and make predictions about the coronavirus. We use statistics in politics as well. People run surveys during an election to see which party might win, and this can sway decisions. People make decisions based on what they see or observe. So once the statistics are put out there for people to engage with, people can make up their minds about which party they prefer, because they can see which party the majority of people are going to choose. So we can use it as well to influence ideas and decisions. We use statistics to forecast the weather. The people who work in the weather centers use statistics to predict whether there will be rainfall, whether it's going to be cold, whether a cold front is coming, how strong the wind will be, and so forth. So that is statistics. It's very important. If you know it and you can use it on a day-to-day basis, it means you can solve many problems. While we talk about statistics, we're talking about summarizing data, which means that with statistics we are able to describe data. When we describe data, it's data that we would have collected. So there is a branch of statistics for that. We will be discussing two branches: descriptive statistics and inferential statistics. Descriptive statistics is a method where we collect data via surveys, questionnaires, or from the CRM systems, and then summarize it by using tables and charts, or measures such as the averages and the medians, and we calculate the standard deviation and the variability of your data. 
That is, how far your data is spread from the mean. The other part of statistics is inferential statistics, which is about inferring from your data back to the population that you are studying, so making decisions about the population from the sample. For that purpose we can do estimation, where we estimate an appropriate value for a population measure. We can also use what we call hypothesis testing, where we test one view against another. So we might say we know that the average number of people who get infected in the Western Cape is 300 per day, and we can do a hypothesis test against a claim that says the opposite, that it's not true: the average number of people who get infected with coronavirus is less than 300 per day. We use hypothesis testing to test that. And like I said, we infer the result back to the population. By using these methods we draw a conclusion about the whole population from data that we collected from a small subgroup, which is your sample. Okay, so once you understand those two key branches of statistics, descriptive statistics and inferential statistics, you also need to understand some of the terms I've raised. I've talked about the population. I've talked about samples. What do they mean? A population is the set of all elements or items to be studied. So let's say I want to study South Africa. Everybody who stays in South Africa, whether you are a foreigner, or you are here as an asylum seeker, or you are a South African, and so forth: as long as you are within the boundaries of South Africa you will be part of my population, because I want to study what's happening in South Africa. 
But since the population is so huge and it's very expensive to study the whole population, it is best to select a subset from the population, and that subset we call the sample. Before I move to the sample, take for example the census. A census is a study of the entire population of South Africa. When they do the census and collect information from people, they ask questions: who are you, where do you stay, do you have water, do you have electricity, do you have a flushing toilet, all those things. After they have collected that information, they start summarizing it, and when it comes to analysis they calculate measures like the mean, the standard deviation, and the variance. They are calculating what we call parameters. Those are the measures that are used to describe what is happening in the population, and we call them parameters. Since the population is so big that we cannot collect data from every individual person in South Africa, we select a sample, and the sample is just a subset of your population, a small group. There are techniques that are used to select it. If you use a probability sampling method, you select data from the population to create a sample, you do the analysis, and you should be able to infer those results back to your population. But if you use a non-probability sampling method, like convenience sampling and so forth, you cannot infer the results back; they will just be an opinion about the sample that you have selected. 
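To make the difference between a parameter and a statistic concrete, here is a minimal Python sketch. It is only an illustration: the population of household sizes is made-up data, the sample size of 50 is arbitrary, and simple random sampling stands in for probability sampling in general.

```python
import random
import statistics

# Hypothetical population: household sizes for 1,000 households (made-up data).
rng = random.Random(42)
population = [rng.randint(1, 8) for _ in range(1000)]

# Parameters: measures computed from every element of the population.
population_mean = statistics.fmean(population)
population_sd = statistics.pstdev(population)  # population standard deviation

# Simple random sample: every household has the same chance of being selected,
# which is one form of probability sampling.
sample = rng.sample(population, 50)

# Statistics: the same measures computed from the sample only.
sample_mean = statistics.fmean(sample)
sample_sd = statistics.stdev(sample)  # sample standard deviation (divides by n - 1)

print(f"parameters: mean={population_mean:.2f}, sd={population_sd:.2f}")
print(f"statistics: mean={sample_mean:.2f}, sd={sample_sd:.2f}")
```

The two printed means will usually be close but not identical, which is why results from a sample are inferred back to the population rather than read off directly.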
Luckily, in your module you do not have to worry about the process of selecting a sample from a population. All you need to know is what a population is, what a sample is, and what the measures calculated from a sample are called: statistics. So when you calculate the mean, the variance, and the standard deviation from a sample, we call those measures statistics. Those are the key concepts that you need to know: what the population is and what the measures that come from the population are, and what the sample is and what the measures that come from a sample are. Okay, do you have any questions? I know that I said we're going to be interactive, but it feels like I am talking too much. Okay, if there are no questions and everything is clear, then we can move forward. And here is your exercise. I'm going to give you only one minute to think about it and then we're going to have a discussion about it. In a hospital, seven randomly selected patients have the following blood types: O, A, B, B, A, O, O. In this case, identify what the population is, and after you have identified your population, identify what your sample is. You have one minute. Okay, you can unmute yourself if you want to give an answer. What will be the population of this study? "The population is seven and the sample is four." Do you agree, and if you don't agree, what is your answer? Anyone? She's saying the population is seven and the sample is four. Do you agree? No. The population would be all the patients in that specific hospital, and the sample would be the seven randomly selected patients. The population is all patients and the sample is only the seven selected. 
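As a small sketch of what working with this sample looks like, the seven blood types can be tallied in Python. This assumes the seven sampled values are O, A, B, B, A, O, O; the variable names are only illustrative.

```python
from collections import Counter

# The sample from the exercise: blood types of the seven selected patients.
sample = ["O", "A", "B", "B", "A", "O", "O"]

counts = Counter(sample)   # how many patients fall into each blood type
n = len(sample)            # sample size
shares = {blood_type: count / n for blood_type, count in counts.items()}

for blood_type in sorted(counts):
    print(f"{blood_type}: count={counts[blood_type]}, share={shares[blood_type]:.2f}")
```

The counts describe only the seven sampled patients; the population, all patients in the hospital, is never listed in the exercise.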
That's how you define your population. Remember, the population is all elements of interest, so everybody in the hospital. Because here we're talking about patients, it will be all the patients that are admitted to the hospital or that are in the hospital, and the sample will be only the seven that were selected. As long as you see the words "randomly selected", it means they have selected a sample. Any questions before we move to types of variables? We are right on time. Okay, if there are no questions. Remember, if you're getting lost and you don't understand, this is your chance. On myUnisa we can't even talk, so this is your chance to raise your voice and say when you are lost, because this is a rare opportunity where UNISA allows us to do online sessions; usually they don't allow this. Okay, as we continue with the key concepts, we need to understand the types of variables. To recap what we did just now: we spoke about the population, and we said we select things from the population and measure them, and those measures become what we call parameters, or we select them from the sample and measure them, and then they become statistics. What are those things that we're supposed to be collecting in order to measure? Those are the things that we're going to be discussing right now. A variable. A variable is a characteristic that describes an item, an element, or an individual, and a variable is something that you can observe or measure. For example, take the census: when they collect information and ask you how old you are, that is a variable; it describes how old I am. If they ask me whether I am female or male, it's a variable called gender; they're asking me my gender. 
These days we don't only talk about gender, because people identify themselves differently, so they will ask you about your sex and you can say, I identify myself as a female or a male. When we talk about sex or gender or income group, those are variables, because they are characteristics that describe an item or an individual. If I buy a pen and the color of that pen is red, that color is a variable, because it describes the pen that I'm buying. Okay, for example, you ask me my sex or my gender and I tell you that I'm a female. The minute I say I'm a female, that is a data point, and data is the set of values that are associated with a variable. So gender will be your variable, and the value that goes with the variable will be either male or female, and that is what data is about. Likewise with the color of a pen: black, green, red are what we call data, and that's the thing that we summarize. Now that we have defined what a variable is, for example gender, we also need to understand what type of variable it is. Is it measured or is it observed? Remember, I said a variable can either be observed or measured, so here we're going to unpack that. There are different types of variables. We can get a categorical variable, which produces categorical data; you will constantly hear me interchange the two. A categorical variable is a variable whose values can be placed into categories, and we can also call it qualitative, because of the quality part of it. It describes a quality, which means the color can define what quality of a
pen it is, in terms of its color. Or let's put it in a nicer and more understandable way: a category, because if I have lots and lots of pens I can group them based on their color, so I can put them into color categories. Then we also have what we call numerical data. Numerical data is data that can be measured or counted, and we can also call it quantitative data; the variable is a numerical or quantitative variable. I'm used to saying quantitative data and qualitative data; now I'm going to talk about variables so that we don't get confused. So a numerical variable is a variable that you can count or measure, and we call these quantitative variables. Take a variable that can be counted, like the number of children I have. I can count the number of children. I don't have half a child, I have a whole child, so I count them: one, two. If a variable can take only whole-number values that you count, it is a discrete variable. If the variable contains data points that are decimal in nature, then we say we are measuring them, and it is a continuous variable. We use a measuring tape to measure height; we cannot count how tall you are, we measure you with the tape or the measuring stick. We measure temperature in the same way. So continuous data is data that we measure, and discrete data is data that we count. Discrete: how many children do I have? I can count them. Continuous: I need to get the measuring tape to measure my height, or a scale to measure my weight, because I cannot count it. Now for discussion: what about age? Who can tell me what age is? Is it discrete or is it continuous? What is the variable age? It's
continuous. Why? Well, because you can measure age in days or weeks or months or years. Thank you very much, yes. If you are a female and you have already given birth to a child, you will know that when your child is born they tell you, oh, you have a beautiful son, born at 12:06. Already your child was born in a continuous-variable manner: he was not born at 12, which is just a whole number, he was born at 12:06. They are just too lazy to also capture the seconds, and I don't know at what second they would record, but they record the minute that the baby is born: the baby cries, and then they record that the baby is born. So time is recorded in hours, minutes, and even seconds, so your age is continuous. And there is your other exercise, since we understand what types of variables there are. It is the same exercise that we had previously in terms of the data. Here you need to identify what the variable is, what the data is, and whether the variable that you have identified is numerical or categorical. You have one minute; it's not going to take you long. Okay, you can unmute and let's have a discussion. What is the variable? I would say it's categorical. No, I need to know what the variable is. Oh, blood type. The variable is blood type. Wait, sorry, I just realized something: is the meeting recording? Okay, it's recording. And what is the data? The data would be the seven patients. No, the data is the blood groups. The blood groups O, A, B, B, A, O, O are what we call data. Remember, the data is the values that are associated with the variable. Now that we know what the data looks like, what is this variable? Is it numerical or is it categorical? Categorical. It is categorical data because we can put it into
categories by grouping the blood types together. With numerical data we can apply the mean and the median and summarize it in that way, whereas with categorical data you cannot calculate the mean, the median, the standard deviation, and so forth. So, number two, which will then lead us to that five-minute break: which of the following variables is not a categorical variable? You have the height of a person; the gender of a person; the achievement scores of grade 12 learners as high, average, and low; and the choice of whether a test item is true or false. Just look at that, and then we will go through each statement and identify the type of variable. So let's go through each statement. The first statement: is this categorical or numerical? Numerical. It is numerical because height is something you can only measure, so this will be numerical data. And the second one? That's categorical. The next, achievement scores as high, average, and low? Categorical. When you look at the first part of the statement it says achievement scores, and usually a score will be numerical, but the added information makes this a categorical variable. And the last one, the choice of whether a test item is true or false, is also categorical. So you can see which one is not categorical: number one is our correct answer. Which of the following variables is not a categorical variable? It is the height of a person. Oh wait, sorry, let's say ladies first. Yes? So in terms of the descriptors of your variables, is it always either categorical or numerical? Is it always an option of those two? Yes, they will say either categorical or numerical. But remember as well, at a later stage you will see we will be adding the levels of measurement, so they might say this variable is numerical
or, remember as well, they might not say numerical but call this quantitative data. Remember that they can interchange the two ways. For categorical they may say qualitative data or qualitative variable, so they can use either one of the ways. Okay, sorry, so are you saying qualitative is also categorical and numerical is quantitative? Okay, all right, got you. And the gentleman, you were asking a question? Yeah, I wanted to ask a question, but you touched on it when you were talking about the levels. I wanted to ask about the numerical one, the first one, whether it is discrete or continuous, but you said you were going to look at it. That's what we're going into now. Thank you so much. Okay, so now that we understand the two types of variables, qualitative or categorical and quantitative or numerical, let's understand the levels of measurement. You will hear me talk about scales of measurement or levels of measurement; they are used interchangeably, so sometimes they say scales of measurement and sometimes levels of measurement. For categorical data we have two levels of measurement: nominal and ordinal. Categorical data that does not have an order or rank is at the nominal level of measurement. So a nominal variable is a variable whose values can be placed into categories, but the categories do not have a logical order or rank, and you cannot use it for calculations such as the median, the standard deviation, and so forth. It can be used in comparisons, because you can compare among the groups to see how many males and females are in a group. You can use it in that manner, but not in a mathematical comparison way,
but, as in the example I just gave, with gender or race there is no order. With political affiliation there is no order in which one is higher than the other. There are so many other types of nominal data out there, like types of cars or their manufacturers, or the types or manufacturers of cell phones. You can own a Nokia, a Samsung, or an Apple; there is no order. Okay, then we have another level of measurement, which is ordinal. It is also for a categorical variable, and that variable has an order or rank to it. For example, when you walk into a bank they always have that "rank my service" device at the teller, and they ask you to rank the service from zero to ten. Or when you call GSTV, sometimes they will say hold for the survey at the end of the call, and they ask you to rank the operator that helped you on a scale from zero to ten, with zero being poor or not being able to help you and ten being excellent. So there is an order in which you rank that level of service, and that is an ordinal scale or level of measurement. As with nominal, you cannot use it in calculations. Somehow we do try to use it to compare things, but in your first-level module you just need to know that you cannot use it in any calculation: you cannot calculate the mean, the median, and the standard deviation. You can use it to compare, because you can see how many people answered at the most favorable end of the scale or at the least favorable end, things like that. And you can put it in order, from zero to ten, or if it's a five-point scale where they're asking you whether you agree or disagree, they will say strongly agree, agree, don't know, disagree, and strongly disagree. You can rank them, and when you visualize them you also put them in that order. And we can use this in so many other
things, like dress size, level of satisfaction, education level, or even your rank or position at work, which are ranked, or the salary scales, which are ranked. With salary scales, maybe you are on A or B or C or D, something like that. Some companies use letters, some use numbers, some use the Peromnes scale. Those you can put in order, so we call them ordinal. That is only for categorical data. For numerical data we have two scales. Sometimes the names are used interchangeably, and there is one scale that doesn't feature at all, but for the purpose of your module there are two scales, the interval and the ratio. The interval level of measurement is an ordered scale that shows the difference between two measurements, and the difference is a meaningful quantity, but it does not have a true zero point. What do I mean by saying it does not have a true zero point? There are only two or three things that can be classified on the interval scale. Since they do not have a true zero point, it means zero is just another number. Take temperature, how hot it is: when you look at the temperature, it can go into negative degrees. Like I said, there are very few variables that you can assign an interval scale to: temperature; your bank balance, where it can go into debit or credit, so negative. What else goes negative? Does sea level go negative? When you are above sea level it's positive, and when you go below sea level it becomes a negative number. I don't know. So a numerical value that can assume both negative and positive values does not have a true meaning of zero, because zero is just another number like any other number. For temperature, zero just means it's cold, things like that. And the other level of measurement is what we call the ratio, and on a ratio scale, as with the
interval scale, you are able to calculate the difference between measurements and get a meaningful answer. You can calculate the distance between home and church, or home and school, and it has a true zero point. The meaning of zero is there, because if you traveled zero distance it means you didn't travel; you didn't move, you haven't gone anywhere. So zero means nothing: it doesn't exist. That is the true meaning of zero that we're talking about. Which values can have a true meaning of zero? Your weight has a true meaning of zero, because if your weight is zero it means you don't exist; nobody can have zero weight. Your height cannot be zero, because then it means you also don't exist. A building cannot have zero height, because then it means it was never built; it doesn't exist. Your age cannot be zero, because then it means you never existed, things like that. So those variables that have a true meaning of zero, we call them ratio. You should know the difference between the two and how to classify variables in terms of the levels of measurement. Okay, so now I'm going to flash a statement, and anybody can say what the level is. We don't have to worry about discussion; I flash it and then you say whether it's nominal or ordinal, and so on. To recap: categorical data takes a level of measurement of nominal or ordinal. Nominal: no order or rank, no natural order. Ordinal: there is an order or rank. Interval and ratio are for numerical or quantitative data. Interval: there is no true meaning of zero. Ratio: there is a true meaning of zero; zero means nothing is there. Okay: the weight of a watermelon in stone. Is it nominal, ordinal, interval, or ratio? Ratio. Time
of day, as morning, afternoon, evening, night. Nominal? Interval? Remember, time of day as morning, afternoon, evening, night: is it categorical or numerical? It's categorical, so it takes only one of two levels, nominal or ordinal. This is ordinal, because morning goes into the afternoon; there is an order in how the day proceeds. Distance from your place to the nearest five grocery stores? Ratio. It's a ratio. Airline companies serving a given airport? It's nominal. It is nominal because the airline companies are like Safair, BA, Kulula, Mango, SAA, so that will just be nominal. Okay, any questions? We have 15 minutes to complete what I need to do, so I'm going to ask and plead: can we extend our time, since we ran into problems, so that we can finish today's session? I think by quarter to nine we should be done. Silence means you agree with my statement. We will finish at quarter to nine. Thank you. "Yes, we do." You are such a wonderful group. The places in a ranking of chess players, first, second, third, and fourth: this is ordinal. Okay, any questions? Then we go into visualization, which is organizing data. If there are no questions and you are happy, we can move into visualization. We will start by looking at how we visualize a qualitative variable, because that is the easy one. In between the slides I have some exercises. I might skip those exercises so that we are able to finish, but you will have an opportunity to go onto myUnisa to do lots of exercises that are based on the content, especially on study unit one and study unit two. When we're done today I will open up study unit two as well. So by the end, at quarter to nine, you should know how to construct tables and charts for numerical data and also for categorical data. Since we know what categorical data is, it is data
that we can put into categories. For categorical data there are very few things that you need to know in your module. You can create a frequency table, which is also called a summary table or a frequency tabulation, or you can create a graph, which can be a bar chart or a pie chart. You do not have to worry about the Pareto chart; in your module you do not learn what the Pareto is. So those are the only three things that we are going to cover now.

Visualizing categorical data: a summary table looks like this. It is just a table that shows you the categories and the percentages. We can use either percentages or frequencies, which are counts. You do not even have to discuss anything: in your module they do not expect you to explain what 38 percent means. All they want to know is whether you know the properties of the table: that you can only use categorical data to create a summary table, and that a summary table is made up of the categories of your categorical variable together with the counts or the percentages. That is all we need to know.

Like I said, there will be some exercises in between, and I am just going to do the exercises for you. I did not explain how you get the percentages, so let's take an exam example. In the exam they give you this table, and because these are frequencies, or counts, they ask you: what is the relative frequency? What they mean is: what is the percentage? Therefore you have to calculate the percentage from this table, and to do that you need to create a total. In the exam you will go as quickly as possible, because you will want to answer the question as quickly
as possible, but for now let's work it out. So we go and we say 160 plus 246 plus 94, and it equals 500. Anyway, I did not even have to calculate it, because they told me that the random sample is 500, so I should have known that the total is 500. Now they are asking: what is the percentage for the ANC? The ANC has 160, so I will say 160 divided by the total, which is 500, and that gives me 0.32. When they ask you about the relative frequency, they can ask you to give it as a percentage or as a decimal. Here the answer is 0.32, which is your relative frequency. If they want a percentage, you multiply this by 100, and that gives you 32 percent. That is what they might ask you in the exam.

The other method of visualizing categorical data is what we call a bar chart. If you look at the bottom there, we have a bar chart. A bar chart is just a set of bars: the bars represent your categories, and the height of each bar represents the frequency, or sometimes the percentage. That is the basic thing you need to know about visualizing categorical data with a bar chart. The other property you need to know is that a bar chart has spaces in between the bars, because the bars never touch.

The other type of graph that you can create is a pie chart. A pie chart is broken into slices: the slices represent your categories, and the size of each slice represents the percentage, or sometimes the frequency or count.

Like I said, you do not have to worry about what the Pareto is. But the Pareto, in case in future you
just want to know, combines categorical data with numerical values. The numerical values are just the cumulative values of the categorical data: we use the cumulative values to show how the categories add up to 100 percent, and that creates what we call a Pareto chart.

Okay, another exercise: which one of the following graphical representations can be used to display qualitative data? From what we just did, we know that qualitative data can be shown in a frequency table, a pie chart or a bar chart. So let's look at the options. A histogram: we never talked about a histogram. A pie chart: yes, we did talk about the pie chart. A scatter plot: we never spoke about a scatter plot. An ogive: we never spoke about an ogive. A frequency polygon: we never spoke about one; we touched on the idea with the Pareto, but we never said it is a frequency polygon. So the correct answer will be a pie chart.

Now let's look at how we visualize numerical values. First we need to create what we call an ordered array, which means we sort the data from the lowest value to the highest value. From the ordered array we can create a frequency distribution and a cumulative distribution table. With an ordered array we are also able to create what we call a stem-and-leaf plot, which shows you the distribution of your data. A frequency distribution is like your summary table, but for numerical data: it gives you the frequencies and also the percentages, and it looks almost exactly the same as a frequency table, but it is meant for numerical data. Once we have a frequency distribution table, we can create what we call a histogram from the data that we have summarized. We can also create a frequency polygon, or a cumulative frequency polygon, which we call an ogive.

Okay, I will go through this quickly as well. An ordered array is a way of organizing the data from lowest to
highest. You rank your data in that manner, from lowest to highest, and it makes it easier to see your data points. For example, here I have the day students, and I can see the ages of those day students and quickly browse through them. Sometimes it is not easy to recognize patterns in a table, but I can see that we have two 20-year-olds, three 18-year-olds, and so forth. If I look at the night school, I can see that the youngest night student is 18 years old, whereas in the day group it was 16, and that the oldest person in the night group is 45, whereas in the day group it was 42. I can also see that this one looks smaller than the average of your day students, but it does not give you much. So we can take this data and visualize it in a different way, using a stem-and-leaf plot.

The ordered array also helps you to see what your range is. The range is just your highest value minus your lowest value. It also helps you to identify whether there are any outliers, though those can be difficult to identify. If, for example, we have a five-year-old in the day school, it means we have a problem: that is an outlier, because we cannot have somebody who is five years old at the college. Or we might have somebody who is 96 years old at the college, although these days we do have those. So we need to investigate what that outlier is all about, and fix the data if it is a data problem.

Like I said, it is not always easy to analyze data in a table, but we can use visualization. For numerical data we use what we call a stem-and-leaf plot, which organizes the data into groups, and these groups are called the stems. For example, with the data that we just looked at, if I go back, all these first digits, which are the first numbers, we
call them the stem, and each stem is related to its leaves. The leaves are the values sitting next to the stem. If I look at this, six, seven, seven, eight, eight, eight will be my leaves, but they all have the same stem. From 16 until 19 they all have the same stem, which is one, so they will all be grouped under one as my stem.

So how do we draw this stem-and-leaf plot? Using the same data that I was using, we will draw one for the day students and one for the night students. Remember, the data is sorted. We first put down all the first digits: we know that the first digits are one, two, three and four, so we put them there. Then we put in all the leaves. Against the one we put all the values, even when a value repeats itself. If it repeats itself ten times, you are going to put it there ten times. For example, with 18, 18, 18 we are going to put in all three eights.

Reading a stem-and-leaf plot is where the challenge comes, because in the exam, when they give you the stem-and-leaf plot, they might ask you: what is the lowest value? What is the highest value? What is the second-highest value? To read it, you must include both the stem and the leaf when you are interpreting the information. You cannot say it is six, seven, seven. You say it is sixteen, seventeen, seventeen, eighteen, and so on up to nineteen. When you read stem two, it will be twenty, twenty-one, twenty-two. So you always read the stem and the leaf together.

For the night students you can see the same thing. I can also see that both my night and day students are very skewed, and this one shows me that, sorry,
it is right-skewed, because the tail is to the right. For this one the tail is actually not that bad, but it is also right-skewed, because the tail is to the right as well. And that is visualizing numerical data that is in an ordered array by using a stem-and-leaf plot.

Now, in the exam, or even in your assignment, you will find questions like this, which assume you understand what the stem-and-leaf plot is. Here they say we have a stem-and-leaf display that describes two-digit values, which means the values run from 20 to 80. If I go back to our example, that is like saying it runs from 16 to 42; it means one and the same thing, so you need to use your imagination. For one of the classes displayed, the row appears with five as the stem and two, four and six as the leaves. What is this in terms of the data? Now you need to decipher this: you unpack it and write it out as numbers. They give you the stem and leaf and they want you to write it back as data values, so you will say this is 52, this is 54 and that is 56. Then I look at the options for the one that looks exactly the same, and I can see that it is option number three. That is how you unpack your stem-and-leaf plot so that you understand what it means.

Other types of questions that they might ask you about a stem-and-leaf diagram are, like I explained: what is the highest number, what is the lowest number? Because we have not done the mean and the median yet, I cannot ask you to do this exercise now, but when you do the practice exercises and go through your assignment and you see questions like this, you will know how to answer them. So, for example, here they gave you the stem-and-leaf plot
and they are asking you which one of the following statements is correct.

The first statement says the range is zero. We know what the range is; remember, we discussed this: the range is your highest value minus your lowest value. What is my highest value here? It will be the last point of this stem-and-leaf plot, and my lowest will be the first point. So my highest is 86 and my lowest is 36, and 86 minus 36 gives you 50. That is your range.

The next one asks about my fifth-largest value. I must start from the bottom and read the values up until I get to my fifth value: one, two, three, four, five. That gives me the fifth number from the end of the table.

The next statement says there are 32 numbers. Now here is a challenge, because I have to go and count each one of them. You count only the leaves, not the stems. So you say one, two, three, and so on, up to thirty-three. Did I miss something? I counted so quickly, so I am going to double-check myself: three at the top, then four, five, six, and so on, up to thirty-one, thirty-two, thirty-three. Okay, I counted right, so there are 33 values here, not 32. Remember, I am looking for the correct statement.

Then it says the mode is zero and five. We are going to discuss the mode on Saturday. Do I leave it for Saturday? Okay, I need it now to find the answer for this question. The mode is the most frequently appearing number, the
number that appears more often than the other numbers. If I scan through this, the leaf zero appears three times, but read with its stem that is 70, 70, 70. The statement says the mode is zero and five, so I am just going to say no: it should say 70. The mode here is 70, because only 70 appears three times; that is the number that appears more than the rest.

Then the last statement says the median is 64. Let's go and see. For the median I need to find the position of the middle number, and we use the position (n + 1) divided by 2. Our n is 33, so it is 33 plus 1, and I divide that answer by 2, and I get 17. That is a position, not a value; we are going to deal with this in detail on Saturday. I can start from the bottom or from the top to count the values, and again I count only the leaves until I get to position 17: one, two, three, and so on, up to seventeen. The median is 64, because I must also include the stem when I decode the value. So that is the correct one.

As you can see, this is probably what you will get in the exam as well. You will have to go through each statement, and you need to understand how to calculate each one of them using the stem-and-leaf plot.

Okay, so now let's look at another method of summarizing the data. I can see that we are running out of time as well, but no problem, we will be done just now. We can also use the frequency distribution table. Your data needs to be ordered, so I am using ordered data here. In the assignment or in the exam your data will not be ordered, so before you do anything you will need to arrange your data from the lowest value to the highest. Mine are already
sorted, so I can see that my lowest is 12 and my highest is 58.

The first step of creating a frequency distribution table is to calculate the range. We have dealt with the range: highest minus lowest, 58 minus 12, which is 46.

Then we select the number of classes. We select five, because we can see that this is not a big data set: there are only 20 values. It is best to select a low number of classes, because those classes will end up becoming your bars. If you create a histogram, it will look just like a bar chart, and you do not want a chart that has so many little bars that it looks flat. You want one that looks nice, and five is usually the right number to select.

So we select five classes, and then we compute the interval. The interval is how big each class should be: how many numbers we are going to accommodate in a class. For the width, we take the range and divide it by the number of classes, and in this instance it comes to a class width of 10. You will say, but then I am telling you to round it up: if you calculate this, you get 9.2, and rounding that to 10 defies the usual logic of rounding off. But I am rounding it up from 9.2 to 10 in order to create nice, clear classes, so that each class starts and ends on a clear number. You will see just now what I mean by this.

Now we use the class width to define our intervals. We can determine our first class just by looking at the data. Our data starts at 12, so we can start the first class at 10, add 10, and end at 20. So we say from 10 until 20, because our interval should only span 10 units. We start from 10, and then we add the 10, then
it is 20, and that is where the next one starts. So the first class does not include 20: it has to be less than 20. The next one starts at 20 and does not include 30, and so forth.

Once we have created all our classes, we can start assigning the data points to each class. We count each and every one of them. For those that fall between 10 and less than 20, we go one, two, three; 21 is above, so there are only three, and we record the three there. For those that are between 20 and less than 30, we go one, two, three, four, five, six. The class does not include 30, so this 30 will not be counted, and there are only six there. You do the same for all of the values until you have recorded all of them. When you add the frequencies up, the total should be exactly the same as the number of values you started with; there should not be anyone missing.

Then we can calculate the percentage: 3 divided by 20 gives you 15 percent, 6 divided by 20 gives you 30 percent, and you do the same for the whole table.

Now, what do we use this for, and how do we interpret it? You might be asked in the exam to interpret some of these values. But before we interpret them, they can also ask you to create the cumulative frequencies. At the beginning, those that are less than 20 will be the same as the frequency, so there are 3. Those that are less than 30 include everything from 10 up to less than 30, so it will be 3 plus 6, which gives you 9. Those that are less than 40 will be 3 plus 6 plus 5, which gives you 14. For those that are less than 50 you do the same. And when you get to the last class, the value you get there should be the same as your total, so it
should be 20, because it is the sum of all of them. For the cumulative percentage you do the same: 3 divided by 20 gives you 15 percent, 9 divided by 20 gives you 45 percent, or you can add up the percentages class by class. The last class will be equal to 100 percent, just as the last cumulative frequency equals your total.

Now, in the exam they might ask you a question like: on how many days was the temperature less than 50? When they ask in that manner, it means every day that was less than 50, so it includes all the previous classes as well. You go to the cumulative frequency and say there were 18. If they ask for the cumulative percentage, you say it was 90 percent. If they ask how many temperatures were between 20 and less than 30, you just go to the frequency and say there were six, and in terms of percentage you say it was 30 percent. That is how you answer those questions and interpret your frequency distribution table.

Here is another example of what I just explained. They are asking you: what is the frequency for the class 10 to less than 15? You look at the row for less than 15, and if you look at the heading there, it is a cumulative frequency. Because it is a cumulative frequency, the 21 in that row also includes the 15 from the row before it. So to find the frequency, you say 21 minus 15, and that gives you six, the actual frequency for 10 to less than 15. That is the answer you were looking for, and that is how you use a frequency table. They might not give you the full table, but they will ask you questions that relate to that table.

Okay, so once we have a frequency table, sorry, if you
have any questions, please stop me and ask. I know that I am chasing quarter to, and I am already at quarter to.

The frequency histogram is a visualization of numerical data based on the frequency distribution. It is like a bar chart for numerical values that uses the frequency distribution, but we call it a histogram. If you look here, this is the frequency distribution table that we have. The bars represent our classes, and the height, in this instance, represents the frequency. The other thing you need to understand about the histogram is that, because we are using the class boundaries, when one class finishes the next one immediately starts. That is why there are no gaps in between the bars. The labels here are the midpoints, so this bar means 10 to 20, the next 20 to 30, then 30 to 40, 40 to 50, and so on. So with the histogram there are no gaps between the bars, the class boundaries are shown on the horizontal axis, and on the vertical axis we show either the frequency or the percentage. Those are the properties that make up a histogram, and you need to know them going into the exam.

When we have a histogram, we can also tell the shape of the data that we are looking at. We can see whether it is symmetrical, uniform, right-skewed or left-skewed, or bimodal. All of this just gives you a description of the distribution of your data.

Okay, I was going to ask you to do this exercise, but since I am running out of time I am going to put it on myUnisa, or also on WhatsApp. We can discuss it
at a later stage, but I would prefer to put it on myUnisa as well, so that it becomes part of your exercises.

Okay, so when we have a numerical summary table, which is the frequency distribution table, we can take the midpoint of each class and use those midpoints and the frequencies to create what we call a frequency polygon. At the beginning it starts with zero, because there is nothing at midpoint 5. At midpoint 15, which comes from our class boundaries of 10 to less than 20, the frequency was three, so you can see that the point relates to that frequency. For 25, which is based on the class boundaries of 20 to less than 30, it was six. And you can see the shape looks like that. This is what we call, sorry, I clicked ahead, a frequency polygon when we use the frequencies, or a percentage polygon when we use the percentages.

When we use the cumulative values, we no longer use the midpoints; we use the lower class boundaries. We take all the lower boundaries and put them on our horizontal axis, and then we plot the cumulative percentages against them. At the lowest boundary, 10, there will not be anything, because there was nothing below 10, so, just as with the midpoints, the first point is zero. When it goes to 20, remember that there were three values below 20, so that point will be 15 percent, and so forth. This is what we call an ogive, or a cumulative percentage polygon, or a cumulative
frequency polygon, or an ogive. It uses the cumulative percentages, and we can use it to compare groups of information.

Okay, when we do chapter 12 we will talk more about the scatter plot, but you must also know that the scatter plot is a graph for numerical values; here it is not for one numerical variable only, but for two. We use it to check, or to determine, the relationship between two numerical variables. If you look here, we have the production volume per day and the cost per day, and if we plot them we can see the relationship: when the volume per day increases, the cost per day also increases. We will discuss the scatter plot in more detail when we do chapter 12, when we look at regression and correlation.

For now, I have been talking for almost two hours, so my throat is killing me; that is why we always need to have a break. Let's do this final exercise. It asks which of the following techniques are applicable to quantitative data. We know that we use an ordered array. We know that we use a frequency distribution table. We know that we use a stem-and-leaf plot. We know that we can use a scatter diagram, which is the scatter plot or the dot plot that we have. Therefore it means all of the above. Since a, b, c and d are all correct, we say only e is correct, because e says all of the above. So the answer will just be e, and that is how you will answer the questions in the exam or in your assignment as well.

And that concludes what we were supposed to do. Usually this is only one and a half
hours, but because we started late and ran into trouble, it ran over. So, just to recap on what we did today: we described what statistics is, we defined the key concepts of statistics, we looked at the types of variables that we use, and we looked at the different levels of measurement. We ended by looking at how we organize categorical data in terms of summary tables, pie charts and bar charts, and how we organize and visualize numerical variables by using ordered arrays, frequency distributions, the stem-and-leaf plot, a histogram, a percentage polygon and an ogive. That is all we did today.

This last slide is just a graphical summary. When we look at qualitative data, you can see the types of tables we can create, and the graphical methods that relate to qualitative data are only the bar chart and the pie chart. When we look at quantitative data, you can see that there are many tables we can create: the frequency distribution, the relative frequency distribution, and so on. When it comes to the graphical methods, we can use the dot plot. The dot plot is like your scatter plot; it means one and the same thing, because we use dots. If you go back just two more slides, you can see that the points look almost like dots, so we can call it a dot plot or a scatter plot. We can also use the histogram, the stem-and-leaf plot and an ogive. Thank you, guys.
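
If you want to check the frequency distribution steps from today on a computer, the sketch below walks through them in Python: range, class width, class counts, percentages and the cumulative columns. It is only an illustration and not part of the module. The twenty values are made up so that they match the figures quoted in the session (lowest 12, highest 58, and class counts of 3, 6 and 5 for the first three classes), and the function name `frequency_distribution` and the rule for starting the first class at 10 are assumptions made for this example.

```python
import math

# Hypothetical data: 20 values invented to match the figures quoted in the
# session (lowest 12, highest 58); they are not taken from the slides.
values = sorted([12, 13, 17, 21, 24, 24, 26, 27, 27, 30,
                 32, 35, 37, 38, 41, 43, 44, 46, 53, 58])


def frequency_distribution(data, start, width, num_classes):
    """Tally data into classes [lower, upper) and add cumulative columns."""
    n = len(data)
    rows, running = [], 0
    for i in range(num_classes):
        lower = start + i * width
        upper = lower + width
        # A value equal to the upper boundary belongs to the next class.
        freq = sum(1 for x in data if lower <= x < upper)
        running += freq
        rows.append({
            "class": f"{lower} to less than {upper}",
            "freq": freq,
            "pct": 100 * freq / n,
            "cum_freq": running,
            "cum_pct": 100 * running / n,
        })
    return rows


num_classes = 5
data_range = values[-1] - values[0]            # highest minus lowest
width = math.ceil(data_range / num_classes)    # 9.2 rounded up to 10
start = (values[0] // width) * width           # assumed rule: first class starts at 10

table = frequency_distribution(values, start, width, num_classes)
for row in table:
    print(row)

# Relative frequency from the summary-table exercise: 160 out of 500.
relative_frequency = 160 / 500
print(relative_frequency, f"{relative_frequency:.0%}")
```

With these values the last cumulative frequency comes to 20 and the last cumulative percentage to 100, which is the check described in the session: the final class must account for every value.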