Also, on the register: just make sure that before you leave today you complete the register, and then we can continue with today's session. So welcome to another session. Today we're going to be looking at the relationship between two nominal variables, so two categorical variables, which is also known as the chi-square test or the chi-square test for a contingency table. Do you have any questions before we begin? Any questions for me? Okay, cool. Then we can just continue with today's session. So since I said we're going to be looking at the chi-square test, which tests the relationship between two nominal variables, what you require is to know how some of the calculations are done. As with the previous sessions, there is not much expectation that you will do calculations during the exam, but you need to know how things are built up and how they are calculated, and you also need to know the definitions of the topics that we discuss in the chi-square test for contingency tables. I will also show you a table of the critical values that we use to make a decision, because a chi-square test is also a hypothesis test of a relationship, so you need to make a decision at the end. You also need to know how to use your calculator to calculate some of these things, and I will take you through the concepts as we go along. So when we talk about the chi-square test, we are talking about a hypothesis test that you do to investigate whether the distributions of categorical variables differ from one another. The variables can be either nominal or ordinal, but most of the time they are nominal.
It is similar to the chi-square test of independence, because this one also uses a contingency table, which has rows and columns: some number of rows and some number of columns, depending on the values within each nominal variable. What do I mean? Say I have two variables, one of them gender and one of them marital status. When I talk about gender, I'm talking about two values specifically, male or female, and when I talk about marital status, you will have single, married, divorced, widowed, and so forth, right? So you will have multiple values within that variable. Now, if I take gender and marital status and create a contingency table with gender on the rows, I will have two rows, male and female, because gender has only two values. If I put marital status on the columns, which will be at the top, then I will have one column for each of its four or five categories, so with five categories it will have five columns, right? That will be the contingency table. With the chi-square test, when we make the decision we always refer to an upper-tail area, because the chi-square distribution is right-skewed. Even so, the chi-square test is a two-sided test. It is a non-directional test. This is very important, because we often make a mistake here: when we look at the distribution of a chi-square, the region of rejection is always on the upper side, to the right of the chi-square critical value, and so we assume that it's a one-directional test. No: the hypothesis is non-directional, which is why we call it a two-tailed test, even though the rejection region sits in the upper tail only. It is also referred to as a non-directional test.
So when we state a hypothesis, remember that because this is a test, we need to test and make a decision. For hypothesis testing we therefore need two statements, the null hypothesis and the alternative hypothesis. Your null hypothesis will state that the two categorical variables are independent. It will always state that they are independent, or equivalently that there is no relationship between the two categorical variables. So if I had gender and marital status, the null would be that there is no relationship between gender and marital status. The alternative will state that the two categorical variables are dependent, or that there is a relationship between them. That is how we state the hypotheses. In the same way, when you do a chi-square test for a contingency table, your contingency table will look like this: category one and category two on the rows, and data one and data two at the top. So these are two variables, the category variable and the data type variable. Where category one and data one meet there will be a value, A, and where category one and data two meet there will be a value, B. In the second row, category two and data one is C, and category two and data two is D. If you add the values for category one, A and B, it will give you the total of category one. So for category one we can say that, regardless of what type of data they are, there are A plus B. The same happens with data one. If you add the category one and category two values under data one, you will have A plus C. Therefore for data one there will be A plus C values, regardless of whether they are category one or category two. We're going to use this contingency table most of the time when we do a chi-square test. So if they didn't give you the totals, you need to be able to calculate them by adding the joint values within the table.
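As a small illustration of adding up the joint values, here is a minimal Python sketch. The four counts are hypothetical placeholders for A, B, C and D, chosen only to show the layout; they do not come from the session.

```python
# Hypothetical 2x2 counts (assumption, for illustration only).
table = [
    [10, 20],  # category 1: data 1 (A), data 2 (B)
    [30, 40],  # category 2: data 1 (C), data 2 (D)
]

# Row totals: A + B and C + D.
row_totals = [sum(row) for row in table]

# Column totals: A + C and B + D (zip(*table) walks the columns).
column_totals = [sum(column) for column in zip(*table)]

# Grand total: A + B + C + D.
grand_total = sum(row_totals)

print(row_totals, column_totals, grand_total)
```

The row totals and the column totals each add up to the same grand total, which is a quick check that the table was added correctly.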
Once we have completed the table, we also need to calculate what we call the expected values, and I'm going to show you the formula. We calculate the expected value by using the row totals, the column totals and the grand total, which is A plus B plus C plus D, meaning all the values together. And to test, we need to calculate the test statistic, which is the chi-square test statistic. You take your observed value, which is the value you are given in the original table, minus the expected value, which is the value that you need to calculate. You square that difference, which means you multiply it by itself, then you divide the answer by the expected value, and you add up the results for all the cells. On this slide they actually wrote the formula wrong: the sum should be the sum of the whole fraction. But we will look at the example just now. Once you have calculated your test statistic, you go and find your critical value. Remember, your critical value marks where the region of rejection begins: if the test statistic falls below the critical value, you do not reject, and if it falls above the critical value, you reject. That is the decision rule. So if your test statistic is greater than your critical value, you are going to reject the null hypothesis. What does that mean? We know that the region of rejection is on this side. If your test statistic falls in the shaded area, we reject the null hypothesis. Otherwise, if it falls in the white area, we do not reject the null hypothesis. We find the critical value, which is the chi-square critical value, by using the alpha value, or what we call the level of significance, and the degrees of freedom. So our critical value comes from alpha and the degrees of freedom.
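Written out, the quantities just described look roughly as follows. This is a sketch in standard notation, and the symbols are labels chosen here rather than taken from the slides: O_ij is the observed count in row i and column j, E_ij the expected count, R_i the row total, C_j the column total, N the grand total, and r and c the numbers of rows and columns. Note that the sum covers the whole fraction.

```latex
E_{ij} = \frac{R_i \, C_j}{N},
\qquad
\chi^2 = \sum_{i=1}^{r} \sum_{j=1}^{c} \frac{\left(O_{ij} - E_{ij}\right)^2}{E_{ij}},
\qquad
\text{reject } H_0 \text{ if } \chi^2 > \chi^2_{\text{critical}}(\alpha, \mathrm{df})
```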
And our degrees of freedom is always the number of rows minus one times the number of columns minus one. That is the degrees of freedom. You count how many rows we have, how many columns we have, and you multiply the product after you have subtracted one from the row and the column. And then you go to the table to make a decision. Let's look at that. Before we do that, let's recap on the steps. So the first step is, for you, you will be given the observed value, which are your observed frequency value. Then you need to calculate your expected value, which is your row total multiplied by the column total, divided by the grand total. And then you need to determine what your degrees of freedom is by looking at how many number of rows you have, how many number of columns you have, and it's r minus one times c minus one. Then you need to calculate your test statistic, which is your chi-square test statistic, which is the sum of your observed minus expected squares divided by the expected. And then you need to make a decision based on your critical value, which you would have found by using your degrees of freedom and the level of significance and your test statistic. And if it falls above the critical value, you reject the null hypothesis. So let's get an example. Do you like the television program? And yeah, the responses were in this questionnaire. Let's assume that this was a survey done on the street where we opened a shop where they sell TV packages. Let's say it's DSTV. Let's assume that this is we open a store for DSTV. So we ask people if they like the TV program or not. So did you like or dislike? So they need to tick either yes or no or like or dislike. The other question on this questionnaire that we asked was what is your gender? And they would have ticked whether they are male or female. Now we need to test whether or is there a relationship between gender and the response to the question, do you like TV program or television program? 
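The recap of steps above can be sketched in Python as three small helper functions. The function names and the sample table are assumptions made for illustration; the table is hypothetical and is not the survey data.

```python
def expected_counts(observed):
    """Expected count per cell: row total * column total / grand total."""
    row_totals = [sum(row) for row in observed]
    column_totals = [sum(column) for column in zip(*observed)]
    grand_total = sum(row_totals)
    return [[r * c / grand_total for c in column_totals] for r in row_totals]


def degrees_of_freedom(observed):
    """(number of rows - 1) * (number of columns - 1), totals excluded."""
    return (len(observed) - 1) * (len(observed[0]) - 1)


def chi_square_statistic(observed):
    """Sum over all cells of (observed - expected)^2 / expected."""
    expected = expected_counts(observed)
    return sum(
        (o - e) ** 2 / e
        for o_row, e_row in zip(observed, expected)
        for o, e in zip(o_row, e_row)
    )


# Hypothetical 2x2 table (assumption), observed counts only, no totals.
sample = [[10, 20], [30, 40]]
print(expected_counts(sample))
print(degrees_of_freedom(sample))
print(chi_square_statistic(sample))
```

The last step, comparing the statistic with the critical value, still needs the value read from the table for the chosen alpha and degrees of freedom.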
So on our rows, we put the answer to the question, like or dislike. And on our columns, we put male or female. From this we can see that there were 66 males, and of those 66, 36 liked the TV program and 30 disliked it. There were also 39 females: 14 liked it and 25 didn't. But also there were 50 people who liked the TV program regardless of whether they were male or female: 36 were male and 14 were female. And 55 disliked the TV program: 30 were male and 25 were female. In total, there were 105 people who responded to the questionnaire. So you can see that 66 plus 39 is 105. In the same way, 30 plus 25 is 55 and 14 plus 25 is 39. And that's how you can calculate the totals if they didn't give them to you. Now, stating the null hypothesis: there is no relationship between gender and choice of TV program. Alternative: there is a relationship. Or we could have said for the null that the two categorical variables are independent, and for the alternative that the two categorical variables are dependent. But here we're using "there is no relationship" for the null hypothesis and "there is a relationship" for the alternative. Now we need to calculate the expected values. To calculate the expected values, we always assume that the two variables in this table are independent of each other. Therefore, we take the row total multiplied by the column total and divide by the grand total. So if we want to calculate the expected value for the 36, we take the row total, which is 50, multiplied by the column total, which is 66, divided by the grand total, which is 105. If you take 50 multiplied by 66 divided by 105, it should give you 31.43, which will be your expected value. And you can do it for all the other values, except for the totals. So you can do it for female and like: the row total is 50 and the column total is 39, so you take 50 multiplied by 39 divided by 105.
So 50 multiplied by 39 equals, divide by 105 equals, and the answer you get is 18.57. That will be your expected value, and you do the next one. For the 30, it will be 55 multiplied by 66 divided by 105, and the answer will be 34.57. And the last one is 55 multiplied by 39 equals, divide by 105 equals, and that gives you 20.43. If you add the expected values, they should give you the same totals. So 31.43 and 18.57 should give you 50, 31.43 and 34.57 should give you 66, and the same for the others. And that's how you calculate the expected values. These are our expected values, as we have already calculated them. Once we have calculated the expected values, let's find the degrees of freedom. How many rows do you count, excluding the total? One, two, there are two rows. And columns: one, two, there are two columns. So remember, R means rows, and there are two of them. C means columns, and there are two of them. Two minus one times two minus one is one times one, which is equal to one. So our degrees of freedom is one. Now let's find the critical value and also calculate the test statistic. First, let's calculate the test statistic. Remember, our chi-square stat, which is our test statistic, is given by the sum of your observed minus your expected, squared, divided by the expected. Now, since this is the formula and it says the sum, and the sum means total, adding up, right? It means I can come here at the bottom and create a total row. These are my observed values and these are my expected values. If I look at the expected and the observed, remember, 36 corresponds with 31.43, 14 corresponds with 18.57, and so on. So I can rewrite them in a column format. Our observed values are 36, 14, 30 and 25, and our expected values are 31.43, 18.57, 34.57 and 20.43. If I add all the observed values, I will get 105. If I add all the expected values, I'll also get 105.
In the same way, if I add this column, I will get another answer for it. But what I want to get to is this part of the formula. So I'm going to break it down step by step. The first step is your observed minus your expected. For the first cell that is 36 minus 31.43, which is 4.57. Then you do the same for the rest of them, and you get the answers for the top of the fraction. But remember, the top must be squared and then each one divided by its expected value. So we take this value, 4.57, and multiply it by itself: 4.57 multiplied by 4.57, and we divide that number by 31.43. The answer you get will be 0.67. So how did we get 0.67? Whatever is inside the bracket, which is observed minus expected, is squared, so we say 4.57 multiplied by 4.57, and we divide the answer by the expected, which is 31.43. Then you do that for all of them: minus 4.57 multiplied by minus 4.57 equals, divided by 18.57. You do the same with the next one, divided by its expected value, and the same with the last. Those will be your answers. Once you have done all of them, you can create your total at the end. Remember we have the summation there, the summation of O minus E squared divided by E. This summation means adding all these values together. So you just add 0.67 plus 1.13 plus 0.61 plus 1.03, and that will give you 3.44. And that's how you calculate your test statistic. Once you have your test statistic, you can go to the critical value table. We know that we find our critical value using alpha and the degrees of freedom. And we already know that our degrees of freedom is 1. If they gave us an alpha of 5%, or 0.05, then our degrees of freedom of 1 will be on the side and our alpha value will be at the top, and where they meet, that will be our critical value.
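The worked example can be checked with a short Python sketch. The counts are the survey counts from the session; the variable names are illustrative, and the critical value 3.841 is the standard table value for 1 degree of freedom at alpha 0.05, hard-coded here as an assumption rather than looked up. Because the code does not round the intermediate values, the statistic comes out at about 3.42 instead of the 3.44 obtained from rounded figures; the decision is the same either way.

```python
# Observed counts from the survey: rows are like/dislike, columns male/female.
observed = [
    [36, 14],  # like
    [30, 25],  # dislike
]

row_totals = [sum(row) for row in observed]              # [50, 55]
column_totals = [sum(col) for col in zip(*observed)]     # [66, 39]
grand_total = sum(row_totals)                            # 105

# Per-cell contribution (O - E)^2 / E, with E = row total * column total / N.
contributions = []
for i, row in enumerate(observed):
    for j, o in enumerate(row):
        e = row_totals[i] * column_totals[j] / grand_total
        contributions.append((o - e) ** 2 / e)

statistic = sum(contributions)

# Table value for df = 1, alpha = 0.05 (hard-coded assumption).
critical_value = 3.841
reject_null = statistic > critical_value

print([round(c, 2) for c in contributions])
print(round(statistic, 2), reject_null)
```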
And if that is our critical value, then you can make a decision. Our critical value is 3.841, and going back, we found our test statistic is 3.44. So where is 3.44? It falls in the do-not-reject region: since our chi-square statistic is 3.44, it lies to the left of the critical value of 3.841. This value of our chi-square statistic is less than our critical value of 3.841. Therefore, we do not reject the null hypothesis, like I said, because it falls in the do-not-reject area below 3.841. Remember, for anything that falls there, above the critical value, we reject the null hypothesis. For anything that falls here, below it, we do not reject the null hypothesis. And that is how you do the test. In conclusion, we say there is no significant relationship between the choice of TV program and gender. That's it in terms of the chi-square test for contingency tables. Any questions before we go into exercises? Like I said in the beginning, in your module you usually just need to know how we got to all these calculations. In the exam, there is not a huge expectation that you do the calculation. As long as you know the concepts and you are able to explain some of them, it should be sufficient. So let's look at the activities. And like I said to the others who joined today's session, I'm going to end at 7. Let's see how far we get, because your questions are almost all similar to one another. So this was from the past exam paper of October/November 2017. The question reads as follows, and this is your turn to answer. A researcher wants to establish whether the type of employment category filled by employees of a particular company is significantly related to gender. The employees can be categorized as manager, human resources, administrative, maintenance, or IT worker, and the genders are male and female. Which would be the appropriate test to use? Is it one?
The t-test for two independent samples; number two, the Pearson correlation test; or number three, the chi-square test statistic? Is it one, two, or three? Three. It will be three. So yeah, today we're only talking about one topic, right? We are discussing one concept. It's not like when you are sitting in the exam, reading a question, and you panic because you don't know which test to use. Always pay attention to the details. Now, in this one, the key words are "significantly related". Think of the previous session that we had last week, where we spoke about the hypothesis test for a relationship. It differs from how you test a hypothesis for independent groups with a t-test and so on, because there they will ask you: are the two groups different? Do they differ? Is one bigger than the other? When it comes to a test of relationship, there is only one key word, "related", because it talks about the relationship. Now, the only remaining distinction is the type of the two variables. The relationship can be between two numerical variables, which we did last week, and that points to the Pearson correlation test, right? Or you have categorical variables and the question talks about being related or independent. Probably they won't say a lot about independent, but when they talk about related, that gives you the key. Also, when you read the question and they have not given you anything like a numerical value, such as a score, exam mark, age, temperature, things like that, then you should know that this has nothing to do with Pearson. It's a chi-square test, right? That's one. One down, 15 more to go. Question two. Which of the following is an appropriate formula for the chi-square test? Here they don't expect you to know how to calculate it. They just want to see if you know the formula. So based on what we just did today, which one is the correct one? Is it one, two, or three?
One. It will be one, because one is the sum of your observed minus your expected, squared, divided by your expected. Number two is the test for one sample. Number three is for the Pearson correlation test: that is the formula to calculate the coefficient of correlation. Next: what does a contingency table represent? Some of you might have joined late, but if you know the answer, you can answer. Number one, does it represent a distribution of the frequencies for a variable? Two, a frequency count for each of a number of possible outcomes of an experiment? Three, the frequency count of each outcome measured on two nominal-scale variables when they are cross-classified? Is it one, two, or three? What are we talking about today? Number one. Nope. It says the frequencies for a variable. That means there is only one variable there. What does this represent? A histogram. Did you do histograms, or a frequency distribution table? Or a bar chart. It can also represent a bar chart, because it's only one variable. The histogram is a bar chart for a numerical variable and a bar chart is a chart for a categorical variable, but only one. So then it's two and three left. Which one? Key words. Key words. What is a number of possible outcomes? That is probability, right? And today we are not talking about probabilities. Today we're talking about testing for two nominal variables, or two categorical variables, right? So the only option that we have is three. Take this contingency table: it has two nominal variables, because there is no rank or order in like or dislike, or in male and female. There is no order. So they are both nominal variables, and they are cross-classified, because we cross-classified both of them. Individually, choice would just be like and dislike, with 50 and 55, and gender would just be male or female, and so on. So that will be number three.
So for chi-square, can you only have two rows and two columns? No, you can have more. Like I explained earlier when we did the example, I spoke about marital status. With marital status you can have single, married, divorced, widowed. I don't know what the other ones are, but you see there are already four that I could mention. When you create the survey, you are asking people questions, and for some of the questions they need to pick between two groups, or four groups, or three groups; it depends. What other one has at least three? Let's say level of school, right? You have primary, secondary and high school, things like that. There are three levels in there. So it doesn't have to be only two. Okay, thank you. It can be more. Yes. And hence we use what we call a contingency table, and a contingency table has r rows and c columns, because we don't know how many there will be: it can be 10 rows and five columns, or it can be two rows and five columns. It depends on the categories that you are using. Okay, so moving on to the next question. Which of the following tests is appropriate for determining whether a relationship exists between two variables if both are measured on a nominal scale of measurement? One, two, or three. One says it's the t-test for two independent samples, two says testing the significance of a Pearson correlation coefficient, and three is the chi-square test. Number three. Three, yeah. Yes, it is number three. It is a chi-square test. Next: the chi-square test is used to compare which aspects of the data for two samples? One, the distribution of the data as classified in terms of a variable, and that is a key one, "a variable". Two, the sample means of the variable for each sample, and that is another key word. Three, the variance of the variable as measured for each sample, and that is another one.
Probably a variable should not be the key word, the key word is classified. Think about what we discussed today. Yes, Komoto, your hand is up. Good evening. I'm not sure. I think the answer is one, but once you've explained the answer, please also explain where three would fall under or what three represents. Thank you. None of the two at the bottom, two and three deals with what we discussed today. Remember everything to do with kai square test. We never spoke about sample means, we never spoke about variance. So you cannot compare test statistic with any of those. We don't compare the sample mean, but we compare the variable as they are classified. For example, gender and choice. We can cross classify them, right? As a variable is classified as gender, it has two levels, male and female. Choice had two levels, like and dislike. So for sample mean, it's when we do a test for independence for the variance as well is test for independence when we testing for two independent variables, when your variances are not equal or they are equal, which will be the test of ANOVA usually. So for today, what you have left is the distribution of the data as classified in terms of a variable. Number six, a number of psychiatric patients are classified by gender and into four categories as schizophrenic, severely depressed bipolar and others. As you can see, they are classified tonight. Which of the following is suitable for representing counts of frequency of a person which falls into each possible category? One, is it the contingency table? Can we use this information to create a contingency table? Or can we create a scatter plot? Or can we create a histogram? Or can we create a spreadsheet? Think about what we discussed today. Number one. Yes, definitely number one. 
It is a contingency table because you are given two categorical variables, which is gender and the categories of probably mental disorder or something like that, which are classified under those two of psychiatric disorders, schizophrenic and depressed and bipolar and others. A scatter plot, that's what we discussed last week. We used a scatter plot to look at the relationship between two numerical variables. A histogram, we also spoke about it today to say it's only a bar chart for visualizing a numerical value, only one numerical value. And a spreadsheet is something that is in general is a tool, is a tool that we use. Next one. A researcher studying possible sex linked inheritance of three psychiatric disorder denoted by ABC, tabulated by gender with male or female of 100 psychiatric patients against diagnosis. And this is the table. You have male, female, A, B and C and they also calculated the total of the psychiatric disorder. Which research design did the researcher use? Is it one, a correlational design? Two, is it a two sample group design? Three, is it a three sample groups design? Now think about this in this way. Correlation also refers to relationship, right? If we think about the correlation, we talk about relationship. And also think about what we have been discussing. If you have, if you've watched or you attended the first session where we did the intro into the basic statistics for human science, we discussed a decision tree, right? And we spoke about correlational studies or tests. And we said this for testing the relationship when we have numerical value and when we have categorical value. So on this one, I'm going to give it to you because probably it's very confusing in a way of the way they have asked them. And today we didn't touch anything about the research design. And because it talks about a research design is different from a test statistic. A research design is how you formulate what type of a research you are doing. 
Are you doing correlational research, where you're looking at the relationship between the variables, regardless of whether the variables are numeric or categorical? Or you will have two sample groups, and then you will also need to think about whether they are independent or dependent. Is it a before-and-after study? Those are things that you need to take into consideration. And with three sample groups, you need to think about ANOVA, because you cannot do a t-test or a z-test on that. So you will have to do an ANOVA test, things like that. So those are research designs. I'm not going to talk too much about it. I'm not sure if in your module you do this; where did I get this? I'm not sure if it's even like 3704. But yeah, that's where I got the question from. And this next one, you can answer. Based on the same information, what are the requirements with regard to the test statistic that needs to be performed? So now think about it. Let's go back to the beginning, when we started the class and introduced what the chi-square test is, right? When I'm on that slide, I'm giving you the answer to the question, so you need to tell me the answer before I go back, even though some of you were not there when we were discussing that section. So now, based on that information, what are the requirements with regard to the test statistic that we will be performing? Here they're asking you what type of test a chi-square test is. Is it a directional test statistic? Is it a non-directional test statistic? Or is no test statistic required? You want me to go back to that slide? We can do that. Number two. Let me show everyone: a chi-square test, even though it is evaluated in one tail, is always a two-sided test, or what we call a non-directional test. So the answer to that question will be number two.
The answer is that a non-directional test statistic is required to answer this question. Question number nine, and I hope you can still see my slides: representing the gender, male or female, of members of parliament versus the political party to which they belong is best done in which form? Is it best done in a tabloid, a contingency table or a two-sample group design? Contingency table. Definitely in a contingency table. Next: a researcher wants to establish whether a relationship exists between people's religious affiliation and whether they are in favor of or against the death penalty, yes or no. Which of the following would be most appropriate to use? One, will it be a t-test for two independent samples? Two, will it be a chi-square test? Or will it be a Pearson correlation test? Number two. It will be number two, because it says a relationship exists between people's religious affiliation and whether they favor the death penalty, yes or no. So there are two categorical variables, so it's number two. Next: Sally wonders whether a relationship exists between a person's length and their leadership ability. She collects data from a sample of 95 people and classifies them as short or tall, and as leaders, followers, or those she could not classify, which is another category in terms of leadership ability. From this she creates a contingency table. This is the contingency table of the length of a person against whether they are a leader, a follower or someone she cannot classify. If the frequency data is evenly distributed throughout the categories, with no proportional differences between the tall and the short people as far as leadership ability goes, what would the expected number of people who can be classified as short leaders be? Now, here is the thing.
As I said, in your exam they might not ask you to do calculations, but I see here they are asking you to calculate the expected value of short leaders. So it's this cell, short and leaders. You can see from this table that there are no totals, so I will help you calculate them. For short, 32 plus 14 is 46, plus six, it will be 52. So that will be 52. For tall, 12 plus 22 is 34, plus nine, it will be... I'm stuck with my maths. 43. It will be 43. Yes, 43. And 43 plus 52 should give us 95. In the same way, for leaders, 12 plus 32 is 44. For followers, 22 plus 14 is 36. And for unclassified, nine plus six is 15. And 44 plus 36 plus 15 is 95. Okay, so now we need to take the row total of leaders and the column total of short. The row total is 44, multiplied by the column total of 52, divided by the grand total of 95. So we say 44 multiplied by 52 equals 2288, divided by 95, equals 24.08, which is option number three. And that's how you answer the question. I think we have the last question, and it follows from the same question that we had. Here they ask: to determine whether a relationship exists between length and leadership ability, Sally has to calculate the appropriate test statistic; in which of the following categories will the result fall? Now, here is the catch. You need to calculate the test statistic. Well, there is no catch. I said to you that in your exams you won't do much calculation, but this was not an exam, it was a tutorial letter. So in the tutorial letters, when you are doing assignments, they will ask you to do the calculations. In the exam, you have seen the kind of questions that they ask. So we need to calculate the test statistic. Now, we have only calculated the expected value for one cell, which was 24.08 for leaders and short. So we need to calculate it for all of them.
So going back, I'm going to go one up, to tall leaders. The row total is 44 and the column total is 43, so 44 times 43 equals, divided by 95. I get 19.92. And we go to the next one, tall followers. The row total is 36 and the column total is 43, so 36 times 43 equals, divided by 95: 16.29. And I'm going to do the one below, tall and unclassified. 15 times 43 equals, divided by 95, equals 6.79. And I will go to the next one, short followers. 36 times 52 equals, divided by 95, equals 19.71. And we do the last one, short and unclassified. 15 times 52 equals, divided by 95, equals 8.21. And once you are done, I'm going to leave you hanging on this, because I need to jump off the call. You take the sum of your observed minus your expected, squared, divided by your expected. So you will say 12 minus 19.92, squared, divided by 19.92, plus 32 minus 24.08, squared, divided by 24.08, plus 22 minus 16.29, squared, divided by 16.29, plus the rest, and you do all of them until you get to 6 minus 8.21, squared, divided by 8.21. And once you get to your answer, you can look at whether it falls below zero, between zero and two, between two and four, or above four. So with the answer that you get there, you can just choose which one it falls in. I'm not going to be able to help you there because I need to go. In conclusion, you have learned how to test for the relationship between two nominal variables. Are there any questions, comments? Please remember to complete the register. If there are no questions or comments, that is my time with you. Thank you for coming. I will keep you updated in terms of the exam preparations. Thank you very much. Thank you. Bye. Thank you.
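For anyone finishing the last exercise afterwards, the remaining arithmetic can be sketched in Python. The observed counts are the ones read off during the session's totals (tall: 12, 22, 9; short: 32, 14, 6), and the band labels follow the answer options as they were read out; the variable names are illustrative assumptions. With these counts the unrounded statistic comes out at roughly 10.7.

```python
# Observed counts: rows are leaders, followers, unclassified;
# columns are tall, short (as read off during the session).
observed = [
    [12, 32],  # leaders
    [22, 14],  # followers
    [9, 6],    # unclassified
]

row_totals = [sum(row) for row in observed]              # [44, 36, 15]
column_totals = [sum(col) for col in zip(*observed)]     # [43, 52]
grand_total = sum(row_totals)                            # 95

# Expected count per cell: row total * column total / grand total.
expected = [
    [r * c / grand_total for c in column_totals] for r in row_totals
]

# Chi-square statistic: sum of (O - E)^2 / E over all six cells.
statistic = sum(
    (o - e) ** 2 / e
    for o_row, e_row in zip(observed, expected)
    for o, e in zip(o_row, e_row)
)

# Place the result in one of the answer bands from the question.
if statistic < 0:
    band = "below 0"
elif statistic <= 2:
    band = "between 0 and 2"
elif statistic <= 4:
    band = "between 2 and 4"
else:
    band = "above 4"

print([[round(e, 2) for e in row] for row in expected])
print(round(statistic, 2), band)
```

A chi-square statistic is a sum of squared terms divided by positive counts, so it can never be negative; the "below 0" branch is kept only to mirror the answer options.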