This is a continuation of our lecture on descriptive statistics. Welcome to the lecture. Throughout this course, you'll find references to analyzing data with spreadsheets. The specific tool we're currently using is Microsoft Excel. Why do we care? Well, for one thing, this is what people do in the real world. You don't take a bunch of numbers and analyze them using pencil and paper and calculators; you need to know how to work with your data. Even more importantly, if you are an undergraduate student, and that's who this course is aimed at, you need such a skill on your resume. You need to be able to tell a potential employer that you are capable of doing work in spreadsheets with Microsoft Excel. That's currently the standard. Quite a number of students, from both of us, got jobs or internships, in whatever field, it didn't even matter, mainly because they had this skill on their resume. So it's important for you to know it, both for your own career development and for the subject material of this course.
On the handouts page of the course website, we have very simple step-by-step instructions to help you get started using Excel and to help you with the different types of analyses. How do I do regression? How do I do descriptives? How do I get a scatter plot? We have those instructions in print form, and we have many of them as video tutorials. And I don't have to tell you, you can go on YouTube and find all kinds of tutorials to help you use various intro and advanced features of this software tool.
Let's learn how to use Excel, a very powerful tool for statistics. We have 17 student scores on a national calculus exam. Notice a couple got a zero, and the highest grade looks like a 95. We have to open up MS Excel. You go to Data Analysis, look for Analysis Tools, and then you look for Descriptive Statistics.
Now if you don't see Data Analysis there, that means you have to use the add-in feature. You can look it up, or we have explanations all over the place of how to put in the add-in so that you can do descriptive statistics. Once you're in Descriptive Statistics, make sure to check the Summary Statistics box. You need to check that box. Okay, enter your data, and now you run it, and you'll get the output on the next slide. You'll see what it looks like.
Okay, now we see the output using MS Excel. Notice you get the mean. By the way, you can also get the mean the hard way, because it gives you the sum of the data set, 662. Always look at the count to make sure that you got all 17; in this case n was 17. I've seen students type in 20 numbers, and the count is 6. They don't notice. You've got to look. If you do 662 over 17, you'll get that mean of 38.94. The standard error we're going to do a little later in the course, so you don't have to worry about it at this point. It's called the standard error of the mean. It's used in inference. The median, there it is, 30. The mode is 0.
Okay, and then we have the standard deviation. Now that's the sample standard deviation. There's no room to write the whole thing out. Bear in mind this is the sample standard deviation, so they divided by n minus 1. So your standard deviation is 33.44. Square that and you get the sample variance. Here it says Sample Variance, and it's 1118.43. The kurtosis is never used by us in this course, so ignore it. It's a measure of peakedness. Skewness, we know what that is. If it's a number very close to 0, then you have symmetric data. Here there seems to be a bit of a positive skew, and that's what happens when the mean is above the median. There's a positive skew. Now we look at the range. The range is the highest value minus the lowest value, and it gives you the minimum and the maximum.
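The figures on the Excel output are tied together by simple arithmetic, and it can help to check that by hand. The sketch below is not part of the lecture. It uses only numbers quoted from the output (sum, count, sample variance, minimum and maximum), and the variable names are made up for illustration. The standard error line assumes the usual definition, the sample standard deviation divided by the square root of n.

```python
import math

# Figures quoted from the Excel output in the lecture.
total = 662               # Sum
n = 17                    # Count
sample_variance = 1118.43
minimum, maximum = 0, 95

mean = total / n                            # sum divided by count
sample_sd = math.sqrt(sample_variance)      # standard deviation is the square root of the variance
standard_error = sample_sd / math.sqrt(n)   # assumed definition: s / sqrt(n)
data_range = maximum - minimum              # highest value minus lowest value

print(f"mean           = {mean:.2f}")
print(f"sample sd      = {sample_sd:.2f}")
print(f"standard error = {standard_error:.2f}")
print(f"range          = {data_range}")
```

The mean and standard deviation printed here should agree with the 38.94 and 33.44 read off the output. If they did not, the first thing to check would be the count.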
You should always look at that to see if that was the actual minimum and maximum. Okay, well, the highest value is 95, the lowest is 0, and the range of course is 95 minus 0, or 95. And that's basically how you look at the output. It gives you everything in one shot, it's quick, and you should learn how to use this tool. It's very powerful.
What else can we do with a data set to make it interesting and to learn something about it? We can take our numerical data, which has a particular sample mean and a particular standard deviation, and transform it into a new set of data that's called standardized data. We denote it as Z; the original data we might call X. The new set of data has a mean of 0 and a standard deviation of 1. When you standardize data, you know for sure what the mean and standard deviation are. With your original data, they could be anything. In addition, this transformed data set will not have units. If your original data was in dollars or in hours, the standardized data are pure numbers, no units.
Take a look at the formula and you'll see why. We take our X's, those are on the right, and we transform them into Z's. The numerator is X minus X bar and the denominator is S, the standard deviation. So for every value of X, we transform it into a Z by taking the deviation around the mean, X minus X bar, and dividing it by S, the standard deviation. That ensures that your standardized data will have a mean of 0 and a standard deviation of 1. You can see how the units cancel, because the X and the X bar and the S are all in the same units, dollars, dollars, dollars, or hours, hours, hours, and so the units cancel.
What can we do with this? We'll see pretty soon when we look at an example. But since the transformed data has a mean of 0, any original score that's below the mean will now be negative. It'll be below 0.
Just as an example, take exam scores. If you took an exam and your score was exactly at the mean, the standardized data will show a score of 0. Any score that's above the mean will end up being positive in the standardized data.
Let's look at some examples of standardizing data, turning the data into Z scores. Look at the first one, example one. Very simple data: 0, 2, 4, 6, 8, 10, six data values. We have a mean of 5 and a standard deviation of 3.74. So for each X value, we can convert it into a Z with the simple transformation of X minus X bar divided by S. And you get the transformed data, the Z scores. If you were to add those up and divide by 6, you know what you will get, right? You'll get a 0, because the mean of the standardized data is 0. And the standard deviation, if you figured that out, would be a 1.
Now let's take a look at example two. Example two is exam scores, which are usually standardized. If you take a standardized exam like the SAT or the GRE and you get a report, you'll get your original score and you'll get your standardized score. It's nice to know what everything means, right? So look at the original data that's all the way to the left. You can imagine this on a timeline from left to right. The first two columns are the original data: first X and then Z, which was the transformation of the original data. X had a mean of 75.57; S, the standard deviation of the data set, was 23.75. And you have all those Z values. That arrow is highlighting the most unusual Z value, which was negative 2.89 (that is, 7 minus 75.57, divided by 23.75), very, very far below the mean. Is that an outlier? Is it a real value? Is it something important that maybe educators should make note of? It's also very possible that it was a data entry error. There's that seven. Let's take a look and find the seven that caused this problem. Maybe it was really a 97.
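Example one is small enough to check directly. The sketch below is an illustration rather than anything shown in the lecture. It applies Z = (X minus X bar) divided by S to the six values, using the sample standard deviation (dividing by n minus 1), which matches the 3.74 quoted. The function name `standardize` is made up for this sketch.

```python
import statistics

def standardize(values):
    """Return Z scores: (x - sample mean) / sample standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample standard deviation, divides by n - 1
    return [(x - mean) / sd for x in values]

x = [0, 2, 4, 6, 8, 10]   # the data from example one
z = standardize(x)

print(f"mean of X = {statistics.mean(x):.2f}, S = {statistics.stdev(x):.2f}")
print("Z scores:", [round(v, 2) for v in z])
print(f"mean of Z = {statistics.mean(z):.2f}, standard deviation of Z = {statistics.stdev(z):.2f}")
```

Values below the mean of 5 come out negative and values above it come out positive, and the Z scores themselves have a mean of 0 and a standard deviation of 1, up to floating-point rounding.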
So moving along the timeline from left to right, we see that the seven was changed to a 97, which changed everything. The mean of this data set, these X's, is 79.86. The standard deviation is 18.24. And you see the Z values. Now, which is the most extreme Z value? Negative 3.12, which corresponds to a value of 23. So once we got rid of the seven, the 23 was the next lowest data value. How did we pick it up? Because we found the unusual Z value. That's what standardization does for you. And then we took a look and did this again in a third iteration. But I think by now you probably get the idea.
Let's continue with our discussion of standardizing data, or what we can call the Z scores. Generally, you're looking for an unusual Z score, because that very often indicates something to look out for. Why? It could be, as was mentioned, a data entry error. Say you see a Z score of plus five or more, or of minus five or less. That's incredibly unusual. Generally, when data is standardized, about 95.5% of the Z scores should be between plus two and minus two. Most of the data should be within two standard deviations of the mean, again, if the data is normally distributed.
Here are some values, and you'll see these on the Z table, directly tied to the Z scores. 95% of the data will be between plus 1.96 and minus 1.96. 99.7% of the data will fall between plus three and minus three. 99.99% of the data will fall between plus four and minus four. That's actually one way of checking whether the data follows a normal distribution. You determine the Z scores and you see how many of them are, let's say, between plus 1.96 and minus 1.96.
Let's say the data is not normally distributed. In fact, it's badly skewed. Remember, a normal distribution is symmetric. Let's say it's badly skewed. It's not normal. We still have something called Chebyshev's theorem, which says that at least 75% of your data should be within two standard deviations of the mean, between plus two and minus two.
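The percentages just quoted can be reproduced without a printed Z table. The sketch below is an illustration, not part of the lecture. It uses `NormalDist` from Python's standard `statistics` module for the normal-distribution figures, and the formula 1 minus 1 over k squared for Chebyshev's guaranteed minimum. The function names are made up for this sketch.

```python
from statistics import NormalDist

std_normal = NormalDist()  # standard normal: mean 0, standard deviation 1

def normal_coverage(k):
    """Share of a normal distribution with Z between -k and +k."""
    return std_normal.cdf(k) - std_normal.cdf(-k)

def chebyshev_bound(k):
    """Chebyshev's minimum share within k standard deviations of the mean (k > 1)."""
    return 1 - 1 / k**2

for k in (1.96, 2, 3, 4):
    print(f"within {k} standard deviations: "
          f"normal {normal_coverage(k):.2%}, "
          f"any distribution at least {chebyshev_bound(k):.2%}")
```

For k equal to 2 this gives about 95.45% under normality and a guaranteed 75% from Chebyshev, which are the two figures the lecture contrasts.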
Standard deviations give you a rough indication of where your data should be. Worst case scenario, at least 75% of your data should be within two standard deviations. If the data is normally distributed, then 95.5% will be between plus two and minus two. And we'll learn more about normally distributed data throughout the course.
Okay. How do we summarize data that comes to us in pairs, two variables? Well, it depends on the variables. If you remember from our very first lecture, you do different things with data that's nominal and different things with data that's quantitative. Nominal and categorical are the same. What do we do with categorical data? Well, we can draw pretty pictures. We can do some kind of graphical stuff, which we're not going to look at in this course. And we can use contingency tables, which we will look at and which are important. We're going to be carrying contingency tables forward into another topic, as you'll see. But for now, we can use contingency tables for two variables that are both nominal or categorical. We can get frequencies. We can get percentages, but that's about it. I guess we could look at it and get the modes, of course. But in terms of the two variables working together, it's two-way frequencies and two-way percentages.
What can we do with numerical data? Basically, if we have two variables coming to us in pairs and they're numerical, we're looking for some kind of relationship. We're looking to see how they relate to each other. You can do that better with quantitative data than you can with categorical data. We can draw a plot of the data where every point on the plot represents a pair of values, call one X, call one Y. We can do correlation, which we will do later in the semester, to look at the relationship between the two variables. We can fit a regression line and analyze that line. We'll learn that more in depth later in this course.
Let's look at a contingency table. We're looking at two categorical variables.
Remember, that's nominal data. We display it in a contingency table. Suppose there's some kind of election coming up and we ask people to identify by party. Assuming there are very few in any other categories, we just have to deal with Republicans and Democrats. Same with gender. As you know, there's more than male and female, but suppose there were just too few in the other categories to analyze. We have male and female, and we have the Republican candidate and the Democratic candidate. Notice the total sample size is 1,000. Let's see what happens. There are 400 men in our sample who plan on voting. 250 said they're going for the Republican, 150 said the Democrat. Among the 600 females in the sample, 250 said they're going for the Republican, and 350 said they're going for the Democrat.
If you want to turn this into a percentage table, the table on the right, which is actually helpful, divide all the numbers by 1,000. That's n, the sample size. There are 1,000 people, so divide everything by 1,000. 250 over 1,000, you get 25%. 150 over 1,000, 15%. You can see that 40% of the sample is male and 60% is female. Notice it adds up to 100%: 1,000 out of 1,000 is 100%. Looking at these percentages, you can actually determine whether there's a relationship between gender and whether they vote Republican or Democrat.
What do we do with two quantitative variables? Remember, these variables come to us in pairs. We graph them, and we look at the graph and see if we can come up with any pattern. Here we have 10 students. I gave them a test, so I have their exam scores. And I asked them to write down their height in inches. So for every one of my 10 students, I have an exam score and I have a height. And you know why? Because I have a theory, and I'm testing my theory here. The theory is that the shorter you are, the better you will do on a test. And anyone who knows me knows I might be a little bit self-interested here, but we won't go into that.
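Going back to the voting table for a moment, the percentage table can be rebuilt from the four counts. The sketch below is an illustration, not part of the lecture. The first part divides every cell by n, as described above. The last part, percentages within each gender, is an extra step added here as one common way to read a relationship out of such a table. All variable names are made up for this sketch.

```python
# Counts from the lecture's contingency table: (gender, candidate) -> frequency.
counts = {
    ("Male", "Republican"): 250,
    ("Male", "Democrat"): 150,
    ("Female", "Republican"): 250,
    ("Female", "Democrat"): 350,
}

n = sum(counts.values())  # total sample size

# Percentage table: every cell divided by n.
cell_pct = {cell: 100 * c / n for cell, c in counts.items()}

# Totals for each gender and each candidate, as percentages of the sample.
gender_pct, party_pct = {}, {}
for (gender, party), pct in cell_pct.items():
    gender_pct[gender] = gender_pct.get(gender, 0) + pct
    party_pct[party] = party_pct.get(party, 0) + pct

# Extra step (not in the lecture): percentages within each gender.
gender_totals = {}
for (gender, _), c in counts.items():
    gender_totals[gender] = gender_totals.get(gender, 0) + c
within_gender_pct = {
    (gender, party): 100 * c / gender_totals[gender]
    for (gender, party), c in counts.items()
}

print("n =", n)
print("cell percentages:", cell_pct)
print("gender totals:", gender_pct)
print("candidate totals:", party_pct)
print("within-gender percentages:", within_gender_pct)
```

If the within-gender percentages were the same for men and women, that would suggest no relationship between gender and candidate. Here they differ, which is the kind of pattern the percentage table is meant to reveal.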
And I have something to base it on, right? Because if you're tall, and I feel bad for all those tall people, the oxygen has to get all the way up to your brain. This can't be helpful. I have my theory, and I decide to test the theory by collecting data. And I plot the data, and I get the plot that you see. It's called a scatter plot. I did it with Excel. And it's kind of hard to come up with any relationship here, isn't it? What I would like to see is that the taller people are associated with lower exam scores. No, basically it's not working out. So I'm going to have to do something else.
We can also get metrics here, but we're not really going to study these until we do the lectures on correlation and regression. However, it doesn't hurt for you to see it. There's a correlation coefficient listed of 0.12. The square of that, which is also meaningful, is 0.01. And basically what those metrics say to us, unfortunately for my theory, is no relationship.
This is example two. We're going to look at the scatter plot. Notice that X is hours studied weekly. Somebody actually studies 15 hours a week. And we have somebody who only spends an hour and a half studying. And we're also looking at the high school averages, which seem to go from around 53 all the way up to a 99. So we plot this. We have basically two measurements for each subject: how much they study per week, which is an average, and what their high school average was. Look at the scatter plot. It kind of seems linear. Obviously the points are not all on a line. But one look at this and I'd say, wow, this is linear with a positive slope, which basically indicates that the more you study, the higher your high school average. And you can see that. Look at the last point all the way to the right. You can see it's between 14 and 16. That's the 15. That's the person who spent 15 hours studying.
And they're almost, not quite, touching the 100 mark, because they had a 99 average. So each point represents an XY combination. When we learn correlation, the correlation is actually going to be significant. It's a positive 0.7843. We didn't learn how to test this for significance yet. It will be significant, showing there's a positive relationship between hours studied and high school average, which means you can tell people that generally the more you study, the better you'll do in high school. And I think the same would be true of college.
Example three of a scatter plot. We're looking at high school average, same as before. But now we're looking at hours of sleep, how much you sleep per day. We're looking at 17 students, and we notice somebody there sleeps 14 hours a day. But of course, he's including sleeping during class. Okay. And somebody is only sleeping six hours. We want to know: is there a relationship between how much time you spend sleeping and your high school average? Again, we have 17 people, so we have 17 points on the graph. And we put a line through it using Excel. You'll learn how to do that. There are instructions on how to use Excel to get a scatter plot. And it looks very linear, but it looks like it has a negative slope. Right. Look at it carefully. You can see the point where it's 14, all the way to the right. That 14 is the person who had a 50 average. So you see each point represents a pair of observations. And clearly it looks like a negative relationship. That's a negative slope we're looking at. And again, as you see, some points fall on the line and most are not on the line, but the line does the best job. It's a regression line. No line does a better job of going through these points. We'll learn how we get this line. And notice the correlation coefficient.
It's negative, to show the negative slope: negative 0.916. That's going to turn out to be significant when you learn about significance tests. And basically we conclude that people who sleep a lot are not going to do so well in high school. You've got to cut down a little bit on your sleep.
And now, you know, you've got to do a lot of problems. Practice, practice, practice. You'll find a lot of problems in the notes and homework assignments. The more you do, the better you'll get at this. And as we keep telling you, this is an important course. We're not indoctrinating you; we're teaching you how to use data so you can come to the right conclusions. I don't care if you're left leaning, right leaning, or center leaning. Use data properly, and you're going to become a smarter person.
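As a preview of the correlation lectures, here is a minimal sketch of how a correlation coefficient and its square can be computed. It is an illustration, not the method Excel uses internally. The r values in the first loop are the ones quoted for the three scatter plots. The `hours` and `average` lists are invented for this sketch only, since the lecture's actual data points are not listed, and the function names are made up as well.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient for paired data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def least_squares_slope(xs, ys):
    """Slope of the least-squares regression line of y on x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    return sxy / sxx

# Correlation coefficients quoted in the lecture, with their squares.
quoted = [("height vs. exam score", 0.12),
          ("hours studied vs. high school average", 0.7843),
          ("hours slept vs. high school average", -0.916)]
for label, r in quoted:
    print(f"{label}: r = {r}, r squared = {r ** 2:.2f}")

# HYPOTHETICAL data, invented for illustration; not the lecture's data set.
hours = [1.5, 4, 6, 9, 12, 15]
average = [53, 70, 68, 85, 90, 99]
r = pearson_r(hours, average)
slope = least_squares_slope(hours, average)
print(f"hypothetical data: r = {r:.3f}, slope = {slope:.3f}")
```

The sign of r always matches the sign of the regression slope, which is why the sleep example, with its downward-sloping line, has a negative correlation coefficient.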