Welcome to SPSS: An Introduction. I'm Barton Poulson, and in this course we're going to look at the statistical program SPSS and some of its basic functionality, to give you an idea of what it can do and how well it might work in your own data work. Now, the name SPSS deserves a little bit of explanation. Once upon a time, it stood for Statistical Package for the Social Sciences. Now it's just SPSS, but that's its origin.
One important thing to know is how popular SPSS is. Here's a chart that comes from the excellent website r4stats.com. What it shows is the number of scholarly articles published in 2015 using various statistical packages and languages. And what you can see is that right at the top is SPSS Statistics. SPSS is number one by far in terms of scholarly research. You can also look at jobs. Here's another chart, also from r4stats.com, and what this shows is analytics job listings in 2015 on indeed.com, one major source of tech jobs. SPSS is on the list, but this time it's actually a lot lower. It's number six. So there is a difference here between academic publishing and employment in analytics. Really, what this tells you is something about the audience for SPSS: the primary audience is academic researchers, especially in the social sciences, but also in other fields like business.
Now, there are some reasons that SPSS is popular in these fields. Number one, it's user friendly. It's got a point-and-click interface, which allows you to assemble code really quickly. You can save that code as what's called a syntax file, and then you can reuse it, adapt it, and share it with others. Also, SPSS is really well adapted for data from experiments, where you're comparing means via t-tests and analysis of variance, with several important options like effect sizes and power analysis built in. And so those are some of the reasons for SPSS's popularity, especially within academic research.
In sum, we can say a few things. SPSS, despite being developed nearly 50 years ago, is still popular. It's got an easy-to-use interface. And it's easy to save and reuse the syntax, giving you a code basis for the work that you do within SPSS.
The first thing we need to talk about in SPSS: An Introduction is setting up and getting ready to do the work. To do that, however, we need to take a minute and talk about versions, editions, and modules, which all refer to different kinds of things in SPSS. The choices can really make you think of an overwhelming plethora of possibilities ahead of you, so it's nice to break it down a little bit. The things we're going to talk about are versions, which are the release updates, version one, version two, and so on; editions, which vary according to what's included in a particular purchase; and modules, which are extra functions that you can add on to the abilities of SPSS.
We'll start by talking about versions. Version 1 came out in 1968, and at that point it was called Statistical Package for the Social Sciences. SPSS version 24 came out in 2016, and now it's called IBM SPSS Statistics, and they act like SPSS doesn't stand for anything. For this course, I'm using version 22 on a Macintosh computer. Fortunately, there haven't been any extraordinarily major changes between 22 and 24, and everything I'm going to show you in this course will work just fine in almost any other version of SPSS. Now, it is possible that you've heard of something called PASW at some point. SPSS was briefly called Predictive Analytics Software during a trademark dispute after SPSS got bought by IBM. That only lasted for a year or so, and it got resolved. The important thing to know is that no matter what version you're using, the files are generally highly compatible between versions. And so code that you created in version 16 is probably readable in version 24.
There are some backwards compatibility issues for advanced functions like automatic modeling and so on, but most of it is consistent all the way through.
Now, we also need to talk about editions of SPSS, and there are a few major choices here. There's the Base edition, the Standard edition, the Professional edition, and the Premium edition. They differ by price, and they differ by the functions that are included with each edition. So, for example, in Base you get basic statistics, linear regression, clustering, and factor analysis. Standard adds to that logistic regression, generalized linear models, and survival analysis. It also adds drag-and-drop interactive tables. The Professional edition adds to that data prep, forecasting, decision trees, and imputation methods. And then finally, the top-of-the-line Premium edition of SPSS adds bootstrapping, complex sampling, exact tests, and structural equation modeling. And so each one adds on a number of other functions.
Now, this is the product pricing as of August of 2016. You see, for instance, that SPSS starts in Base at $1,170 per user per year, so it's an annual license, and it goes all the way up to nearly $8,000 per user per year. And so it gets really expensive. However, I want to say this: don't panic. There are other ways to get SPSS aside from having to, you know, sell your house. Number one, there is a free trial: you can download SPSS and try it for 14 days. And the best way to use that time is to see if you can make a business case and get somebody else to buy it for you. There is also academic pricing. Student pricing for SPSS starts at $35 for six months. It's not the super-duper version, but it is absolutely sufficient for doing the majority of academic research.
Now, we also need to talk about modules. These are the components that add extra functionality to SPSS.
And they're primarily the things that differentiate the different editions. The available modules include advanced statistics, bootstrapping, categories, complex samples, conjoint, custom tables, data preparation, decision trees, direct marketing, exact tests, forecasting, missing values, neural networks, and regression. So that's 14 additional modules. That sounds like a lot, but if you compare it to the 9,000 packages that are available for R, there's a difference there. The other major difference is that these modules cost money, so you need to work that into your budget. On the other hand, there are also free plugins that make it possible to use code in R, Python, Java, and the Microsoft .NET framework within SPSS. And so there are abilities that you can add depending on what you need.
In sum, we can say this: SPSS has a long history as far as statistical software goes. There are several variations and additions that you can make to it by adding extra modules. On the other hand, it can be very pricey, so that's something to consider when you're doing the cost-benefit analysis of SPSS.
The next step in SPSS: An Introduction and setting up is simply taking a look at SPSS and seeing what the program's like. And the easiest way to do that is to just open it up. When you first open SPSS, you'll get this introductory splash screen that gives you an opportunity to open up some files, including recent files, and learn more about various things that you can get from SPSS. If you want to, you can click on this box, "Don't show this dialog in the future," and then you won't have to deal with it again. You can also just press Cancel. And that brings you to the data window in SPSS, which has a lot in common with a spreadsheet. It has these rows and columns, where you have one row per case and one column per variable. But there are some very important differences between SPSS and a spreadsheet.
To demonstrate this, I'm going to open up a data set that I've used recently. When this opens up, you see that it does resemble a spreadsheet: we have the variable names across the top, we have row numbers down the side, and we have data in the middle. Now, one important difference between the SPSS data window and a spreadsheet is this: you have a Data View, but you also have something called a Variable View. It's the same data set, but if we click on it, we see it in a different way. Each of the variables has metadata associated with it. So, for instance, for age, it tells you the type of the variable. These are mostly numeric. There's a string variable, but you can see there are a lot of choices here: numeric, comma, dot, and so on. You can also specify the width of the variable and the number of decimal places.
And then a really important thing that makes SPSS different from most other programs is the use of labels. This column right here shows variable labels. The idea is that we have a short, one-word variable name over here on the left. If you used a very old version of SPSS, names were limited to eight characters, and you sometimes ended up with very cryptic names. You don't have quite the same restrictions anymore, but what's common is to give a short name to the variable and then give it a label that is more descriptive. In addition, you can have value labels. So let's come here to marital and click on this. This is a way of telling SPSS that in that column, zero means unmarried and one means married. Obviously, you can make them whatever you want. And when you come back here to Data View, you have the option of seeing them. So I'm going to come right up here and click on this button, which will show the value labels. And you see how they've appeared now. I can have them go away. And with the variables, if I just hover over the name, then I see the longer label.
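The split between short variable names, descriptive variable labels, and value labels can be mimicked outside SPSS. Here's a minimal Python sketch of the idea; it is not SPSS code. The dictionary and function names are made up for illustration, and the wording of the variable label is a guess. Only the coding itself (zero means unmarried, one means married) comes from the demonstration.

```python
# Hypothetical stand-ins for SPSS metadata; none of these names are SPSS APIs.
VARIABLE_LABELS = {"marital": "Marital status"}  # label wording is an assumption
VALUE_LABELS = {"marital": {0: "Unmarried", 1: "Married"}}  # coding from the demo


def show_value(variable, code, use_labels=True):
    """Return the label for a stored code, or the raw code as text."""
    if use_labels:
        labels = VALUE_LABELS.get(variable, {})
        # Fall back to the raw code when no label has been defined.
        return labels.get(code, str(code))
    return str(code)


if __name__ == "__main__":
    column = [0, 1, 1, 0]  # made-up stored codes
    # Roughly what toggling the value labels on and off does to the display.
    print([show_value("marital", c) for c in column])
    print([show_value("marital", c, use_labels=False) for c in column])
```

The stored data never changes here; only the display does, which is the same reason you can turn value labels on and off freely.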
Going back to Variable View, you can also specify codes for missing values, you can give the width of the column and the alignment, and then you can specify the level of measurement. Now, SPSS uses three values: scale, which is an interval or ratio level variable; ordinal, which is ranked data; and nominal, which is categories. You also have the option of specifying whether something is an input variable, a target variable, or both. There are certain functions that use those, but most of the time, that's not a big deal. And you see that in this demonstration data set, those haven't been changed at all.
So the first window in SPSS is this data window. But there's more to SPSS than that. So, for instance, let's make a very quick graph. I'm just going to make a simple chart here, a histogram of age. And so you see I have a graphical user interface with drag-and-drop menus that allows me to assemble my commands this way. I hit OK, and then what we get is another window that opens up. It's super tiny up here, so I'm going to make it much bigger. And this is the output window. So it's a separate window: the data is in one window, and when you do an analysis, you get a separate output window. You can actually have multiple output windows. What this one does is hold the graph or any statistical analysis we do. It also has a table of contents over here, where you can collapse things or expand them. And an important thing is that I've got it set so it shows the code that SPSS generates behind the scenes to create this analysis. The neat thing about that is you can actually use that code and manipulate it directly. This code is called syntax in SPSS.
Now, by default, SPSS opens up only a data window and an output window. But you can get a syntax window as well. In fact, let me do that. I'm going to come up to File, New, and Syntax. And this is a very blank window.
But it's one that you can type in, or you can use the drop-down menus to put a command in there. So I'm going to come back here to the recent command. I did a histogram, and I could press OK again, but now what I'm going to do is press Paste. What that's going to do is get the code for that chart and put it right here. In fact, this is the part that we use. If I select that, I can hit Run. I can also press Command-R or Control-R, which runs the selection. And you'll see we get the output window again, and it's done the exact same thing a second time. But this time it did it from a window where I'm able to have the text. Now, a lot of people are uncomfortable with syntax, and they like the drag-and-drop menus. But a really important thing about syntax is that it allows you to save your analysis, so you can repeat it again without having to go through all the menus. You can simply paste the syntax from the dialogs into a syntax file, and then you can repeat it as many times as you want. It's also really easy to modify things when you do it that way. And syntax files are just plain text files. They're saved with the .sps extension, but they read just like plain text files.
Now, these are the most important elements of the SPSS environment: the data window, with both the Data View and the Variable View; the output windows; and the syntax windows that allow you to save the commands. This is what gives SPSS both some of its flexibility and its power. As you become more comfortable moving back and forth between these various windows and seeing what you're able to do, both with drag and drop and by typing text, you'll discover there's a great amount of flexibility and power in SPSS that can allow you to do the analyses you need to do and get the insight you want from your data.
We'll continue our introduction and discussion of setting up in SPSS by taking a look at the sample data that comes as part of the SPSS application.
The really nice thing about that is it allows you to get started now, start working with things, and see how SPSS works. The hard part, however, is that it's totally hidden, so you need to know where to look in order to use the sample data. Now, if you're on a Macintosh like I am, then it's going to be in your Applications folder under IBM, SPSS, Statistics, 22 or whatever version you're using, then Samples, and then English. In Windows, it's a little bit different. It's going to be C:, Program Files, IBM, SPSS, Statistics, 22 or whatever version you have, Samples, and then English. So you have to navigate to that manually in order to find the files. But when you do, you'll see a bunch of files there. There are a few kinds in particular that are important. There are the .sav files. These are data files in the proprietary SPSS format that usually can only be opened in SPSS. And there are also .sps files. These are SPSS syntax files: text files with the commands that can run a number of analyses, graphs, and other functions in SPSS. Now, you can try it in SPSS on your own computer by opening up a file called demo.sav. But let me show you how it works.
When you navigate to the folder with the SPSS sample files in it, and again, it's several hidden layers down, these are the files that you'll find. These are the .sav data files, and these are the .sps syntax files. Now, there are other things in there. There's something called a CSA plan, which is an analysis plan. There's an XML file, and there are a few other things in there. But the only ones that we're going to deal with are the .sav files, and possibly the .sps files. Let's scroll down here until we find demo.sav. Now, please note, there are a lot of other demo files around that.
So you want this one in particular, demo.sav, because that's the SPSS data file. I'm going to double-click on that, and SPSS opens up the file. Now, you can set SPSS so it has only one data file open at a time, or you can have multiples. I'm going to close this empty file right here. But here is our demo file, and this allows us to start working with a lot of the analyses and see how they work. In fact, I'll be using this file all the way through this entire course, because it allows you to do a number of analyses that require specific kinds of data, and this has it all set up. So I'll show you a very quick one. I'm going to come up to Analyze and to Explore, and I will get level of education and put that in. And so I have a long list of variables that I can work with. These are all the same variables. I'll just hit OK. My output window opens up microscopically here in the top corner, so I'm going to make it bigger. And now I'm able to start working with my sample data. That allows me to get some hands-on experience, to see how the functions work in SPSS, and to try some of the options and see how they affect things.
Our next step in SPSS: An Introduction is to look at basic graphics, because those are always a good first step in analysis. And the easiest way to do that in SPSS is with something called Graphboard templates. Really, you can just think of these as graphs made easy. The idea here is that if you set the levels of measurement in SPSS, then SPSS can suggest graphs that would be appropriate for those variables. Now, in terms of level of measurement, remember SPSS uses three. Number one is nominal, for different categories. Number two is ordinal, for ranks. And number three is scale, for interval or ratio level measurements. And then when you're in the Graphboard templates, you have two basic choices. You have Basic graphs, and those are where you first choose the variables that you want to graph.
Then SPSS will show you suggested graphs, and you can see what you want to do with them. There's also an option for Detailed, and this is where you choose the graph style first, and then you choose the variables that go into it. Now, these aren't exclusive; you can bounce back and forth between the two tabs. And it'll be easier to see how it works if we just go to SPSS. If you're logged into datalab.cc, then you should be able to download the exercise files from the same page this video is on. Open up this file, SPSS01_3_1_graphboard.sps, a syntax file, and let's see what it looks like.
The syntax file that you've opened looks kind of complicated, but that's really because I want to have a written record of the same things that we're going to do with the drag-and-drop menus in the Graphboard. We do need to open a data set. As I mentioned before, the path to the data sets is a little bit different depending on whether you're on a Macintosh or on a Windows computer, and also depending on the version you're using. I'm using 22, so if you're using something else, change that number right there; most of the rest should be the same. You can run this command to open up the data set and activate it. Now, I've already done that. I'll show you. There's my data set right there; it's demo.sav. And we can come down here to Variable View and see the levels of measurement that SPSS has assigned to these. Most of them are scale, we have a few that are ordinal, and we have only one variable in this data set that's truly coded as nominal. That's gender, which is actually a string variable in this case. I'll go back to the syntax. Now, I have some rather complicated syntax here, but what you'll see is that when we use the menus, it's actually pretty simple. The first thing we're going to do is make a chart of age. I'm going to come up here to Graphs, to the Graphboard Template Chooser.
When I come to that, you see I'm in this tab of Basic graphs, and this is where I choose a variable. I'm going to choose age right here. It recommends three different kinds of charts: a dot plot, a histogram, and a histogram with a normal distribution. We'll take the very first one that's available, the dot plot, and hit OK. It puts it in the output window, which I have to maximize. And there it is. It's a dot plot, which looks a lot like a histogram, of age in years. It goes down to 18 years, and it looks like it goes up to about 77 or 78. It's an easy way to get a feel for the distribution that we're dealing with. Again, the command in syntax is complicated, but the graphical interface makes this very easy to do. I'll go back to the syntax for a moment. If you were to paste the syntax for that command, this is what you would see right here. It's a way of saving it, and you can modify it manually if you want.
Now we'll do a histogram of age with a superimposed normal distribution. Again, I'll come up to Graphs, to Graphboard Template Chooser. This time, all I have to do is come over to the right, click Histogram with Normal Distribution, hit OK, and expand the output window. It's really simple. Now, both of those charts that I showed you were with age, which is a ratio level variable, or scale variable in SPSS terminology. We can also do this with categorical variables. I'll use gender and make a bar chart. I come back up to Graphs and hit Graphboard Template Chooser. And when I come down to gender, you'll see that the recommended charts change, because this time it knows it's a categorical variable. Now, if I had GPS data, I could put that in here; I can do a bunch of different things. I'm just going to do a bar chart because that's the easiest to deal with. I'll hit OK and make the output window bigger. There's my bar chart. And you see that in this particular data set, we have an almost exactly equal number of men and women, or data on them.
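The way the Basic tab changes its recommendations with the level of measurement can be pictured as a simple lookup. Here's a rough Python sketch of that idea. It is only an illustration, not how SPSS is implemented: the function name is made up, and the chart lists are limited to the charts mentioned in the walkthrough, so they are incomplete.

```python
# Chart suggestions as seen in the walkthrough; the real chooser offers more.
SUGGESTED_CHARTS = {
    "scale": ["Dot Plot", "Histogram", "Histogram with Normal Distribution"],
    "nominal": ["Bar Chart"],  # only the one chart used above; list is incomplete
}


def suggest_charts(level):
    """Return the chart types this sketch knows for a level of measurement."""
    # Levels not demonstrated in the walkthrough (e.g. ordinal) return nothing.
    return SUGGESTED_CHARTS.get(level, [])


if __name__ == "__main__":
    print(suggest_charts("scale"))    # what was offered for age
    print(suggest_charts("nominal"))  # what was chosen for gender
```

The practical point is the same as in SPSS: the suggestions are only as good as the levels of measurement you've set in Variable View.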
Now, those were the basic charts, where you choose the variable first and SPSS recommends particular graphs. You can also do detailed charts. These are ones where you choose the style of chart first, and then you fill in the variable. I'm going to do this for a dot plot of income, and then show you that it's really easy to modify it. I'll come up to Graphs, to Graphboard Template Chooser. This time I'll go to the Detailed tab and click on that. I'm going to make a dot plot, so I'm going to scroll through this; you see we have a lot of choices. I'm going to choose Dot Plot. Then it's going to ask what I want to make a dot plot of. I'm going to click on this and scroll to income. The one that I want is right here, household income in thousands. I can click OK, then expand the output window. And here's my chart. It's a really basic chart. You see that most of the people are at the low end, especially because this goes up into hundreds of thousands of dollars; that's going to be a million dollars right there.
But I want to show you an interesting thing about this. If we double-click on the chart, that opens up the edit window, and the Graphboard editor has some special options. For one thing, I can change the number of decimal places. Here, I just click on the decimals, come to Format, and change the minimum number of decimals to zero. That's better. But a more interesting one is if I click on the dots themselves. They're drawn as points, and the modifier is to pile them. There are a few other modifiers that can be useful. One is to dodge them. What that does is put them in the middle, expanding out either way. It might be a little harder to make comparisons from one level to another, but it's an interesting kind of chart. I can click on it again, and we can do what's called jitter with a normal distribution. That takes points with the same value and kind of randomly spreads them out up and down.
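To make the difference between these point modifiers concrete, here's a small Python sketch of the general ideas behind pile, dodge, and jitter with a normal distribution. How the Graphboard editor actually computes positions isn't shown in the video, so treat the function names, the offsets, and the `sd` and `seed` parameters as illustrative assumptions.

```python
import random
from collections import defaultdict


def pile(values):
    """Stack repeated values upward: offsets 0, 1, 2, ... within each value."""
    seen = defaultdict(int)
    offsets = []
    for v in values:
        offsets.append(seen[v])
        seen[v] += 1
    return offsets


def dodge(values):
    """Center each stack on zero so it spreads out in both directions."""
    counts = defaultdict(int)
    for v in values:
        counts[v] += 1
    return [off - (counts[v] - 1) / 2 for v, off in zip(values, pile(values))]


def jitter_normal(values, sd=1.0, seed=0):
    """Give every point a random vertical offset drawn from a normal distribution."""
    rng = random.Random(seed)
    return [rng.gauss(0, sd) for _ in values]


if __name__ == "__main__":
    incomes = [20, 20, 20, 35, 35, 90]  # made-up values, in thousands
    print(pile(incomes))           # stacks grow upward from the axis
    print(dodge(incomes))          # stacks are centered on the axis
    print(jitter_normal(incomes))  # random scatter; overlaps are still possible
```

Pile and dodge are deterministic, so every point stays visible and countable, while jitter trades that for a more organic look.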
And again, you can see that we've got a whole lot there at the bottom. One other choice is jitter uniform, which makes them stay within certain boundaries. But it's hard to tell really how much things are spread out there at the bottom. So I actually prefer pile, or I think dodge is interesting in this case. And so that's one way of using Graphboard: set the chart up, and then manually modify it by double-clicking on the chart. I can close this because I'm done with that. And you see I have the modified version right there.
Now, we can get a lot more complicated. So, for instance, I can make a scatter plot of age and income with colors for point density. There are a lot of options, and you can explore them. This time I'm going to do it a little bit differently. I'm just going to select this command. Again, the way I got these was by setting them up in the menus and then simply hitting Paste. That put the syntax into the syntax file so I could save it and run it later. So I'm going to show you how that works. I've got the command here that I created using the Graphboard Template Chooser, and I'll simply come up and select Run Selection. I maximize that window, and there you can see I actually have what's called a hex scatter plot. It's showing a few different things, and it's a really neat way to do it. And so you have a lot of options for the way you display things in the Graphboard Template Chooser. While the code is complicated, the interaction with the menus is really simple. You can be creative, get different views on your data, and try to get more insight as you're doing your analysis.
The next step in our introduction to SPSS and basic graphics is bar charts. We like bar charts for a very simple reason: they are simple, and simple is good. Or more specifically, bar charts are the most basic graphic for the most basic data, just frequencies for a simple category. It's also a very basic command in SPSS.
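Since a simple bar chart is nothing more than category frequencies drawn as bars, the underlying computation is easy to sketch. Here's a minimal Python illustration; the function names and the sample answers are made up, and this only imitates what a bar chart summarizes rather than anything SPSS does internally.

```python
from collections import Counter


def category_frequencies(values):
    """Count the cases in each category, most frequent first."""
    return Counter(values).most_common()


def text_bar_chart(values):
    """Draw one row of '#' marks per category, as a stand-in for bars."""
    rows = []
    for category, count in category_frequencies(values):
        rows.append(f"{category:<8} {'#' * count} {count}")
    return "\n".join(rows)


if __name__ == "__main__":
    # Made-up category values, only for illustration.
    answers = ["yes", "no", "yes", "unsure", "yes", "no"]
    print(text_bar_chart(answers))
```

Each bar's length is just the count for that category, which is why bar charts work for nominal data where means or medians would make no sense.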
Now, we actually have a few options for different kinds of bar charts. One, we can make a simple bar chart: a single variable, simply showing the category frequencies in that variable. Two, we can do a grouped bar chart, where we break it down by some other variable. And three, we can do multiple variables and show the bars simultaneously. But let's try this in SPSS. It's really easy to do. Just open up this SPSS syntax file, and we'll give it a whirl.
Once you've got the file open, you'll need to open the demo data set; we've used it before. This is the command for Mac if you're running 22, and this is the command for Windows if you're running 22. Just change the version number if you need to. Once you have the data open, we're going to make some bar graphs. I'm going to do it by coming up here to what are called the legacy dialogs. These are specialized, one-graph-only dialogs that come from earlier versions of SPSS. And seriously, I usually use these because I find them so quick and easy to deal with. What we're going to do is make a bar chart for levels of education in our sample. So I'm going to hit Bar. We're going to do a simple bar chart, and we'll do groups of cases. All I need to do is select level of education, put it into the category axis, and hit OK. I make the output bigger, and there it is. Absolute piece of cake. And it's also very, very simple syntax. You see the syntax right here; it really could be one line. Just as a point of comparison, here's the same chart produced with the Chart Builder, and you see we have this really complicated, overwhelming code. The legacy dialog produces it in an extremely simple way. So that's a simple bar chart, piece of cake.
Now let's do a clustered bar chart for groups of cases. We'll look at levels of education by gender. To do that, we come back up to Graphs, to Legacy Dialogs, to Bar. And now we're going to choose clustered, for level of education clustered by gender.
So I hit Define, and we'll get level of education. That's sort of our outcome variable. I put that under category axis and then define clusters by gender. Let me put that right there. I'll hit OK and make it bigger. This time it uses nicer colors. You have the five levels of education broken down, where women are in blue and men are in green. It's really easy to see the relationship between the two variables here. In this particular data set, it really looks like there's no substantial difference between the men and women. Now, I'll say I believe this is an artificial data set, so we wouldn't expect a lot of differences. But this is a nice way to compare them. By the way, come up and you'll see that the code for this is really simple. All it does is add BY gender. And so again, a very short command.
I'm going to go back to the syntax, and we're going to do one more here, and that is for multiple variables. This is a situation in which it can be confusing if you have a lot of categories within each variable. What I'm doing here is getting the means of variables, or the numbers of ones. If you have indicator variables, where it's a zero for no and a one for yes, this is a really nice way of comparing the frequencies of each one of them. I'll show you how that works. We'll go up to Graphs and come back over to Bar. We're going to do a simple one, but this time we're doing separate variables. I'll hit Define. Then I'm going to come down here. This data set, which again I believe is fictional, asked a lot of people about various things that they might have. We're going to start with wireless service and come down to whether they own a fax machine, because this is old data, and it's asking about old technology, like pagers. I've never had a pager. But I simply select all those variables and put them in here. And as long as they're all on the same scale, it's going to do the mean of each one.
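To see what "the mean of each one" works out to for yes/no items, here's a short Python sketch. The variable names and responses are invented for illustration (they are not values from demo.sav), and the function name is hypothetical.

```python
def indicator_means(data):
    """Mean of each 0/1 column, which is what the bar heights would show."""
    return {name: sum(column) / len(column) for name, column in data.items()}


if __name__ == "__main__":
    # Made-up responses for three yes/no items; 1 = yes, 0 = no.
    responses = {
        "wireless": [1, 0, 0, 1, 0],
        "pager": [0, 0, 1, 0, 0],
        "fax": [1, 1, 0, 1, 0],
    }
    for name, mean in indicator_means(responses).items():
        ones = sum(responses[name])
        total = len(responses[name])
        # Summing a 0/1 column counts the ones, so the mean is ones / total.
        print(f"{name}: mean = {mean:.2f} ({ones} of {total} said yes)")
```

Because summing a 0/1 column just counts the ones, each mean can be read directly as the share of people who said yes, which is what makes the bars comparable.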
And on a zero-one variable, the mean is the proportion of ones. I hit OK, and there we have it. It's a way of looking at the distribution of multiple variables simultaneously, so it's a very information-dense display. And especially when you as an analyst are exploring your data, this can be a really quick and easy way of getting a feel for your data, which can then direct your further analyses.
As we continue to look at basic graphics in SPSS, a really common one is histograms. This is a graphic for data that is quantitative, or scaled, or measured, or interval or ratio level; those are really all referring to basically the same thing. And for any of them, you're going to want to make a histogram to see what the variable is like. Now, I'll mention that SPSS prefers the term scale for these variables, and that's what shows up in the data definitions. I like to think of it as the scales of justice. But why are we making a histogram? The point is to see what you have, to see what the data is like. And there are a few things in particular that you're going to be looking for. Number one, you're going to be looking for the shape of the distribution: is it unimodal, bimodal, skewed left, skewed right? Are there gaps in the data that suggest that maybe you have some important mechanism operating? Are there outliers that you would need to take into consideration before you do your analysis? Is your data symmetrical? There are a lot of different things that you could look for, and some of these are going to have a lot of influence on your analysis. So it's important to take a look at the data, and a histogram will give you a great impression of a quantitative or scaled variable. We'll try it in SPSS. Simply open up this syntax file, and we'll see how it works.
When you're in SPSS, most of this is really just to open up the data set, the same one we've used in the others, the demo data set. Here's the code for Mac; adjust the version number if you need to. And here's the code for Windows.
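Since the Mac and Windows versions of the open command differ only in the folder and the version number, the pattern can be sketched in a few lines of Python. This is a hedged illustration: the folder layouts follow the spoken description of where the samples live and should be checked against your own installation, and the function name is made up. GET FILE itself is the SPSS command for opening a .sav data file.

```python
def demo_open_command(platform, version=22, filename="demo.sav"):
    """Build an SPSS GET FILE command for a sample data file."""
    # Folder layout is an assumption based on the description in the video.
    parts = ["IBM", "SPSS", "Statistics", str(version), "Samples", "English", filename]
    if platform == "mac":
        path = "/Applications/" + "/".join(parts)
    elif platform == "windows":
        path = "C:\\Program Files\\" + "\\".join(parts)
    else:
        raise ValueError("platform must be 'mac' or 'windows'")
    # SPSS commands end with a period.
    return f"GET FILE='{path}'."


if __name__ == "__main__":
    print(demo_open_command("mac"))
    print(demo_open_command("windows", version=24))
```

Only the version number changes between releases in this sketch, which matches the advice to adjust that one number if you're not on 22.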
But once you have the data set open, you can use the commands, and it's really, really simple. All you need to do is come up to Graphs. We'll go to Legacy Dialogs, and we'll come down here to the bottom to Histogram. We're going to make a basic histogram of age. So I click that, and I come to age; it's our first variable. I simply click this to move it over and hit OK. I'll make the output window bigger, and there's our histogram. From this, we can see that our distribution is unimodal. We can see it's pretty close to normal. It's slightly skewed on the high end, but not very much. This is going to be a really good variable for most of our analyses, because it meets at least most of the assumptions of the kinds of procedures that we might want to use.
Now, if I want to make things slightly more complicated, because you see that the command for this is extremely simple, we can make a small modification. I'll show you here: we can superimpose a normal distribution. All I have to do for that is come back to Graphs, to Legacy Dialogs, and to Histogram, and just check this box right here, Display normal curve. What that's going to do is create the same distribution, but put on top of it a line for a bell curve, or normal distribution, that has the same mean and standard deviation. Here you can see that we're pretty close to normal, and this is a nice way of confirming that. And again, the code for it is really simple: all it does is add the word NORMAL in parentheses. That gives us everything we need. So one of the reasons I really like the legacy dialogs in SPSS is that they're so concise and so simple, and they get you what you need, so you can get a grip on your data and move ahead.
As we continue SPSS: An Introduction and basic graphics, we should look at scatter plots, a very common method of looking at associations. Or, as I like to think of it, a way of assessing togetherness in data.
In other words, you want to see what goes with what or, more specifically, what variable goes with what other variable. So scatter plots are a great way of visualizing the association between two quantitative variables. When you make a scatter plot, there are some things you should look for. They include, for instance, whether the association between the two variables is linear, because a lot of the procedures that are common assume that you can draw a straight line through the data. You want to check the spread of the data, especially whether the spread changes as you go from left to right on a scatter plot; that's called heterogeneity of variance, and it can cause problems with certain procedures. You want to look for outliers, either univariate, that's a score that's unusual on a single variable by itself, or, what's even more significant in this case, bivariate, where you have an unusual combination of scores. And then finally, you want to try to get some idea of the correlation, or the strength of the association between the two variables. And a scatter plot will allow you to do all of those. Now in SPSS, there are three general kinds of scatter plots that you can do. Number one is a simple scatter; it's a bivariate x and y chart, easy to do. Number two is a matrix scatter plot, where you actually have several variables and you look at them simultaneously. And it's a good way of looking at complex associations between collections of variables. And number three, SPSS is able to do a 3D scatter plot, but I'll have some words to say about that a little bit later. But let's try this and see how scatter plots work in SPSS, at least very basically. So just open up the syntax file, and we can see how it works. When you open up the syntax file, we have the same situation where you can load the data; we'll use demo.sav. And you can use this command if you're on a Mac using version 22, and this command on Windows version 22.
But we're just going to make a couple of scatter plots, and it's a really basic, easy command. The first thing we're going to do is make a scatter plot of age and income. So let's come up to Graphs, to Legacy Dialogs, and down to Scatter. I'm going to use a simple scatter, that's just a basic bivariate x-y chart, and I'll hit Define. And all I need to do here is pick my variables for the x axis across the bottom and the y axis up the side. I'm going to pick age for the x axis and put it right there, and household income for the y axis. The idea is maybe there's an association between household income and how old a person is. That's all I need to do, except click OK. And when I do that, I get this basic scatter plot. So I have age in years across the bottom, and I have household income in thousands up the side. And you can see, of course, that most of the people are near the bottom; that's because most people make less than $200,000 a year. This graph goes up to 1.2 million. We have a marker that's a large empty circle in black, and you can change the markers, and there are things you can do to clean up the chart. But it's also easy to tell that the people who make a lot of money, for instance, are generally older. And so we can see in this data there is some kind of association between age and income. But let's try to get a more nuanced one by looking at several variables simultaneously with a scatter plot matrix. Come back up to Graphs and Legacy Dialogs, and down to Scatter. This time, however, I'm going to pick matrix scatter and click Define. And then all I need to do is pick the variables I want to include. I don't have to specify X or Y because they're all going to serve as both X and Y in different parts of the matrix. I'm going to pick a few here. I'm going to get household income, I'll move it over. I will get age and move that over. I'll get address, years at current address, and move that over.
I'll get reside, which is the number of people residing in the house, and move that. And then finally, I'll get level of education. There's nothing especially meaningful about these; they're just ones that I thought would be easy to look at. Now, as a general recommendation, if you do have one variable that is an outcome variable, you might want to put that one in first. That puts it in the first column and the first row, and it makes it easier to find when you're looking at your analyses. But I've got my five variables in there and I just come in and press OK. It takes a moment, and then it comes up. And this is the scatter plot matrix. And so you have all five variables listed on the side, and you have all five variables listed across the bottom. So each one functions as both an X and a Y. You have empty boxes down the diagonal, because that would be each variable with itself, and the correlation is always one. Now, there are things you can do to clean this up: you can change the marker from a big black circle to something that's smaller and easier to see, and you can put regression lines through. But it's easy to see that there are some really important patterns. So for instance, age in years and years at current address, right here: obviously, there's a limit, you can't live someplace longer than you've been alive. That's why we have nothing in the top left of that. But you do see some associations and some cutoffs that go through. Now, this one's really dense. In a lot of situations, it's going to be a lot easier to see the patterns that are there, especially if you change the markers and put in regression lines. But this gives a good idea of what you can do with a scatter plot matrix. Now, let's go back one more time to the Legacy Dialogs and to Scatter, because you saw that there were other options there. There's a dot plot that's like a histogram, and there's an overlay scatter, which I don't want to deal with.
And then there's a 3D scatter, and you might look at that and go, cool, it's interactive, it's 3D, it's a great thing. I'm actually not even going to do it, because every time I've done a 3D diagram, I've found it's impossible to read it clearly. It's very hard to manipulate in SPSS. And it ends up being really a bad experience. And it's much easier to look at the association between variables using a scatter plot matrix. That's why I recommend that you avoid the 3D completely, even though it's available here, and use the bivariate scatter plot and the scatter plot matrix as ways of looking at the associations between variables in your data.
Once you've done the basic graphics for your data and seen what you're dealing with, it's a good idea to move on to basic statistics. And in SPSS, the most basic version of this is frequencies. I like to think of it as putting things into buckets, and then simply counting what's in the buckets. So the idea is, when you have a limited number of categories in your data, then you should just count how often each category occurs. It's a first step to some really significant insight. But wait, I just want to mention that the frequencies command in SPSS can do so much more than that. And I'm going to show you how it works. For example, it can do charts: it can do bar charts and pie charts and histograms and normal distributions. And it can do a lot of statistics beyond frequencies: it can do quartiles, percentiles, mean, median, mode, standard deviation, variance, skewness, kurtosis, and so on. In fact, because of this, I like to think of frequencies as SPSS's version of the competent man character in literature and movies, who can do everything well. You know, somebody like Leonardo da Vinci or Iron Man, who seems to be able to do everything, or, you know, Marie Curie right here, because she won two Nobel Prizes, and what have the rest of us done? But anyhow, back to statistics.
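The buckets-and-counting idea can be sketched outside SPSS in a few lines of Python. This is only an illustration, not the SPSS procedure itself: the function name `frequency_table`, the use of `None` for a missing answer, and the ten answers are all made up for the example. It counts each category and reports the percent of all cases, the percent of non-missing cases, and a running total of the latter.

```python
from collections import Counter

def frequency_table(values, missing=None):
    """Return rows of (value, count, percent, valid percent, cumulative percent).

    Hypothetical helper for illustration; `missing` is the marker used for
    a missing answer (an assumption of this sketch).
    """
    total = len(values)
    valid = [v for v in values if v is not missing]
    counts = Counter(valid)          # one bucket per distinct value
    rows = []
    cumulative = 0.0
    for value in sorted(counts):
        count = counts[value]
        percent = 100 * count / total             # share of all cases
        valid_percent = 100 * count / len(valid)  # share of non-missing cases
        cumulative += valid_percent               # running total, ends at 100
        rows.append((value, count, round(percent, 1),
                     round(valid_percent, 1), round(cumulative, 1)))
    return rows

# Made-up answers on a 1-5 scale; None marks one missing answer.
answers = [1, 2, 2, 3, 3, 3, 4, 4, 5, None]
for row in frequency_table(answers):
    print(row)
```

With no missing answers, the percent and valid percent columns are identical, because both divide by the same number of cases.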
Let's take a look at frequencies and let's try it in SPSS. Just open up this syntax file, and we'll see the things that it's able to do for you. As always, we need to begin by opening a data set. We'll use demo.sav, and you can use this command on Mac or this command on Windows to do that. Once you have the data set open, it's a very simple thing to get the frequencies. Now I have the syntax saved here. But really, it's more a record of what I've done, because I use the dropdown menus to create these commands. So I'm going to come up to frequencies, and I'm going to get the frequencies for gender and job satisfaction. To do that, I come to Analyze, to Descriptive Statistics, and then the first option there is Frequencies. And what I'm going to get is gender, which is right here; I'll just double click to move it over. We'll also get job satisfaction; I'll double click and move that over. Now, what's important is these are two different kinds of variables: gender is a categorical, nominal variable, and job satisfaction here is a scaled variable. And so normally you don't do the same kinds of things for these, but frequencies is very flexible. So I'm just going to hit OK, and we'll see the default output for frequencies. The first thing that it shows us is how many valid observations there are. So how many of our 6,400 cases have data on these variables? The answer is all of them. There's no missing data here. And then it comes down and it gives us frequency tables, where it lists every value or possible score on the variable and then says how often each one occurs. So for gender, we have 3,179 female respondents; that's 49.7%. And the percent and the valid percent would be different if we had missing data. But we don't, so we can ignore that, and then the cumulative percent simply adds up to 100. And then job satisfaction: this is a scaled variable which has 1, 2, 3, 4, 5 as answers. And here you can see how many people put each of the answers.
17% highly dissatisfied, 21.8% neutral, 19.1% highly satisfied. And that's a quick look at the frequencies that we're dealing with. It's also a nice way to check if your variables are coded well. But we can do more than that: we can also turn off the tables, and we can do bar charts using the frequencies command. So I'm going to keep those same two variables, gender and job satisfaction. But this time, I'm just going to make bar charts. I'll go back to my recent commands, Frequencies. And what I'm going to do is click this. It's going to give me a little error message because I haven't changed the other thing first. And I'm going to come to Charts right here. I'm going to tell it to make bar charts; obviously it can make pie charts and histograms as well. I'll click Continue, and then click OK. And now the same general command, frequencies, is not producing tables, but is producing charts. And here you can see that we are very closely matched in terms of the number of male and female respondents. And here you can see job satisfaction sort of peaks at neutral and somewhat satisfied. And so that's a really nice thing: you don't even have to use the bar chart command, you can do it right here. You can also get more kinds of statistics in there. So for instance, for this one, I'm going to keep the tables off, but I'm going to ask for a few extra things. In fact, let me just come back to this one. We'll go to Analyze, Descriptive Statistics, and Frequencies. And this time I'm going to do age, reside, and jobsat. So I'm going to remove my one categorical variable here; I'll just reset that. I'll do age, and the other two, reside and jobsat. And then I think that's this one right here. And we'll come down to job satisfaction, and we'll move that over. So I have three variables, but they're all scaled variables. What I'm going to do here is first come to Statistics. And I have a really impressive range of things I can get.
I can get the mean, I can get the median, and the mode. If you want the mode, I think this is the only place to get it in SPSS. I can get quartile values. Now it doesn't do the minimum and the maximum with those; you have to select those separately down here. But you can also get cut points. Now, cut points are an interesting one. The quartiles are cut points: they split the data into four equal-sized groups with the same number of people in each. Sometimes you want something other than that. So for instance, I know that if you're doing propensity scores, it's not uncommon to use five equal groups, quintiles. And also, there are situations in which you want not the most extreme scores, but near the most extreme. And so I'm going to put in, for instance, the 2.5 percentile and the 97.5 percentile, because those frame the middle 95% of the data. I can also get the standard deviation and the variance. Is there anything else I want right here? I want skewness and kurtosis. I'm going to hit Continue. Then I'm going to come back to this one. I'm going to turn off the frequency tables, because otherwise, with a lot of different possible answers here, I'd have a lot going on. I'll hit Charts. And this time I'm going to ask for histograms, and we'll put a normal curve on top of each histogram. Click Continue, and click OK. And so here's what we get. It starts with the statistical output. Here are the three variables I selected. It gives us the mean, the standard deviation, the variance, skewness and standard error of skewness, and kurtosis. We have the minimum and maximum scores. And then the percentiles. Now it's a funny list here because I've got three things intermingled. I have the quartiles. That's something I asked for. So we have the 25th percentile, the 50th percentile, and the 75th percentile. Those are the quartile values. I had the minimum and maximum up here, so those are the zero and 100 percent points. But I also asked for quintiles. And so that splits it at 20, 40, 60 and 80 percent.
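The three kinds of cut points in that list can be sketched with Python's standard `statistics` module. This is a sketch under assumptions, not a reproduction of the SPSS output: the ages are made up, `cut_points` is a hypothetical helper, and the interpolation rule (`method="exclusive"`) is simply one common definition. Different programs define percentiles slightly differently, so small discrepancies from SPSS are possible.

```python
import statistics

def cut_points(values, groups):
    """Cut points that split the data into `groups` equal-sized groups.

    Hypothetical helper; returns groups - 1 values.
    """
    return statistics.quantiles(values, n=groups, method="exclusive")

# Made-up data: 60 ages, one each from 18 to 77.
ages = list(range(18, 78))

quartiles = cut_points(ages, 4)   # 25th, 50th, 75th percentiles
quintiles = cut_points(ages, 5)   # 20th, 40th, 60th, 80th percentiles

# Splitting into 200 groups gives cut points every 0.5 percent, so the
# 2.5th percentile is the 5th cut point and the 97.5th is the 195th.
half_percents = cut_points(ages, 200)
p2_5, p97_5 = half_percents[4], half_percents[194]

print("quartiles:", quartiles)
print("quintiles:", quintiles)
print("middle 95% runs from", p2_5, "to", p97_5)
```

All three requests are the same operation with a different number of groups, which is why they end up intermingled in a single percentile table.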
And then finally, I manually entered the two and a half percentile and the 97 and a half percentile. And so they're all put there together, but it's really easy to see the changes in the distribution. Beneath that we have the histograms. And each variable has its own histogram, along with a normal distribution with the same mean and standard deviation laid on top. Age is pretty close to normal. Years at current address, however, you can see is really skewed, because most people haven't lived there that long. And then finally job satisfaction is a little flatter than we would expect if it were normally distributed. The point of this is that I'm able to do a tremendous amount of statistical and graphical work using a single command, the frequencies function in SPSS, one of the most versatile commands you'll ever use.
In our previous movie, we looked at the power of the frequencies command. But for basic statistics, another very common choice is descriptives within SPSS. The neat thing about descriptives is it allows you to achieve maximum density. That is, how to get a lot of numbers on a lot of variables in just a little space. That's what descriptives is really good for. On the other hand, there is a restriction: it works only with numerical variables. But that's a lot of the data that you might have. And if you have that, it can give you things like the mean, the sum, the standard deviation, the standard error, the variance, the minimum and maximum, the range, the skewness, and kurtosis. Now, you might say, but guess what, in case you don't remember, frequencies does more. But that's okay. There are certain things that the descriptives command does well. Here's what it does well. First, it gives you very concise, compact tabular output. So it's really easy to see a bunch of information in a small space. Second, it's a really quick way to find obvious errors in coding in your data.
Finally, you can get proportions for indicator variables, that's zero-one variables, and I'll show you how that works. Also, we have a bonus feature here in descriptives. Descriptives is the home of SPSS's top secret, hidden, one-step Z score procedure. I've seen people knock themselves out trying to get Z scores by getting standard deviations and means. You don't have to do any of that; you click one button and you're done. But let's try it in SPSS and I'll show you how it works. Just open up this syntax file, and we'll see what you can do with descriptives. We'll begin as always by opening the data set; we'll be using demo.sav. Here's the path on a Macintosh running version 22, and the path on Windows, also running version 22. This is my first command, and it looks really long, but that's because I have a lot of variables in it. All we need to do is come up to Analyze, to Descriptive Statistics, and Descriptives. We click on that. Now one of the things it does is it only shows you the variables that it can analyze. So gender, which was a string variable, meaning it had just text, is not in there. But what I can do is just select all of them with Command-A or Control-A, and then move everything over. And then I'm just going to do the default analysis; I'll just hit OK. And here's our output. We have a whole bunch of variables. And it tells us first the number of observations, and it's 6,400 almost all the way down; this question about internet is missing some data, but that appears to be the only one. We have the minimum value and the maximum value. By the way, this is what I meant by quick and easy data checking. If you have a variable that's only supposed to go from one to five or zero to one, and you have a 17, you know something's wrong. And so simply checking the outer boundaries is a fast way of seeing if there are any really obvious errors.
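The three uses of descriptives mentioned here (boundary checks, proportions from zero-one variables, and Z scores) each come down to a line or two of arithmetic. The Python sketch below shows that arithmetic on made-up numbers. The function names are hypothetical, and using the sample standard deviation (dividing by n - 1) for the Z scores is an assumption of the sketch.

```python
import statistics

def out_of_range(values, low, high):
    """Boundary check: return any scores outside the allowed range."""
    return [v for v in values if not (low <= v <= high)]

def proportion_of_ones(indicator):
    """For a variable coded 0/1, the mean is the proportion of ones."""
    return statistics.mean(indicator)

def z_scores(values):
    """Standardize by hand: subtract the mean, divide by the standard deviation."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)   # sample SD (n - 1), an assumption here
    return [(v - mean) / sd for v in values]

# Made-up ratings that are supposed to run from 1 to 5.
ratings = [1, 3, 5, 17, 2, 4]
print(out_of_range(ratings, 1, 5))    # [17]

# Made-up indicator: 1 = yes, 0 = no.
owns_tv = [1, 1, 1, 0, 1]
print(proportion_of_ones(owns_tv))    # 0.8

# Made-up incomes in thousands.
incomes = [20, 35, 50, 65, 80]
z = z_scores(incomes)
print([round(score, 2) for score in z])
```

The standardized scores come out with a mean of zero and a standard deviation of one, which is what makes them comparable across variables measured in different units.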
We also have the mean and the standard deviation, two of the things you generally need, the first two moments of a distribution. And so that's a lot of information and it's in a very concise format. That's a wonderful thing. Now if we go back to the syntax, I do want to mention this one thing about indicator variables that I brought up earlier. It's this: if you have indicator variables, that's a binary or dichotomous variable that has only two possible values, and if that variable is coded as zero and one, then you can in fact get the mean of it. And it tells you something: it tells you the proportion of observations that have ones. And this works best if you use the standard programmer format of zero equals false or no, and one equals true or yes. And strangely, in this particular data set, that's true for most of the variables, but not the last one or two in demo.sav. And I have no idea why they switched that. But it's something that you want to check in the coding before you go ahead and do it. And so if I go back to the output, you can see, for instance, that most of these, wireless service down through owns fax machine, are all zero-ones where zero is no and one is yes. The mean right here tells us that 99% of the people own TVs, 96% own VCRs, because this is a long time ago, and that 25% had paging services. And I like this one, where's the internet on this list? 27% had the internet, because this was apparently generated in, like, you know, 1990, who knows when. Anyhow, those are meaningful data points: the mean tells you the proportion of ones or yeses. I'll go back to the syntax here. And then let's take a quick look at the Z scores. Now any reasonable person would think that a Z score is a transformation of the data and therefore it would be under the Transform menu. But you know what, it's not there. Instead, it's hidden as an option in descriptives. So let's go back to descriptives. And let's do age and income. So I'm going to reset this. I'm going to pick age.
And I'm going to pick household income. And I'm going to get both of these as Z scores, because a lot of procedures work a lot better if you have Z scores. All you have to do is this: click Save standardized values as variables. And if I hit OK, what it's done here is give me the descriptives, because I actually still ran the descriptives command for those two variables. But more significantly, let's take a look at the data set. When I come to the data set, if I scroll to the end here, I have two variables that were not there previously, Zage and Zincome. And they have lots of decimal places, because you need those with Z scores. Now, under normal circumstances, you would want to save this into the data. I'm not going to do that, because this is one of SPSS's built-in default data sets. But I do want to show you that we can do one other thing here. Let's go back and get descriptives for those Z scores. So I'm going to Analyze, Descriptive Statistics, Descriptives. I'm going to reset this. Come down to see our two new variables. I'll do a little shift-click to get both of them, then pop them over here. Then I'll hit OK. And as you would expect, a Z score has a mean of zero and a standard deviation of one. And we didn't have to do it manually, we didn't have to remember any values, we didn't have to round things off; it did that exactly for us. And so that is what the descriptives command does. It makes a very concise tabular output. And it also allows you to save standardized scores, or Z scores, for use in certain procedures.
For a final look in SPSS at basic statistics, we'll look at the Explore command. I like to think of this as a way to get a lot closer, to get a macro view of your subject, get closer and see what's there in detail.
Now the Explore command is going to give you a bunch of statistics. It can give you the mean and the confidence interval for the mean, and the trimmed mean, as well as the variance, the standard deviation, the interquartile range, the minimum and maximum, the range, skewness, kurtosis, a collection of M-estimators, which are special robust ways of measuring the center of a distribution, percentiles, which we've seen before, and lists of outliers. It can also give you a collection of plots. It's the one place in SPSS where you can get a stem and leaf plot. Now traditionally, those are things that are drawn by hand. So it's kind of cute to see a computer do them. You can also get box plots. And you can get histograms. And you can get a set of normality plots, such as a Q-Q plot or a detrended Q-Q plot. And the neat thing after that is you can break all of these analyses down by groups. So let's try it in SPSS and see how it works. Just open up this syntax file, and we'll run through the various procedures in Explore and see how it can add to your own analysis. As always, we'll begin by opening the demo.sav data set. Here's the command for a Mac, and here's the command for Windows. Now again, I'm saving this as syntax; that makes it repeatable, and it means you can download it and try running it on your own. But I created all of this by using the menu commands. Let's start by doing a default explore analysis for a couple of variables. I'll come up to Analyze, to Descriptive Statistics, and then we'll come here to Explore. And what we're going to do is age and income category. And again, this is kind of interesting because these are different kinds of variables. Age is a scale variable. And income category, in this case, is an ordinal variable. I'm just going to leave all the defaults as they are and hit OK. And here's what we get from this. First, we find out whether there were any missing cases; there weren't in this situation.
And then we get a collection of descriptive statistics, first for age, then for income category. We have the mean with the standard error, the confidence intervals, the 5% trimmed mean, median, variance, standard deviation, minimum, maximum, range, interquartile range, skewness and kurtosis, along with their standard errors. And so there's a lot of information there. And if we scroll down we find the same kinds of information for income category in thousands. Now remember, some of this you wouldn't normally want to use, because income category in this case is not a scaled variable. And a lot of these things, like minimum, maximum and trimmed mean, work best with a scaled variable. But SPSS is able to kind of run it on everything. So interpret with caution. Then we come down and look, we have a stem and leaf plot. This is age, which in our sample is two-digit numbers. And so this means 18, 18. And each of these leaves, each of these numbers over here, is a leaf that represents 10 cases. Remember, we have 6,400 cases. So we have about 640 numbers right here. And you can see, for instance, that the 30s appear really common, the late 30s, and that we go up to somebody in their late 70s. And so that's an easy way to see what's going on. Simultaneously, we get a box plot. And the nice thing about this is you can tell really quickly that there are no outliers on age, not in this particular data set. We do the same thing with income category. And the stem and leaf plot looks funny. But that's because there are only a few possible values, one or two or three or four. And it's drawing it anyway, so it looks a little weird. But we can come down and get the box plot as well. And we see there are no outliers, at least on this kind of variable. Again, not normally something you would do with a rank order variable. But it's possible here. Now the neat thing is there are additional statistics. I'll do the same two variables, but I'm going to go check off a lot of options that I have right here.
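For anyone who has not drawn a stem and leaf plot by hand, the construction described above is easy to sketch in Python. This is an illustration on made-up ages, not the SPSS algorithm: the tens digit is the stem, the ones digit is the leaf, and in this sketch each leaf stands for a single case rather than ten. The function name is hypothetical, and it assumes non-negative whole numbers.

```python
from collections import defaultdict

def stem_and_leaf(values):
    """Build stem-and-leaf rows: tens digit | ones digits, in sorted order.

    Hypothetical helper; assumes non-negative whole numbers, and each
    leaf represents one case.
    """
    stems = defaultdict(list)
    for value in sorted(values):
        stem, leaf = divmod(value, 10)   # 38 -> stem 3, leaf 8
        stems[stem].append(str(leaf))
    return [f"{stem} | {''.join(leaves)}" for stem, leaves in sorted(stems.items())]

# Made-up ages.
ages = [18, 18, 19, 23, 27, 31, 35, 36, 38, 39, 42, 47, 55, 61, 77]
for row in stem_and_leaf(ages):
    print(row)
```

The longest row shows where the data pile up, so the plot doubles as a sideways histogram that still displays the individual values.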
So let's go back to that dialog; I'll go to Explore. And what I'm going to do is say, just give me the statistics right now. And I'll come up here and I'll make some selections. One thing: although 95% confidence intervals are by far the most common, I have seen situations where people used 80% confidence intervals. So you can change it if you want. Then I can get all of the M-estimators; it's a whole collection. I can get a list of outliers and a list of percentile values. I hit Continue and I click OK. And now we have the same table we had before; that's the descriptives up there at the top. Then we have the M-estimators. And these are four different robust measures of center. Again, all of them are trying to give us something equivalent to the mean. And you see in this case, for Huber's M-estimator, Tukey's biweight, Hampel's M-estimator, and Andrews' wave, the numbers are all pretty similar. I mean, it goes from a low of 41.18 to a high of 41.52. But they're all really close. And each of these has specific parameters that go into it; you can't adjust them in the dialog box. But let me just return to the syntax for one second. You see here, these are the parameters for each of the estimators; you could change them here if you wanted to. I'll go back to the output. Then we have percentiles, 5, 10, 25, up to 95. And then it gives us the case numbers for the highest and lowest five cases on each variable. And so this is a really nice way of seeing a multidimensional picture of our data. Now, in terms of pictures, an even better way to do this is with more graphs. So let me go back to the syntax for a second. And you see that we can get some additional plots. I'm going to use age and income category again, but I'm going to change what it tells us. So first off, I'm going to say give me just the plots. So we're not going to get any statistics. And I come to the Plots menu and say, well, we have a stem-and-leaf by default. Let's get a histogram.
Let's also get normality plots. That's a way of assessing how closely your data match a normal distribution. I'll hit Continue, and OK. And now I have a histogram for age, and the stem and leaf plot. But this one here is new. It's a normal Q-Q, or quantile-quantile, plot of age in years. And if it were normally distributed, all of these circles would fall exactly on this line. You see, it's really close, but it does deviate at each end. And then a detrended one takes that line and sort of flattens it out. And it's much easier to see where the changes are. Now I know the deviation looks really big in this case. But this variable is in fact pretty close to a normal distribution. Then we have our box plot. And then we do the same thing for income. We start with a histogram, our stem and leaf plot, and our normal Q-Q plot; again, a little weird, because there are only four possible values in this data set, but they all fall pretty well on the line. And there's our detrended plot. And then finally, the box plot that we saw before. Now, there's one more thing we can do with the Explore command. And that is we can take some of these analyses and break them down by groups. So if we go back to the syntax, we'll see I'm going to do income and break it down by gender. Let's go back to the menu here. Go to Explore. And I'm going to reset this. And we're going to take income and put that into our dependent list, the outcome variable, the thing that we're trying to predict. And then we'll take gender, scroll down a little bit, there's gender, and put it into the factor list. Sometimes people call that the independent variable, if it's an experimentally manipulated variable, or the predictor variable. I'm going to come up here and I'm actually going to skip the statistics and get plots only. I don't want the stem-and-leaf, but I will get a histogram. I'll get the normality plots.
And now because I'm breaking it down by groups, I can check the spread versus level with the Levene test. The idea here is that the data should be spread out approximately the same amount for each of the groups, so we can compare them using some uniform statistics. I'm going to do what's called a power estimation here. Click Continue and then OK. And now what we get again is a list of the number of cases that have complete data, and all of them do. There's no missing data. We have tests of normality. And what we see here, based on both of these tests, is that the data for neither group are normal. That's okay, because we knew that income was strongly positively skewed. As for homogeneity of variance, whether the two groups have about the same variance or spread, you know, there is some difference, but it is not statistically significant. And so the spread appears to be the same for the men and the women, which is good in this particular data set. Then we can come down and see the histograms, first for women. And you see it's got a really strong skewness there. And the same thing again for men, really strongly skewed. Then we get the normal Q-Q, or quantile-quantile, plots. And again, if it were normally distributed, all of these points would fall right on this line. It's strongly skewed, and so we have this really big bend in the data. The same is true for men. And here are the detrended plots, where the points should all be flat on that line; instead, you get this swoosh mark. And so it just confirms that we're not dealing with normally distributed data. And what you do have is this big collection of outliers in the box plots. I'm going to do one thing: I'm going to double click on this. And then I'm going to come right up to here. And this will turn off the data labels so we can get rid of the ID numbers. And you can see that we have a lot of outliers in both the men and the women. And there are no really obvious differences between the two groups.
And the spread versus level plot is something that you can use if you have multiple levels; it can help you select a kind of power transformation, a square root or a reciprocal or a square or something like that. But that's a more complicated topic and something for another day. And besides, it appears that we have relatively homogeneous variance in the two groups. And so we'd be good to go ahead and do our other analyses. And so those are some of the options in Explore. And that's where we'll end our discussion of basic statistics. But you can see how they can be used to see how well your data meet the assumptions of the procedures that you use, and then really, how well you can make inferences from your sample to other groups.
When you're working in SPSS and you're accessing data, one of the most important things you can do is to create labels and definitions for your data. I like to think of this as the statistical version of Alice in Wonderland and the caterpillar asking her to explain herself. You need to explain yourself or, more specifically, when it comes to your data, you need to tell SPSS what your data mean. Now, that is the data description. And I see two kinds of information that you tell SPSS about your data. The first one I'm going to call semiotics, which comes from the study of meaning. This is where you tell SPSS what the variable names are, the data types, the variable labels, the value labels, the missing values, the level of measurement, and the role that each variable plays. Contrasted with that, there are other elements that you can call aesthetics. And that addresses variable width, decimal places, column width, and alignment. And these are all settings within the data window of SPSS. One of the most important, though, at least for human consumption, is going to be the variable and value labels. And so I'm going to take a little time and talk about those.
With the variable names, those are the short names, the ones that you have there at the top of the column, there are some important rules. So, the rules for variable names. Number one, the names must be unique: no two variables can have the same name. That shouldn't be too surprising; it's an identifier. Rule number two, the names must start with a letter. I put an asterisk there because you can start with an at sign, a pound sign, or a dollar sign, but you don't want to, because those are generally reserved for special functions within SPSS. Rule number three, names can use letters, upper or lowercase, they can use numbers, and they can use the period, underscore, at sign, pound sign, and dollar sign. On the other hand, don't end with a period; that can cause confusion with the command terminator. And don't end with an underscore, because that's used for automatic variable names when SPSS is doing computations. Rule number four, names cannot include spaces. And rule number five, names must be no more than 64 bytes. In most text encoding systems, that's 64 characters. But if you're using a Unicode system, that might be only 32 characters. And the last rule, rule number six, is that the names cannot be any of these words: ALL, AND, BY, EQ, GE, GT, LE, LT, NE, NOT, OR, TO, or WITH, because those are all reserved keywords within SPSS. So don't create that confusion. So those are the short names that go at the top of a variable. On the other hand, with the label that you associate with that, you can give it a more descriptive name. Those are the variable labels. And so there are a few rules for those. Rule number one, they must be no more than 256 bytes. That actually means a label could be really long, although you don't usually want to do that, because some procedures will display as few as 40 bytes, 40 characters. And you really want to be able to read what it is. So you want to keep it short. But you can go longer if you need to. Rule number two, the labels must be enclosed in quotes.
Although I'll tell you, they need to be straight quotes, the vertical ones, and not the curly quotes, or SPSS chokes on those. Rule number three, labels can include any character, including spaces, which is something that you can't have in the variable name, but you can put it here. So that allows you to put labels that sort of float on top of the variable names. And those can show up in the variable lists, and they can show up in the charts and the output that you create. Another really important one is value labels. So you may have a variable called gender, and you may put in zeros and ones. But do you remember what those zeros and ones are? And so I'm going to show you some ways of dealing with that. The most important thing is to put value labels on there. So here are the rules for value labels. Rule number one, they must be less than 121 bytes. So that actually is really long. You generally want to keep your labels pretty short. Rule number two, like the variable labels, the value labels must be enclosed in quotes, and they need to be straight quotes and not curly quotes. Rule number three, labels can include any character, including spaces; that's good. This is an interesting one: rule number four, the value labels do not need to be unique. That is, more than one value can have the same label. So you might have the numbers one through nine. And it could be that seven, eight, and nine all say the same thing. But underneath they have different codes, and there are situations where you might want to do that. But mostly I want to show you how this works in SPSS. So just open up this syntax file. And this one's going to be a little different because we're actually not going to use a data file. I'll refer to one, but I mostly just want to show you the syntax. This syntax file shows how to write variable labels and value labels.
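As a rough way to keep the naming and labeling rules straight, they can be sketched as a small Python checker. This is not part of SPSS; the function names `name_problems` and `label_problems` are hypothetical. The sketch assumes plain ASCII letters, treats names as case-insensitive when checking for duplicates, and takes the byte limits as 64 for names, 256 for variable labels, and 120 for value labels. It also flags a leading @, # or $ as a problem, following the advice to avoid them.

```python
import re

# Words that the rules say cannot be used as variable names.
RESERVED = {"ALL", "AND", "BY", "EQ", "GE", "GT", "LE", "LT",
            "NE", "NOT", "OR", "TO", "WITH"}

# A letter first, then letters, digits, period, underscore, @, # or $.
NAME_PATTERN = re.compile(r"[A-Za-z][A-Za-z0-9._@#$]*")


def name_problems(name, existing=()):
    """List the rules that a proposed variable name breaks (empty if none)."""
    problems = []
    if name.upper() in {other.upper() for other in existing}:
        problems.append("not unique")
    if not NAME_PATTERN.fullmatch(name):
        problems.append("must start with a letter and use only allowed characters")
    if name.endswith((".", "_")):
        problems.append("should not end with a period or underscore")
    if len(name.encode("utf-8")) > 64:
        problems.append("longer than 64 bytes")
    if name.upper() in RESERVED:
        problems.append("reserved word")
    return problems


def label_problems(label, max_bytes):
    """Check a variable label (max_bytes=256) or value label (max_bytes=120)."""
    problems = []
    if any(ch in label for ch in "\u201c\u201d\u2018\u2019"):
        problems.append("contains curly quotes")
    if len(label.encode("utf-8")) > max_bytes:
        problems.append(f"longer than {max_bytes} bytes")
    return problems


print(name_problems("q01", existing=["q02", "q03"]))
print(name_problems("my var"))
print(name_problems("to"))
print(label_problems("R is male", max_bytes=256))
```

The byte counts use UTF-8 here, which is why a name made of non-English characters can hit the limit well before 64 characters.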
Now, you don't necessarily have to put them all broken down in lines. I do it because it makes them a lot more readable; it's a lot easier to see what's going on. The first thing is the command VARIABLE LABELS. Because it's an SPSS command, it's written in all capitals. And then what you do is you write the short name of the variable. And then you have at least one space, and then you have straight quotes, and then the long label. So here, for instance, I've got var01; that would be the first variable. And then this is its label written out. And you don't need to have anything after it. You don't need any commas or question marks or semicolons or anything. You just go to the next one. Now, I put it on another line because that makes it easy to follow. And I run them all through here. I'm going to make one important recommendation if you have a dichotomous variable, or binary one, that has only two possible values, and gender might fit into that category. Let me recommend this: that you code it as zeros and ones. A lot of people use ones and twos, but that gets confusing. Code it as zeros and ones, and name the variable after whatever the one is. Now, when it comes to male and female, I generally give ones to whichever group I think can have the higher score on my main outcome variable. So it'll switch around. But if for some reason I think that men are going to have a higher score on an outcome variable, then I will call it male. And then the label will be "R is male", with R for respondent. On the other hand, if I think women are going to have a higher score, then I will call the variable female and the label will be "R is female". I would obviously only use one of those two. Now here are some other examples. I tend to give generic names, such as var for variable, or really just Q for question: q01, q02. And I use the leading zeros so they sort properly in the dialog boxes.
And when you're done listing all of your variable names and the variable labels in quotes, just end with a period. It doesn't have to have a space before it; that's left over from earlier versions of SPSS, and it's a habit I have. So you can run this at any time and it will assign these labels to the variables, and then they'll show up in the data file, which is nice. Next are the value labels. And what you have here is first the command, VALUE LABELS, which is written in all caps, and then you give a list of variables to which the values apply. And you can list them out separately: var1, var2; here, I've got a var3 without a leading zero. And then if they're all next to each other, if they are adjacent, you can actually specify ranges: var3, then TO in capitals, then var10. So that'll be three, four, five, six, seven, eight, nine, 10. And then you just go to the next line, and you give the first value; that's a zero. Here I give zero equals no, and one equals yes. When you're done giving the values, you need to put a slash, so it knows you're done with the values for that variable. Then you can go on to the next variable. As I said, for instance, if I gave one on a gender variable to men, I would call it male. And so zero, which would mean no, they're not male, would be female. And one, yes, they are, or true, would be male. And I do a slash. On the other hand, if you coded it the other way, then you just call it female, and zero, which means no or false, means they're not female, they're male. One means they are female. Obviously, use just one of these. I do the slash. And then I could have a rating variable, say, for instance, what a lot of people call a Likert scale, just a rating scale. And I could do rate01 TO rate10. And I can specify every value. So this is a five point scale from strongly disagree to strongly agree.
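The layout described so far can be sketched with a short Python script that assembles the text of a VARIABLE LABELS command and a VALUE LABELS command. This is only an illustration of the pattern (name, space, quoted label; a slash between value-label blocks; a period at the very end). The helper names and the example variables `male`, `q01`, and `rate01 TO rate10` are hypothetical, and the author's actual syntax file may be laid out differently.

```python
def quote(text):
    """Wrap a label in straight double quotes, as the rules require."""
    if '"' in text:
        raise ValueError("this sketch does not handle quote marks inside labels")
    return '"' + text + '"'


def variable_labels(labels):
    """Build a VARIABLE LABELS command from {short_name: long_label}."""
    lines = ["VARIABLE LABELS"]
    lines += [f"  {name} {quote(label)}" for name, label in labels.items()]
    return "\n".join(lines) + "."          # a single period ends the command


def value_labels(blocks):
    """Build a VALUE LABELS command from [(variable_list, {value: label})]."""
    parts = []
    for variable_list, mapping in blocks:
        rows = [f"  {variable_list}"]
        rows += [f"    {value} {quote(label)}" for value, label in mapping.items()]
        parts.append("\n".join(rows))
    # A slash separates one block of variables from the next.
    return "VALUE LABELS\n" + "\n  /\n".join(parts) + "."


print(variable_labels({"male": "R is male", "q01": "First survey question"}))
print()
print(value_labels([
    ("male", {0: "Female", 1: "Male"}),
    ("rate01 TO rate10", {1: "Strongly disagree", 2: "Disagree", 3: "Neutral",
                          4: "Agree", 5: "Strongly agree"}),
]))
```

The printed text could be pasted into a syntax window and run, after swapping in the names from your own data file.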
Finish with a slash. Or maybe I have a different kind of scale; here at the end, I have scale01 TO scale02. That's an 11 point scale, but I only mark the two ends, the zero and the 10. So zero is never or not at all, and 10 is always or completely. And then to let SPSS know that I'm done specifying value labels, I end with a period. So this is actually a single sentence. And it's a way of telling it how you want the numbers to appear, both in the data window and in any output that you get. Finally, I'll mention something about missing values, because it can also be easier to specify these in syntax. The command is MISSING VALUES. And you just give the names of the variables, and you can use TO in the same way. And then in parentheses, you put the number that is assigned to missing values; 99 is common. So I've got that there. And then you can do a slash if you're going to use different codes after that. I could do male through female. And here I say two through high. And really what that means is anything other than a zero or a one is missing. So if I accidentally type in a seven, you know, it's missing. And then here I specify several different values: I can put seven, comma, eight, comma, nine. So if any of those show up, those would be considered missing values. Do what you want. The nice thing is it will exclude them automatically from analyses, but it will include them in frequencies when you're getting that output. Finish with the period. And then you just run these like you do any other command. And it's going to do a lot to clarify your data and make it easier to follow your analyses and reconstitute your work in the future.
When you're working in SPSS and you're trying to access data, you may get the idea of entering data. Well, let me tell you my thoughts. You want to enter data in SPSS? I just see it as an exercise in frustration. It's a pain to do it manually.
And I'd say maybe if you're entering 10 or 12 numbers, you know, basically nothing, something that's often referred to as a toy data set, maybe you could do that. Now, it's also possible to copy and paste data, but I'm going to say sort of, because it doesn't work really well. I'll show you that. It's much, much easier to just import the data from a CSV file or text file. And I'll show you how to do that in the next section. But in terms of entering data, let me show you how it works in SPSS. We'll just open up a blank document. And we'll try it. So here's a blank data window in SPSS. I can come right here and I can enter a number. And, you know, unfortunately, if I press tab, it actually goes down, which is an unusual behavior. And you see it gives it an automatic variable name, VAR00001. Well, if I want to move sideways, I actually need to use the right arrow key. So I'll go this way, two, three, and so on. And then I can hit return. And it goes down. I'll come back to here and I'll go four, five, six. I'll hit tab, and it comes back to the beginning. So it's not the most intuitive behavior. Plus you see it gives it these generic names. That's because you can't enter the variable name directly in this window. Instead, what you have to do is go to Variable View. You can also get there by just double clicking on the variable name. Here we go. And you can enter the variable name and you can change other things you want to change. It works, but it's a pain. I'm going to come back here to Data View. Now, I mentioned you can copy and paste data, sort of. So let me show you how this works. I'm actually going to go to a Google sheet that has nothing in it at the moment. And here I'm going to enter a few values of a few different kinds. I'll do 56, 43. And I'll enter a letter, J, and hit return. Okay. So there's some data. I've got two digit numbers. And I have letters, which will be string variables in SPSS. I'm going to copy those. And we'll see how well they paste over in SPSS.
So I'm going to go back there. I'll come over here to the side. And I will paste those in. And you see that the values came in and showed up with decimal places. And I can get rid of that. But it's really weird with the string variable, with the letters. And so you can copy it. Notice also, I can't copy in variable names; I still have to enter those manually. You can deal with those when you import. But really, this is a demonstration that SPSS is not a good environment for putting stuff in manually. Use a spreadsheet: use Google Sheets, use Numbers, use Excel, anything. Enter it there and then import it. I'll show you that in the next section. And you'll see that it's a much, much easier process.
The last thing I want to say in SPSS about accessing data is about importing data. And you know, compared to entering it manually, it just makes me feel like this. And I resort to cheesy clip art to show how happy I am. Because no doubt about it, importing is absolutely the best way to go if you want to get data into SPSS. Now, the nice thing is, SPSS can open text files, it can open CSV or comma separated value files, and even XLSX, that's Excel files, as long as they're formatted right. Now, what do I mean by formatted right? There's a term from Hadley Wickham in the R developer community, tidy data, and it's referring to something very specific. It says that your file should have only one sheet, so that's one worksheet, even though Excel can take more than that; that each column should be exactly equal to one variable; and that each row should be equal to one case. And an important thing is no funny stuff in your Excel sheet, because Excel is very flexible. And when I refer to funny stuff, I'm talking about things like macros and formulas and graphs and formatting and comments, or merged cells, or headers taking up their own rows, or duplicated row numbers. You don't want any of that. Basically, you want to treat it like a CSV file.
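One way to picture the "one column per variable, one row per case" idea is a quick check you could run on a CSV file before importing it. The following Python sketch is purely illustrative and is not something SPSS requires: the function name `tidy_problems` is hypothetical, and the sample values are made up for the example rather than taken from the course data.

```python
import csv
import io


def tidy_problems(csv_text):
    """Report simple layout problems in CSV text (empty list means it looks fine)."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    if not rows:
        return ["file is empty"]

    header, cases = rows[0], rows[1:]
    problems = []

    # The first row should hold one distinct, non-blank name per variable.
    if any(not name.strip() for name in header):
        problems.append("blank column name")
    if len(set(header)) != len(header):
        problems.append("duplicate column name")

    # Every later row is one case and needs one value per variable.
    for line_number, row in enumerate(cases, start=2):
        if len(row) != len(header):
            problems.append(
                f"line {line_number}: expected {len(header)} fields, found {len(row)}"
            )
    return problems


# Made-up values in the same shape as a month-by-composer table.
sample = "Month,Mozart,Beethoven,Bach\n2004-01,60,45,30\n2004-02,58,47\n"
print(tidy_problems(sample))
```

A check like this only catches layout problems; it cannot see things such as merged cells or formulas, which are lost or flattened when a spreadsheet is saved as CSV.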
And if you do that, then you find you can import it very easily into SPSS. And in fact, let me show you how this works. We're going to try this in SPSS, but I want you to do two things. First, I want you to download the course files, and that will include a zipped folder by this name that ends with data sets. That's going to have three files inside it. I'll show you those in just a second. And then you can also open up this syntax file that will work with them. But let's go see what's inside the folder and explain a little bit what's going to happen here. The folder that I've asked you to download contains three different files. Now, I have both the folder here, and I have the three files saved separately next to it. Normally they would be inside it, but for the syntax to work properly, you want them sitting separately on the desktop. All three of them contain the same data. The name says MBB, which stands for Mozart, Beethoven, and Bach, because this is Google Trends data about the popularity of searches for each of these three composers' names since 2004. This first one is in CSV, or comma separated value, format. The second one is a plain text file, and it's tab separated. And the third one is an XLSX file, so it's an Excel sheet. And you can see it's the same data, but it appears a little bit differently when I do the quick view here on my Macintosh. What we're going to do then is open up the syntax file. And we're going to see what we need to do to import each of these. Now, I've saved the syntax, but the fact is, it's easier to do this stuff through the menus. Now I give some information here about using the file path. In each of these syntax commands, I have to specify the file location. Now, this is the format if you're on a Macintosh, like I am. Of course, you'll want to change Bart to be the name of your home directory.
If you're on a Windows computer, you're going to need to change it to something a little more like this, or possibly, depending on the version of your operating system, using backslashes instead. Anyhow, I'm going to show you how to import each of these. And I've got the duplicate information here in the script, in case you want to run it that way. But it's actually really easy to do it from the menus. So here's what I'm going to do. I'm going to come up to my data window; I'll just click over to that. And my data window is empty right now. I'm going to go to File, Open, and Data. You do that if you're opening an existing SPSS file, or if you're importing something in a different format. Now, here I'm on the desktop. You can see my folder there, but you can't see the three data files I have next to it, because right now it's only going to display files that are in .sav, that's the SPSS proprietary data format. I'm going to click on that, and come way down here. And we'll start with the text file, the .txt version. So I'm going to hit that. And now you can see that it's there. I'll select that file, and I'll click Open. So now I have the SPSS Text Import Wizard. And we can scroll through most of this pretty quickly. It asks if the file matches a predefined format, something that I would have saved somewhere else. It doesn't. It asks if the values are delimited. Yes, they're delimited, by tabs in this case. Are the variable names included at the top of the file? You see how they show up here as the first row. Well, I click Yes. And now it excludes those, because it knows that they need to be in the header of the data file. Hit Continue. Each line represents a case. I want all of the cases. You could sample from it if you had a very large data set, and it would allow you to do exploratory analysis more quickly than you could otherwise. It asks what delimiters appear.
Now by default, a text file, the one that I have, uses tabs, and it knows that. It asks about text qualifiers. I don't have text qualifiers in here. So I just hit Continue; I don't have to change anything. Now I have dates here at the beginning, and they are year, dash, month. Now, SPSS can handle dates. However, it doesn't like the fact that I'm using year and month without the day associated with it. Consequently, I'm going to leave it just as a string variable, as a text variable. And it still works properly in any analyses I want to do. So that's fine. I'm just going to hit Continue, not changing anything here. It asks if I'd like to save the file format for future use. That's the thing I was referring to in the first dialog here. And it asks if I want to paste the syntax. I could do that, but I've already got it pasted. I'm just going to hit Done. And there it is: it's opened it up and it's formatted properly. If we go to Variable View, you can see it's got a string variable and three numeric variables. It has the proper number of digits and the proper number of decimal places, which is none. And it recognizes them as nominal, which actually is not the case. So I actually need to come here and change that to a scale variable, because the data that you get from Google Trends is sort of zero to 100 percentages in terms of the relative popularity of search terms. So I change that to scale. And otherwise, I'm good to go. Now let's do the same thing, but with a CSV file. To do that, I'm just going to get rid of this data file. I'll just open up a new one. There we go. I'll come back up to File, Open, Data. This time, I need to tell it I'm looking for a CSV. But if you remember, that's actually under text. So I click here. Except this time, instead of selecting the .txt file, I'll select the .csv file. And what you find is that the procedure is almost identical. There's only one super tiny change here. I hit Continue. I tell it the variable names are at the top. It is delimited.
It needs to know that each line is a case. I just hit Continue on all of this. Here's the one difference. When I did the text file, tab was automatically selected. Now that I'm doing a CSV, which means comma separated values, comma is automatically selected. I hit Continue. It does the same thing with months; we're going to leave it as a string. I hit Continue. And I can hit Done. And you see, it looks exactly the same. I do have the same issue, though, that these three numbers, which go from zero to 100, are coded as nominal, and I need to change them manually to scale. And now we'll do the third one, an Excel file. Now in a lot of programs, you get very stern warnings about importing Excel files. And there are good reasons for that, because Excel files are very flexible and people can put a lot of stuff in there, again, comments and changing column widths and merging cells, that makes it easy to use Excel just for displaying information. But if you're importing, you don't want any of that. Fortunately, I have it set up as tidy data already. Columns are the same as variables, rows are the same as cases, and there's nothing else in there. And so what I can do in this case is come to File, Open, and we'll go to Data again. And this time I come down to this one; it actually has Excel file as a format. There it is. I'll hit Open and you'll see that the dialog is different in this case. It says Opening Excel Data Source instead of the Text Import Wizard. It says read the variable names from the first row; that's checked by default. It knows how many rows of data I have. And it's got this thing about maximum width. I don't need to worry about that; I just hit OK. And that was that. Here's the data from Excel. It's the same data, and I still need to change these three measures manually. You could save this information in syntax if you're going to be doing it many times over. But that is sufficient for what I need. And so it turns out that importing information into SPSS is really easy.
And it's massively more efficient and easier to do than entering it directly. You do it in a spreadsheet; especially if you do it in Google Sheets when you're entering stuff manually, you can collaborate on it. And then you save it as a CSV file and you pop it in there. And then you can get straight to your analysis. And that is the point of all this work anyhow.
And now in SPSS: An Introduction, we get to the part that maybe you were waiting for, and that's analyzing data. I'll mention, however, that I'm going to give only a very small overview of analyzing data, because we have an entire separate course here for data analysis and also data visualization in SPSS. And I recommend that you check those out. But as a taste of what's available, we'll talk about a procedure that's of interest to a lot of people in applied settings, and that's hierarchical clustering. Now, the idea here is that you're trying to find the clusters in your data. More specifically, what you're trying to see is whether similar cases cluster together in some way that you can use to make inferences about them. The trick, however, is that similarity depends on your criteria. And there are a few decisions that you have to make when you're doing a cluster analysis of any kind. So for instance, you have to decide whether you're going to do a hierarchical cluster analysis, which goes from one group to as many groups as you have cases, or whether you're going to use a set k, or set number of clusters. You also have to decide on the measure of distance that you're going to use. Euclidean distance, which is sort of like measuring the as-the-crow-flies distance between cases, is very common, as is squared Euclidean distance, which is what SPSS uses. There's also the question of whether you want to start with everything together and split it up in a divisive procedure, or start with everything separate and put it together in an agglomerative procedure.
By default, some programs do divisive. But by default, SPSS does agglomerative. You basically end up with the same general findings anyhow, so it's really not a huge difference. So we're going to do a cluster analysis, but we're going to try to keep it simple. We're going to use some of the most basic methods for doing this. We'll use Euclidean distance, or squared Euclidean distance in this case. We'll use hierarchical clustering, where we don't have to choose the number of groups ahead of time. And we're going to use an agglomerative procedure, where it starts with every case separate and then gradually puts them together. We'll try this in SPSS. But I need you to do something first. There is a folder that you can download from the course files that ends with data here. And in it, there's one file. It's cars.sav, where .sav is the proprietary SPSS data format. And in addition to that, there is the SPS syntax file, and you'll want both of those for this demonstration. If you save the data file to your desktop, it looks like this; you can just double click on it and it will open up in SPSS. You also have the option of using syntax to do that. It depends on your operating system. This is for a Macintosh right here. And this is for a Windows computer, though you may need to use backslashes instead, depending on your version of Windows. I'm just going to go back and double click on this to open it up in SPSS. And there's my data set. What this data set is, is a slight variation on a data set called mtcars. That's in the default datasets package in R. It contains road test data on a number of cars from 1974, from the magazine Motor Trend. And what we're going to do is we're going to look at this information, and we're going to see whether the cars cluster together in some important way. I'll go to the Data View here and you can see we have the Mazda RX4, Hornet Sportabout, Mercedes 450SE, Lincoln Continental and so on, cars that were all available in the early 70s.
And we have information about miles per gallon. We have the cylinders. We have the displacement in cubic inches, horsepower, weight in tons, the time in seconds for the standing quarter mile, whether it's an automatic or a manual transmission, the number of gears in the transmission, and the number of carburetors, or probably carburetor barrels here. I'm going to turn on the labels; only one variable changes here. By the way, one of the things I did is I formatted this for SPSS by adding labels and changing some of the decimals, and that makes it a little easier to work with in the program. But let's go to the syntax file right now. Once we have the data open, we want to do a default hierarchical clustering. Now this is the code to produce it right here. But I'm going to do it with the drop down menus to show you that it's really not hard to do. All we need to do is come up to Analyze. And then we come down to Classify. Now I have to admit, off the top of my head, I cannot remember if every version of SPSS has this particular menu. Most will; I hope yours does, so you can follow along with this. Hierarchical Cluster: I'm going to click on that. And what I'm going to do here is I'm going to take car name, which really just says what the cars are. And I'm going to use that to label cases, because that's going to mean something to me. And I'm going to take all of my other variables, we'll just do a little shift click here, and put them over here. And at this moment, I'm going to change nothing else. You'll see that it's going to cluster cases. That's what we want. It's going to give us both some statistics and some plots. That's fine. I'm going to hit OK. And we're going to get a result identical to my first syntax command. I'll make the output window bigger here. And here's what we have. First off, it tells us how many cases there were; there were 32 and they all had complete data, which is nice.
Then SPSS gives us something kind of unusual called an agglomeration schedule. And it really specifies at what point in the procedure two cases got put into the same cluster. I personally don't have much use for this, except I know that when there is a big jump in the coefficients, as there is here from 3 to 26, there is a very distinct category change, as there is from 660 to 1125 and so on. Most of the time, though, I would just completely ignore this one. And this, this is called an icicle plot. And it shows sort of the same information about when various cases got dropped in with everything else. It's kind of pretty to look at. I find it kind of meaningless. And so truthfully, the default output for SPSS's hierarchical clustering is, to me, not very helpful. In fact, it's so unhelpful, I'm just going to delete it all. And I'm going to do this over again. I'm going to come back up to my recent menu items. And I'm going to go to this analysis again, and I'm going to make a couple of changes. I don't want the agglomeration schedule; that doesn't really help me. And for plots, I'm going to get rid of the icicle plot. And I'm going to get a dendrogram instead. Dendrogram, that means branches in Greek. So it's a graph of the branches. And this is usually the most important thing you can get out of a hierarchical cluster analysis. I'll hit OK. And now what we have is a chart here that lists all the cases, the cars, on the side, and it shows how they group together. So we see, for instance, that these first four cars, the Mazda RX4 and the wagon and the Mercedes 280 and 280C, are very similar to one another. They all go here together. We see some others as we come down here. So for instance, the Cadillac Fleetwood, the Lincoln Continental and the Chrysler Imperial, which are all gargantuan American cars with big V8s, they all go there together. And then we see down here at the bottom that this one, the Maserati Bora, is all by itself for a very long time.
This is where cases are individual, here on the left, and they gradually get put together. And you see how they come together in each of these branches. That's why it's called a dendrogram. And so this is a really nice way of seeing how similar your cases are. And if you have more pixels displayed, you can see the entire graph at once; I've got a low resolution right here. And you can see that maybe it makes sense to split this off into, say, four groups. It looks like we've got a distinct group right here, right there, right there, and right there. And so I can do something else with this. I'm going to come back to the menu here. And what I'm going to do is I'm going to save group membership. Now, I've done a hierarchical analysis, so I didn't have to specify the number of groups. But now that I've looked at the chart, four seems like a good number. So I'm going to come here and say, give me the group membership for each case, breaking it down into four clusters. I'll hit Continue. And I'm going to ask for it to not give me any plots. I hit OK. And this time, we're not going to get any output except to say that it did the work. Let's just get that. Here it says that it processed them. The place where we're going to see the difference is in the data file. So I'm going to move over to the data file. This button, by the way, will get me over to the data. And now you can see I have a new variable that got added here for the four clusters. And you can see that each of the cars is listed in one of these four clusters. And what you can do then is take these cluster memberships and compare them on the other variables. Again, remember, the clustering here is only as valid as the data that we give it. So it's only comparing these cars on a small number of variables. And it's using that to decide what goes with what. It's here, for instance, that you see the Maserati Bora is in a category all by itself.
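To show what an agglomerative procedure with squared Euclidean distance is doing under the hood, here is a minimal Python sketch. It is not the SPSS implementation: it assumes average (between-groups) linkage, uses raw unstandardized values, and works on five made-up cars with two variables each. The function names and the numbers are hypothetical.

```python
def squared_euclidean(a, b):
    """Sum of squared differences between two cases."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def average_linkage(cluster_a, cluster_b, points):
    """Mean squared distance over every pair with one case from each cluster."""
    total = sum(squared_euclidean(points[i], points[j])
                for i in cluster_a for j in cluster_b)
    return total / (len(cluster_a) * len(cluster_b))


def agglomerate(points, n_clusters):
    """Merge the two closest clusters until n_clusters remain; return memberships."""
    clusters = [[i] for i in range(len(points))]   # every case starts alone
    while len(clusters) > n_clusters:
        _, a, b = min(
            (average_linkage(clusters[a], clusters[b], points), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        clusters[a] += clusters[b]
        del clusters[b]

    membership = [0] * len(points)
    for label, members in enumerate(clusters, start=1):
        for i in members:
            membership[i] = label
    return membership


# Made-up (miles per gallon, horsepower) values for five hypothetical cars.
cars = {
    "Small A": (30, 65), "Small B": (32, 66),
    "Big A": (10, 215), "Big B": (11, 205),
    "Exotic": (15, 335),
}
labels = agglomerate(list(cars.values()), n_clusters=3)
print(dict(zip(cars, labels)))
# → {'Small A': 1, 'Small B': 1, 'Big A': 2, 'Big B': 2, 'Exotic': 3}
```

Because the values are not standardized here, the variable with the biggest numbers (horsepower in this toy example) dominates the distances, which is one more reason the clustering is only as good as the data and scaling you give it.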
And this is a neat way of looking at the similarity between items. You can do it with people, if you're doing market research; you can do it with companies, if you're doing some sort of segmentation. And it allows you to see which groups have important similarities for your purposes, and which groups you need to treat differently from one another. That's the goal of hierarchical cluster analysis. And you find it's a very easy thing to do in SPSS.
Another important procedure in SPSS when you're analyzing data is something called factor analysis. Now I like to think of it as looking at your data and trying to find shadows. In this picture, what you have are shadows; those are the black figures that you see. It takes a moment to figure out that you're looking down and they're actually people, kind of sticking straight out. And so in this photo, what you're going from is sort of a three dimensional origin, that's the person itself, to a two dimensional variation with the shadow. What's interesting about that is you maintain most of the useful data. You can tell that they're people, that they're walking; you can probably even tell some things about how tall they are, what they're wearing and so on. What you've done is you've made things more efficient. Now in the data world, that's called dimensionality reduction, where each variable is a dimension. And too many variables can actually be really problematic. You're trying to boil things down a little bit. And you can think about the saying less is more, less equals more. More specifically, less noise and fewer unhelpful variables in your data set equal more meaning, because that's what you're trying to do. You're trying to extract meaning. Now, when it comes to factor analysis and related techniques, I have one very important piece of advice, and that is to be practical. At all points, you want to remember what your goal is. So what is the goal?
Well, the goal of factor analysis, I'll tell you what it's not. It's not an exercise in analytical purity. You're not there to show that you know how to go through all the steps in the approved format. Really, you're working with your data because you're trying to get some understanding. So the goal of a procedure like factor analysis is useful insight. Try to follow the rules. Do what you can to make sure you don't make any obvious mistakes. But remember, you're not bound by the mathematics, you're bound by what the data tells you about the people. Another way of looking at that is to use factor analysis, or really any other procedure, for its heuristic value. That is, it suggests possibilities to you as you analyze the data, as you're trying to get insight into people. Now, that's sort of a philosophical digression. Let me show you how this actually works in SPSS. You're going to need to download from the course files a folder that says data here at the end. And from it, the cars.sav data set; this is the one that we used in hierarchical clustering as well. And then you want to open up the SPSS syntax file that goes with this particular section. Now the easiest way to open the data set is simply to double click on it and you'll be ready to go. I do have some syntax you can use if you've saved it to your desktop. I've got it open already. So let's take a quick look at the data set. We have a collection of cars listed down the side and attributes like miles per gallon and so on, and gears and the transmission and carburetors. That's great. Now I have to make a very important confession here. This is a very, very small data set for factor analysis. It only has nine variables other than the identifier and it only has 32 cases. Really, you would want to have at least several hundred cases and, let's say, several dozen variables before you can do this really reliably.
But this example works, and it actually is really easy to see how it's happening and how to interpret the results. The first thing we're going to do, if you look at the syntax, is a default factor analysis. And that's actually a misnomer, because it's not a factor analysis. It's principal components analysis. But it's in the factor analysis command within SPSS. So let's come up here to analyze and down to dimension reduction. Remember, I said that's what this is called. We'll pick factor; it's our only choice there. And what we need to do is choose the variables that we're going to use to see what we can compress, what goes into what. We don't need the name of the car. That's just an identifier. We can take the rest of these, however, and we can put them under variables. Now, we've got a lot of options here. I'm not going to do any of them. I'm just going to hit okay for right now. I'll make the output window bigger. And here's what we get from the default analysis. We get a text output of the commands that were generated by the dropdown menus. We get something called communalities. Each variable brings with it one unit of standardized variance. That's based on how spread out the scores are, and if you standardize them, then you have a variance and a standard deviation of one for each. And the extraction column tells us how much of that variance can be reconstituted through the process that we're doing. An important one right here is the total variance explained, because what this has done is it has created components. Remember, I said this is actually a principal components analysis here. It has profoundly different philosophical underpinnings from factor analysis; the difference has to do with which came first, the factors or the observed variables. But truthfully, most people treat them as relatively interchangeable. And if you're using them for heuristic value, it's not going to be a big difference.
But what we have here are two components. We have one with 5.472 units of variance; that's 61% of the original variance of the nine variables. And then another one with 2.341. I'm getting those numbers from right here. And you can see it held onto these two, which collectively add up to about 87% of the variance. Now the component matrix shows the relationship between the original variables and the two components; these are like correlation coefficients. You can see that miles per gallon is strongly negatively associated with the first component and really not associated with the second, but number of carburetors has a pretty strong association with each. And so that's a way to start to look at it. But it's going to be a lot easier if we make certain modifications to this. In fact, I'm going to just delete this output right here, and we're going to start over. I'm going to make a few changes. Let's go through each of these options. First, we go to descriptives. I don't really feel like I need the initial solution, so I'm going to unselect that. I'll hit continue. Extraction: this is the actual algorithm that SPSS uses to work through the relationships in the multidimensional space. You'll see right here, it's principal components. That's why I said this is really a principal components analysis. You've got a lot of options here. Now, in many situations, maximum likelihood would be a very good answer. I'm going to choose principal axis factoring, simply because it's the classical version of factor analysis. I don't need to see the unrotated factor solution, but I do want to see something called a scree plot. That is a graph that shows me maybe how many factors I should keep. I'm going to come down here and change the maximum iterations for convergence; that has to do with the math that's done. I'm going to change it to 50. Then I'm going to come to rotation. What you get here is a multidimensional space.
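The percentages quoted for the two components follow directly from the numbers on screen. Since each of the nine standardized variables contributes one unit of variance, the total is 9, and each component's share is its variance divided by 9. A small Python sketch of that arithmetic (the variable names are just illustrative):

```python
# Each of the nine standardized variables brings one unit of variance.
n_variables = 9

# Variance captured by the two retained components, as read from the output.
component_variance = [5.472, 2.341]

# Share of total variance per component, and for both together.
proportions = [v / n_variables for v in component_variance]
cumulative = sum(proportions)

for number, share in enumerate(proportions, start=1):
    print(f"Component {number}: {share:.1%} of the variance")
print(f"Both components together: {cumulative:.1%}")
```
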
And sometimes it's a little easier if you rotate the axes; it can increase interpretability. Now, there are a lot of different methods. Varimax is a method that maintains orthogonal relationships; that makes all of your axes perpendicular to each other. There are situations where that's really good. But truthfully, for exploratory purposes, which is what we're doing, I like to use what's called an oblique rotation. That allows your factors to be correlated with each other; they don't have to be totally perpendicular. I'm going to use direct oblimin. Promax is another really good choice, but it usually is for larger data sets, and I've got a tiny one here. Now here, I can get a rotated solution. I don't think I really need that. But I do want to see the loading plot. And I'm going to increase the maximum number of iterations to 50. I'll hit continue. We'll come down to scores. You can save the factor scores as variables, and there might be situations where you want to do that. But because I'm using factor analysis for its heuristic value, as a way of suggesting what variables go with others, I'm actually not going to do that. So I'm going to hit cancel. And then finally options. This is where you get to talk about excluding cases. I have a complete data set, so I don't need to worry about that. But there's the coefficient display format. I'm going to sort the coefficients, and then I'm actually going to have it completely suppress small coefficients. Now I've done this one before, so I happen to know that a value of .6 works. Under normal circumstances, that's really high. But given my very small data set, this seems like a reasonable choice, and it makes the solution very, very clear when we look at it. So I'm going to hit continue. And then there I'm going to hit okay. I've got my output here. And the first part's pretty similar, except it doesn't start with unit variance for each of these.
That's because I'm not doing principal components anymore, I'm doing principal axis factoring. And so the math behind it's a little bit different, but we don't need to dwell on that one. Total variance explained: you see that we still have two factors. The first one accounts for a lot of the variance, and the second one accounts for a fair amount also. And these are very close to what we had with the principal components. The scree plot is a very simple line plot that suggests how many factors we might want to keep. Now there are several different rules you can use for interpreting this. One is to keep anything that's above a value of one, because one is what it would be if a factor explained simply one unit of variance, and that's what a single variable brought with it. You want factors that can explain more than that. And you see we have two that do a lot more than one; these others are sort of straggling down. The other rule is to look for a bend in the line. And you do see a strong bend right here. So three is where the bend is, and we're justified in staying with two. There are other methods that get more involved, about checking for the slope of this line and finding things that are above that slope. You can do those at another time. This is a quick demonstration.
For a final look at SPSS and analyzing data, at least in this brief overview course, let's take a look at one of the most useful procedures around: regression. Now, you might think of regression as sort of the statistical version of the three musketeers, where it's all for one. I say that because all for one is actually all variables for predicting one outcome. Put another way, regression uses many different variables, many predictor variables, to predict scores on one outcome variable. This makes it useful in a huge range of circumstances, especially because there's something for everyone with regression.
There are many different versions of it and many adaptations of regression that make it truly flexible and powerful when analyzing data, and make it a go-to tool for almost any analytical purpose you might have. We'll try a simple version of this in SPSS. First, make sure you've downloaded this data folder from the course files, and we'll use the cars.sav data set that we've used in our two previous examples, along with this syntax file. Now, when you get to the syntax file, it begins as usual with the code for loading the data set from the desktop. Truthfully, it's easier to just double click on the file cars.sav and have it open up directly in SPSS. That's what I've done here. And you can see it's the same data set with 32 rows of data, a bunch of cars from 1974, and several variables. What we're going to try to predict in this one is miles per gallon, based on things like the number of cylinders, the displacement, horsepower, weight, quarter-mile time, transmission kind, and gears and carburetors. All right. So that should be pretty easy. What we're going to do is go to analyze and come down to regression. And we'll use the second option here, linear. That's just basic linear regression. Now we need to put under dependent the outcome variable, the thing we're trying to predict. It kind of bugs me here, because independent and dependent really should be reserved for manipulated experiments, but we still know what they mean. Our outcome variable, the thing that we're trying to predict, goes here in dependent. So that's miles per gallon. And then we can take everything else except car name, that's just a label. We'll take all the rest of these and we'll put them under our independents, or the variables that we are using to predict the outcome. Now I want to do the totally default, no extra steps version first. So I've put the variables in their respective places and I'll just hit okay.
And now we get our output. It tells us first the code that was used to produce this analysis, and that it used all of these variables simultaneously to predict a single outcome, which is listed down here, and they were entered at once. The model summary tells us that we have a multiple correlation of these predictor variables with our outcome variable of 0.931, which is really, really high. If you square that to get the proportion of variance explained, it's 86.7%. Even the adjusted R squared, which matters because we have a small sample, is still 82%. It's huge. We get a significance test right here. We are not surprised to see that the significance is 0.000. It's not zeros all the way through, but it's highly significant. And then we get the individual regression coefficients. What we're looking for here are significance levels that are under .05. And interestingly, only one of them in this collection is under .05, and that's weight in tons. None of the others are even close. That doesn't mean that none of the others matter. It simply means that when you take all of the variables together at the same time, when they are taken as a whole, really only one of them deviates significantly from zero to become a predictor, and that's weight. Now, there are a lot of other ways of doing regression, and SPSS gives you a lot of choices. I'm going to come up here, back to analyze, down to regression. Now I will mention, there's a really interesting one here called automatic linear modeling. This is an SPSS function that came in a few versions ago. It does a lot of automatic data prep; it does a lot of combining and splitting up of variables. On the other hand, it's really kind of difficult to explain how it all works and then to interpret it properly. And I'm going to save that for another course where I specifically talk about analyzing data.
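The three model summary numbers are linked by simple arithmetic, sketched here in Python. The adjustment uses the standard formula, adjusted R² = 1 − (1 − R²)(n − 1)/(n − p − 1). The sample size of 32 comes from the data set; the count of 8 predictors is an assumption based on the predictors named in this walkthrough.

```python
# Multiple correlation reported in the model summary.
multiple_r = 0.931

# Assumptions for this sketch: 32 cars, 8 predictors entered at once.
n_cases = 32
n_predictors = 8

# Squaring R gives the proportion of variance explained.
r_squared = multiple_r ** 2

# Adjusted R squared penalizes for the number of predictors
# relative to the sample size.
adjusted_r_squared = 1 - (1 - r_squared) * (n_cases - 1) / (n_cases - n_predictors - 1)

print(f"R squared: {r_squared:.3f}")
print(f"Adjusted R squared: {adjusted_r_squared:.3f}")
```

With a small sample and many predictors the penalty is noticeable, which is why the adjusted figure is several points lower than the raw one.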
For now, I'm going to go back to linear, and we're going to take some of the options that SPSS makes available. Now the first one I'm going to do, at the risk of doing something very controversial, is I'm actually going to go from simultaneous entry to stepwise regression. This is controversial because some people in the literature have called it positively diabolical in its risk of a type one, or false positive, error. And there's good evidence for that. On the other hand, in modern machine learning, stepwise procedures have been very fruitfully used. And so it's not totally unacceptable to try it, especially when we're doing sort of an exploratory project like this right now. You certainly wouldn't want to use it for rigorous model building, but it's a nice way to get some insight into the data pretty quickly. I'll come up to statistics, and I'm going to add a few things. I'm going to get confidence intervals for the coefficients. Those are nice to have. We have the overall model fit, and I'm going to get the R squared change, because a stepwise model goes through several different steps adding variables, and we want to see if each variable adds something that is statistically significant to the overall model. We could get a lot more information here, but I'll leave it there for now. Under plots, we can get a ton of different plots, but I'm actually just going to come down here and choose the standardized residual plots, a histogram and a normal probability plot. Now there are other options as well. I could save about 15 different kinds of scores to the data set. I can save unstandardized predicted values, I can save studentized deleted residuals, and so on and so forth. There are lots of things I could do here, and there are situations in which I might want to do those.
But for right now, I'm going to skip them, because I'm simply trying to build a model without necessarily saving all of the steps in between. Options really just talks about the criteria used in the stepwise procedure. I'm going to leave it at the default right now, but you could change it if you wanted to. And then style is a new thing that has to do with the formatting of the table. I'm going to leave that one alone for right now, because we're going to have exactly what we need. Now I've created this already, and I've saved it in the syntax. I'm just going to hit okay. And you'll see that we get a different kind of output right now. I'll zoom in on this. Now what we have is some code that's a little bit longer. This has to go through the variables one at a time, find the predictor variable that is most strongly associated with the outcome, put it in the model, get partial correlations, and go through step after step. What we find here is that, although we had nine predictors originally, only two of them were statistically significant when put into the model. They were weight and number of cylinders. Again, what we're trying to predict is gas mileage, miles per gallon. If you come down here, you can see that they were both statistically significant, where the adjusted R squared for just weight is 74.5%. And when you add on number of cylinders, it goes up, not a huge amount, but it goes up almost 8%. The analysis of variance table lets us know that both of these models, with just one variable and with two predictor variables, are statistically significant. Here are the individual coefficients, along with their confidence intervals over here on the right side. Now, because we've gone through a stepwise procedure, it's not surprising that all of these are statistically significant, because that was the criterion used for including them. Here we have a list of excluded variables, along with their collinearity statistics.
And this has to do with how much each of these variables is correlated with the others. So for instance, number of carburetors is highly collinear with, or easily predicted by, the other variables that we could have included in the model. And then we come down to the residuals. I'm going to look specifically at the charts. In an ideal world, your residuals are normally distributed, which means they're just as likely to be high as they are low, and they're symmetrical. And we see here that they're not horribly, pathologically far from normal. So this is probably a good model in this set. And here is a normal P-P, or probability-probability, plot of the same data. If it were perfectly normal, all the dots would be on the line, the diagonal line. They're close. These are the 32 individual observations and how far off they are, and they're close enough. And so this lets us know that our model is predicting really well, and it appears to be not biased in one direction or another. So this is one method of developing a model. Again, the stepwise procedure is best for exploratory analysis; it's not something you would use for confirming a finding. But as a quick way of sifting through a large collection of potential variables, this is a nice way to do it. It lets us know, for instance, that in this particular data set miles per gallon is predicted primarily by weight, which completely makes sense about the car, and number of cylinders, which is associated with having a large and thirsty engine. So the general idea of multiple regression, again, is to use many variables to predict a single outcome. SPSS gives a lot of options for those. We've looked at the default, and we looked at one variation on it. But there's a lot more that you can explore, and that we will cover in another course on statistical analysis in SPSS.
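The step-by-step logic described above (add the strongest predictor, then see whether any other predictor adds enough to be worth keeping) can be sketched in a few lines of Python. This is a simplified stand-in, not the SPSS procedure: it only goes forward, and it uses a minimum gain in R squared as the entry rule instead of a significance test. The data values, the function names, and the `min_gain` threshold are all made up for illustration.

```python
def solve_linear_system(a, b):
    """Solve a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):  # back substitution
        tail = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - tail) / m[i][i]
    return x


def r_squared(columns, y):
    """R squared of an ordinary least squares fit with an intercept."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(k)]
    beta = solve_linear_system(xtx, xty)  # normal equations
    fitted = [sum(b * v for b, v in zip(beta, r)) for r in rows]
    mean_y = sum(y) / len(y)
    ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return 1.0 - ss_res / ss_tot


def forward_select(predictors, y, min_gain=0.05):
    """Greedily add the predictor that raises R squared the most."""
    selected, best_r2 = [], 0.0
    remaining = list(predictors)
    while remaining:
        scores = {
            name: r_squared([predictors[n] for n in selected + [name]], y)
            for name in remaining
        }
        candidate = max(scores, key=scores.get)
        if scores[candidate] - best_r2 < min_gain:
            break  # nothing left adds enough to be worth including
        selected.append(candidate)
        remaining.remove(candidate)
        best_r2 = scores[candidate]
    return selected, best_r2


# Hypothetical toy data, not the course's cars data set.
mpg = [30, 27, 25, 20, 18, 15]
predictors = {
    "weight": [1, 2, 3, 4, 5, 6],
    "cylinders": [4, 4, 6, 6, 8, 8],
    "noise": [1, 0, 1, 0, 1, 0],
}
selected, final_r2 = forward_select(predictors, mpg)
print(selected, round(final_r2, 3))
```

In this toy data, weight alone explains nearly everything, so the loop stops after one step. With an entry rule based on significance, as in SPSS, the stopping point can differ.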
But for now, I encourage you to take some time and look at some of these options, and see what kind of useful insight they can give you into your own data and your own analyses.
I want to thank you for joining me in SPSS: An Introduction. And we'll conclude by giving you some next steps, things that you can do next, because, you know, once you get through this, it can be a little confusing. It can feel like things are going everywhere, and it may not be totally clear where you should go. Well, here at datalab.cc, we've got a few opportunities for you. First, of course, is more SPSS. We have additional courses on data preparation, on data visualization, on statistical analysis, and on other topics that you can use to expand what you've learned in this introductory course and work on your own data. Now, if you've liked what you've learned with SPSS, you may want to try branching out to some other languages. The statistical programming language R and the general purpose programming language Python are very common, powerful tools in the data science community and in analytics in general. They're a great way to expand both the things that you can do with your analyses and your employment opportunities. And so I strongly encourage you to take a look at the courses on R and Python at datalab. Next, we have specific courses on data visualization, one of the most important things you can do in getting to understand your data. SPSS can work well for that, as can other programs. And then I'm going to mention one final thing here. SPSS is a wonderful program, but it still has a fair number of bugs. And it can also be very expensive. Fortunately, some really interesting recent work in the open source community has developed a program called JASP, actually pronounced "jasp", which is sort of an open source version of SPSS. It runs very differently. I find it very easy to use. It makes your work reproducible, and it makes it easy to share.
It's got some tremendous advantages. And we have courses on JASP here at datalab. I suggest you check those out and see how well that program is able to fulfill some of your computing needs. That being said, there are some things missing. What's missing exactly? Well, SPSS doesn't have a really strong and active user and developer community the same way that languages like R and Python do. But if you're creative, you can get around that. Academic conferences, meaning specifically topical academic conferences in fields like biology or management or the social sciences, often have very dedicated SPSS users and teachers, and may sponsor specific hands-on workshops for learning more about SPSS and how you can use it within your particular domain. But no matter what you do, I'm going to encourage you to simply get started, go exploring, and see what you can do with SPSS in your own data work. Thanks so much for joining me, and happy computing.