On this course, you need to do data analysis, and for that you need to pick one statistical software package. You can use whichever one you want. We have used three different software packages on the course: SPSS, Stata, and R. These are fairly different, but you can complete the course fully using any of them. I have very strong opinions on which of these you should use if you want to be a professional researcher. But let's first take a look at what statistical software is and how it differs from Excel.

In Excel, your data and your analysis live in one workbook. Some of the cells have data, and if you do calculations, those calculations go into different cells, or maybe into a different sheet within the same file. All calculation results also appear in those same cells. So you have data, analysis specification, and results in the same file. It's not very easy for anyone who has not built the sheet themselves to understand the logical sequence behind the analysis. If you calculate the mean and then you calculate the standard deviation, it's not clear from looking at the Excel sheet which one was calculated first. In some cases it doesn't matter; in some cases it does.

Statistical software is a different kind of tool. It also has data, an analysis specification, and results, but these are typically in three separate files. The data file is something that you never edit. Your data file is what you have, and you leave it as it is. Then the analysis file lists the sequence of operations, commands, or analyses that you apply to the data.
It's basically a text file or a document that you read from top to bottom, and the computer executes the commands on the data in that sequence. So data analysis using statistical software is command driven. Commands can do analysis, they can manipulate data, they can load data sets, save data sets, and do all kinds of things. All of these programs are command driven.

R is a bit different because it is not so much statistical analysis software, as Stata and SPSS are, but a statistical programming environment. It's much more focused on programming than Stata and SPSS, which are more focused on just sequentially applying commands to the data. Of course you can do that with R as well, but R is a much more general system.

These also have different target audiences. SPSS is owned by IBM, and one of their main markets is corporations. They want to target marketing departments, so they have analysis techniques that are relevant for marketing, like customer segmentation analysis, and things like that are not relevant for social science research. Stata was first developed by a person with a university background and is focused on the social sciences, and it now covers life science research as well. It is specifically designed for university researchers. And R is a programming environment, so it's designed to be very general, without any specific target audience.

What this difference means is that with R you can do the most things. But because R is general instead of focused on certain tasks, it may not be the easiest or the most efficient tool for a given job. Stata has a narrower scope and is very good at social science research. Most of the things that social science researchers want to do, Stata provides, and it's a fairly nice tool to use for that purpose. SPSS, because of its focus, lacks some of the tools that we apply on the course.
Because it's not focused on the kind of research that we do, you need to go through some extra steps to get some basic results. It may be good for market segmentation, but it's not as easy to use as Stata, once you know how to use it.

The documentation is also quite different. SPSS documentation is about how to use SPSS. That's normal software documentation. It doesn't try to get you to understand regression analysis. It just tells you that if you understand regression, this is how you do it with SPSS. The Stata documentation, on the other hand, explains the analysis as well, so it's a pretty good learning resource too. Whereas the SPSS manual tells you how to use SPSS, the Stata manual tells you how certain analyses are used and why, and how you get things done with Stata. The R documentation is not good for learning at all. Typically it tells you how a certain command is specified. Then it may point to an original source, whoever first invented, let's say, regression analysis, and tell the user to look up the details from that source. So this is less user-friendly documentation.

The availability of these software packages differs as well. Most universities that I've worked with have a site license for SPSS, which means that SPSS is installed on all university computers. Typically the university also provides a way for students and staff to install SPSS on their home computers. Stata, on the other hand, usually doesn't have such a licensing agreement, so it is usually installed in a computer lab. If your university has a purchasing agreement, it's typically fairly easy to get it on your work computer, but probably not on your home computer. R, on the other hand, is open source and typically installed on all university computers. And it's free: you can just download R and the RStudio editor, which is highly recommended, onto your own computer and start using it. There's no cost attached.
There are a couple of different ways to use this software, and the way you use it partially determines which software is best for you.

Data sets and commands are two separate things in statistical analysis software, as I explained. The data file is whatever you got from your data collection efforts. You never edit that. It's columns and rows. Some software allows you to have multiple data sets open; some doesn't. It could be viewed as an advantage to have multiple data files open, but then the problem is that when you execute a command, how do you know which data set you're actually working on? With SPSS, I have had multiple students who were really confused because they ran an analysis and the result was unexpected. The reason was that they had two data files open. They thought they were analyzing the first data file, but SPSS was actually analyzing the second.

Then we have command files, which are a sequence of data manipulation and analysis commands. These store the logic of your analysis. However you want to use your statistical software, you should always have an analysis file at the end. If the software has a graphical user interface, which Stata and SPSS have, then when you use it, it will produce a log file that contains all the analysis commands that you applied in that analysis session. When you are done, you save the log file, extract the commands, and take out those that you don't actually need for the final paper. And then when you write your paper, you store the analysis file. This is important because you need to be able to replicate your analysis later. If someone asks you how you got your results and you don't have the analysis file, you can't repeat your analysis.
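The separation between a data file that is never edited and an analysis file that is read from top to bottom can be sketched in a few lines. This is only an illustration written in Python, which is not one of the course software packages; the data values and the names `RAW_DATA`, `load_data`, and `run_analysis` are made up for the example.

```python
import csv
import io
import statistics

# Hypothetical raw data. In practice this would be a file on disk
# that the analysis reads but never edits.
RAW_DATA = """id,income
1,2500
2,3100
3,4200
4,3900
"""


def load_data(text):
    """Read the raw data into a list of row dictionaries without changing the source."""
    return list(csv.DictReader(io.StringIO(text)))


def run_analysis(rows):
    """Apply the analysis steps in a fixed, readable order and return the results."""
    incomes = [float(row["income"]) for row in rows]  # step 1: pick the variable
    mean_income = statistics.mean(incomes)            # step 2: mean
    sd_income = statistics.stdev(incomes)             # step 3: sample standard deviation
    return {"n": len(incomes), "mean": mean_income, "sd": sd_income}


# Results are kept apart from both the data and the analysis steps.
results = run_analysis(load_data(RAW_DATA))
print(results)
```

Rerunning the script always reproduces the same results from the untouched data, and the order of the steps is visible in the file itself, which is the property an Excel sheet lacks.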
If a reviewer wants changes to your analysis when you submit to a journal, how are you supposed to make them if you have not kept track of what you actually did to the data? On this course, whenever you return a data analysis assignment, you must return a report and an analysis file as well. This is very important, because I can point you to many examples where researchers clearly have not stored their analysis files. When you ask them about the results, they have no idea how they did the calculation, because they may have done it a year ago and have since forgotten. Keeping an analysis file ensures that you can always tell a person who wants to know about your research exactly how you did it.

An analysis file is one way of storing the sequence of analyses, but there are basically three different ways of using this software. First we have menus, so you can generate commands using menus. You point and click: you choose regression analysis from the menu, you get a list of variables, you choose one to be the dependent variable and a couple to be the independent variables, and then you click run. The user interface will then generate the command, which the software runs. R doesn't have menus, which makes using R a bit difficult, at least for beginners, because you need to learn how the commands are typed at the very beginning. R has a steep learning curve because of that. When you open it for the first time, you may have no idea what to do. When you open SPSS or Stata for the first time, you can always see that there's an analysis menu. Perhaps by clicking on the analysis menu you can do some analysis; then there's regression analysis, so perhaps by clicking on that you can do a regression analysis, and indeed that's the way you do a regression.

Stata and R also allow you to type commands interactively. This is the way most professional researchers that I know use their software.
Once you know the basic commands, it's a lot easier to just type regress, or reg for short, for a regression analysis, followed by the names of the variables, instead of clicking through the user interface. You're a lot quicker with the keyboard than you are with the menus. This is something that, for example, the Stata documentation recommends that you start learning, not as the first thing, but as the second thing when you start using the software.

And then you have the analysis files. The analysis file is just a sequence of commands that reproduces all your analyses. Every time you think that you did something stupid, you rerun your analysis file, and that gives you a clean slate with the final analysis. That's how I use this software. When we discuss which of these software packages is the best, one thing that you need to consider is the capabilities of the software and what the analysis file looks like, because you always have to produce at least that. Regardless of whether you use menus or typing, the analysis file is something that you will always have.

So here's an analysis file example. It does the same set of analyses in Stata and in R. You don't have to understand what this means now, but basically this is a regression analysis. We have the regress command here, or the lm command here, for linear model. I have a data set about professions. I'm explaining the logarithm of income, and I have an interaction term between prestige and women. I also have a categorical variable here. So this is the regression analysis. In Stata, we create the log of income. Stata will automatically generate the interaction term for us, and it will automatically handle categorical variables if we indicate them with the i prefix. R will automatically treat this type variable as categorical. And then we have this regression here, with the interaction term written with a multiplication sign to multiply things together. R knows how to deal with that.
The same goes for Stata. Then we have marginal predictions calculated here. That's something that I will discuss on the course quite a lot, because it's a highly useful and underutilized tool. And then we plot the marginal predictions. So in Stata this is maybe one, two, three commands to do a transformation of one variable, a regression analysis, and then a plot of the result using marginal predictions. In R, we need to load a package for the marginal predictions plot. We have two, three, four, five, six commands, out of which one is loading a package and two are just printing out the results, the summary commands. So it's a small number of commands for a fairly impressive set of things.

In SPSS, this is just the regression part. There are no marginal predictions and there is no plotting; you can't do that with SPSS. SPSS doesn't know how to deal with interaction terms, and it doesn't know how to deal with categorical variables in the regression analysis, so you have to dummy code manually. If you can type that, it's fairly quick to do. If you do this part in the user interface, it maybe takes you 10 minutes, compared to just typing the variable name and allowing R to do it automatically for you, or typing i.variablename and allowing Stata to do it automatically once you have told Stata that this is a categorical variable. So in SPSS you need to do a lot more data manipulation before the analysis, because the analysis command is actually less capable. Also, the regression command is fairly involved, and you need to specify lots of things. It's not enough to specify just the dependent variable and the independent variables; you need to spell out all kinds of settings, because for some reason the command doesn't work when you leave them out and doesn't fall back to some useful defaults.
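What the automatic handling of categorical variables and interaction terms saves you becomes clearer when the same preparation is done by hand. The following is a minimal sketch in Python, used only as a neutral illustration and not as one of the course software packages. The variable names follow the professions example, but the values, the category labels, and the `prepare` function are assumptions made up for this sketch.

```python
import math

# Hypothetical rows; the values and category labels are made up for illustration.
rows = [
    {"income": 12000, "prestige": 68.0, "women": 11.0, "type": "prof"},
    {"income": 5000, "prestige": 40.0, "women": 60.0, "type": "wc"},
    {"income": 4000, "prestige": 30.0, "women": 20.0, "type": "bc"},
]


def prepare(rows, reference="bc"):
    """Return new rows with the derived variables a regression would need."""
    # Every category except the reference one gets its own 0/1 dummy variable.
    levels = sorted({row["type"] for row in rows} - {reference})
    prepared = []
    for row in rows:
        new_row = dict(row)  # copy, so the original data is never edited
        new_row["log_income"] = math.log(row["income"])               # transformation
        new_row["prestige_x_women"] = row["prestige"] * row["women"]  # interaction term
        for level in levels:
            new_row[f"type_{level}"] = 1 if row["type"] == level else 0  # dummy coding
        prepared.append(new_row)
    return prepared


prepared = prepare(rows)
for row in prepared:
    print(row)
```

The reference category gets no dummy column of its own, so a row with both dummies at zero belongs to the reference group. Software that understands categorical variables and interaction terms does all of this behind a single regression command.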
In SPSS, once you have done the regression analysis, you need to copy-paste the results to Excel to calculate the marginal predictions and plot them. So SPSS is more work. There's more stuff going into the analysis file, and it does about half of what the Stata and R analysis files do.

So which one do you think is the most convenient to work with in the long run? Well, that's a personal preference. Some people can get away with never editing their analysis files by hand. They just run a command using the menus, check what the command is, and copy-paste it into the analysis file. But if, for example, you need to change how you code this categorical variable, at least for me it's a lot simpler to just edit the syntax here instead of pointing and clicking again. The SPSS syntax is not as user-friendly as that of Stata and R. Admittedly, if you don't understand the basic syntax of any of these software packages, it's going to be fairly impossible to know what any of these files does. But it's simply less typing: regress, then the dependent variable and the independent variables; the same here, lm, the dependent variable and the independent variables, compared to this specification here.

So my take on software is pretty clear. I don't think anyone should be using SPSS for serious research. If you want to be a professional construction worker, you don't go to the closest store and pick the cheapest drill. You go to a hardware store and pick a proper professional drill. It's the same thing here. We have different kinds of tools. SPSS is a very good tool for getting started. If you just want to do the first assignment of this course and never do any quantitative research yourself, you're going to be fine with SPSS. If you want to do this for a living, then Stata is probably a better choice for you. R is also something that you could consider, but the problem is that R is a bit technical.
So if you're a very non-technical person, R may not be the right tool for you.

There are also some good reasons to use SPSS. There are lots of very successful researchers who use SPSS as their main tool. Their main competence is probably something other than data analysis. If you specialize in theory, you just need basic tools for testing your theory, and you have others who do the more advanced tests for you, you're going to be fine with SPSS. But if you want to be very good at statistical analysis and quantitative research, SPSS is probably going to be in your way at some point. I know quite a few people who used SPSS in the past and have since moved to Stata, and I don't know anyone who has used Stata as their main tool and then moved to SPSS. There are some people who have moved from SPSS to R, but that's a pretty big leap because the software packages are so different. That being said, the use of R is increasing. On the courses that I give, R tends to be the most popular option, because you can install R anywhere and it's always available to you. Stata comes next, and then perhaps SPSS. You're going to be fine with SPSS for this course, but it's just not an ideal tool for you in the long run.

So how do you get started? First, you need to familiarize yourself with the software. You need to get a basic feeling for how the software looks and how it works. The Stata introductory manual is very good for getting started. So go and work through the Stata introduction: open Stata, go to the Help menu, go to Getting Started, and start working. It explains how the software can be used by typing commands and by doing things from the menus. If you want to use SPSS, I recommend that you do the same. There is a manual that you can access from the menus; work through chapters one, four, seven, eight, and nine in that manual. Those are the ones that are relevant for my teaching.
If you want to use R, I would recommend that you go through this "Learn to use R" guide from Computerworld, which will give you a rough idea of what the software is about.

Then, when you actually start learning how to use the software, you need some other resources. I recommend some books for R: R in Action and R for Data Science. R in Action is a bit more old-fashioned R, and R for Data Science is a more modern take on R. The problem with R for Data Science is that it gets to pretty advanced stuff pretty quickly, so R in Action is more basic. R for Data Science is something that you should definitely read at some point if you want to be an efficient user of R. For SPSS, I recommend Discovering Statistics Using SPSS. That's a pretty good book. The same person also has a book about R, if I remember correctly, and that may be a good book; I haven't read it myself. For Stata, I recommend that you start by reading the Stata user manual, because it is pretty excellent.

Then search for online examples. There are lots of websites that tell you how to do certain analyses in R, SPSS, and Stata, so you can compare. Ask for help online, for example on this course's data analysis forum. If you have a problem, ask there. Come to the computer lab. For R specifically, there are some really good online courses. For example, DataCamp has an interactive course that takes you a couple of hours to do and teaches you the basics of R. You use R in a web browser. The course tells you what to do, you do it, and after you succeed it tells you the next thing.

One of my favorite resources for learning how to get things done with this software is the University of California, Los Angeles data analysis examples website. The link is here. This is an excellent source because you can compare how certain things are accomplished with different software.