Here is an overview of what we will be covering in the webinar today. I'll start with a general introduction to R, what it is and where you can get it. Then I will cover some of the practicalities of using R, especially if this is your first time. I'll talk for about 35 to 40 minutes in total and then there will be time for questions at the end. As I said before, in order to ask a question you need to type it in. You can type your question at any time, but we'll pick them up at the end.
Let's talk about R and RStudio. R is a programming language for data analysis and statistics. R is free and open source, which means that there are a lot of people creating their own packages and programs for it. It is available for Windows, Mac and Linux machines and is quite popular. R has a huge community of users around the world who are willing to help you with doubts and questions. But R is a scripting language, which means that you have to write code in order to do all your analysis. R on its own is not that friendly or easy to work with, as you can see on the screen. That is why most people don't use R directly. Instead we use RStudio, which is an integrated development environment. It's also free and open source. RStudio is a user-friendly interface that keeps all the capabilities of R in a nice and organized screen. But RStudio acts as a mask for R, so in order to use RStudio we need to have R installed on our machine. Here I put the links for both the R and RStudio websites so you can download them later if you haven't done that yet.
So this is the nice and organized interface of RStudio. It is a lot nicer than R, isn't it? It has a number of features that R doesn't have. You can change the appearance: for instance, here it is black, but you can have it in other colors. When you open RStudio, you will see four panels. Each of them plays an important role in RStudio.
For instance, you can see the script editor and, below the script, the console, which is where all your code is displayed. On the right side of the screen, you can see the environment explorer, where we can see all the data that we are using, the variables and any other objects. And finally, we have the file explorer, so we can explore the folders and files on our computer. There are other tabs available there. For instance, there is the Help tab and the Plots tab, where we can see previews of our graphs.
So moving on from the description of the software, now I'll describe some practicalities that make our life easier with R. I'll start with scripts. A script is essentially a text file, similar to a Word document, where you write code. If you have experience using Stata, for instance, a script is equivalent to a do-file. You could write any code directly in the console, but the problem with writing your code in the console is that nothing that you write will be saved. So the main advantage of scripts is that they can be reused in later sessions. They can also be copied and modified to change our analysis, and shared with others. It's always good practice to write your R code in a script, as it helps to keep track of what you have done.
As you can see on the screen, you create a new script by clicking on the File menu, then New File, and then selecting R Script. To open a saved script, you just click Open File and search for it in your files, and it's going to appear. For instance, this is a typical screen for a script. As you can see, you write in the script directly. To run code in the script, you just click on a line of code, or highlight several lines, and click the Run button there. Or you can select the line of code that you want and press Ctrl plus Enter on Windows or Command plus Return on Mac. I'm not a Mac user, so I'm not sure about that.
In this case, the code that I have on my screen imports a CSV file. Since this code is importing a dataset, the result is not going to appear in my console, but it's going to appear in the environment, something like that. As I mentioned before, you have the option to write notes in your script. These notes can be a free-text explanation of what you are doing, or you can add comments about the code that you are using, whether it's working or not, or anything else.
So let's talk now about the working directory. There is nothing fancy about it. It's just basically a folder, usually on your computer, where the data that we need to analyze is saved. In order to use R for data analysis, we need to tell R specifically where this folder is, so R can import the data. How do we tell R what our working directory is? Well, we can use the function setwd, set working directory, that you see on your screen, followed by the path on your computer where the folder is. Another, easier way of setting the working directory is just asking RStudio directly. How do we do that? We go to the Session menu in the main window of RStudio, then Set Working Directory, and then Choose Directory, and pick the directory that we want. After that, we get a screen that displays the working directory that we have set. So RStudio will set the working directory to your preferred location and display the setwd function in the console. You can save that code for next time.
One of the most frequent questions that I get when I'm teaching R is why R doesn't know where the data is. That usually happens when people try to import data into R without setting the working directory first. R is not very intelligent, so R doesn't look for the dataset across your whole computer. It only looks in the working directory. By default, R has a working directory, but it's never the one that we want, so we need to change it at the beginning.
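The working-directory idea above can be sketched roughly as follows. This is only an illustration under assumptions: the folder `data_webinar_demo` and the file `example.csv` are hypothetical and are created under `tempdir()` so the sketch does not touch your own files.

```r
# Hypothetical folder and file, created only for this illustration.
data_dir <- file.path(tempdir(), "data_webinar_demo")
dir.create(data_dir, showWarnings = FALSE)
writeLines(c("id,age", "1,34", "2,51"), file.path(data_dir, "example.csv"))

# Point R at the folder; setwd() invisibly returns the previous directory.
old_wd <- setwd(data_dir)

# A bare file name is now looked up inside the working directory.
found_in_wd <- file.exists("example.csv")
print(found_in_wd)

# Restore the previous working directory so the session is left unchanged.
setwd(old_wd)
```

In your own script you would replace `data_dir` with the path to the folder where your data is saved.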
If you have problems, if you don't know what working directory R is using, you can use the function getwd, get working directory, as shown there, and that is going to show you where the working directory is.
Now we are going to move on to packages. Packages are collections of R functions or data that are compiled in a well-defined format. An R package bundles a bunch of things: it includes functions, help menus, tutorials and examples that are stored in a neat package. R comes with a standard set of packages already installed, but we usually need more than those packages when we are performing our analysis. For example, there are packages to import data in specific formats, like Stata or SPSS. We also have the package ggplot2, which is for making graphs, or the package tmap for mapping. Once a package is installed, it has to be loaded into the session to be used.
Here you can see the function that we use to install packages. We use the function install.packages and then the name of the package. There is also another, easier way, which is just clicking on the screen, like this one. We go to the Tools menu, then Install Packages, and then look for the package, like that. So you can type the name of the package that you are looking for and then press Install. When you are installing a package, you will see a screen similar to this one, where I'm installing the package called tidyverse. It can take some time to install a package, depending on how big it is.
After installing packages, R is not quite ready to do our analysis. We still need one more step, which is loading the packages into R, like this. We load packages using the function library, followed by the name of the package that we are loading. For instance, here I'm loading the tidyverse package that I was installing in the previous slide. The package tidyverse is composed of several other packages like ggplot2, tibble, tidyr, and readr.
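A minimal sketch of the install-once, load-every-session pattern described above. The vector name `wanted` and the check against `installed.packages()` are choices made for this illustration, not something shown in the webinar; the actual install call is left commented out because it needs an internet connection.

```r
# Packages mentioned in the talk so far.
wanted <- c("tidyverse", "ggplot2")

# Names of the packages already installed on this machine.
installed <- rownames(installed.packages())

# Which of the wanted packages still need installing?
to_install <- setdiff(wanted, installed)
print(to_install)

# Installing is done once per machine (uncomment to run).
# install.packages(to_install)

# Loading has to be repeated in every new R session.
if ("tidyverse" %in% installed) {
  library(tidyverse)
}
```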
We can install all of them separately as well. You only need to install a package once, but you need to load it every time that you use it. You can load a package at any time during your analysis. Just make sure that all the packages that you are using in an analysis are also in your script, so you know what you did next time.
Now I'm going to talk about data types and data structures. This is important because R has different names for some widely known kinds of variables and data structures. I won't be going into much detail on all of them. I just want to mention the most important ones, because when you are looking for help online, you're going to find that people usually use the word vector instead of variable, or something like that.
Let's go with data types. We have characters, which are just nominal variables, like for instance female and male. This is what Stata calls string variables. We also have numeric, which is a real or decimal number, just a number. Integer: as per definition, an integer is a whole number. For instance, five is an integer, but 5.1 is a decimal, so it's not an integer. And logical, which only takes the value TRUE or FALSE.
In data structures, we have vectors, which as I said are just variables, factors, lists, matrices, and data frames. Factors are the equivalent of... For instance, when you put labels on the values of a variable, say number one is for female and number two is for male, that's a factor variable for R.
Variables are objects in R that store values. For instance, the number three here is just a number, but when we assign the value three to the letter A, it becomes a variable, the variable A. I just mentioned the word assign. In R we can assign values to variables, we can assign variables to a dataset, we can assign labels to variables, and so on.
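The data types listed above can be checked with `class()`. The following is a small sketch with made-up values; the variable names (`sex`, `height`, `children`, `employed`, `sex_code`) are hypothetical and chosen only for illustration.

```r
# One made-up value per data type.
sex      <- "female"  # character (a string variable in Stata terms)
height   <- 5.1       # numeric: a real or decimal number
children <- 5L        # integer: the L suffix asks R for a whole number
employed <- TRUE      # logical: TRUE or FALSE

print(class(sex))
print(class(height))
print(class(children))
print(class(employed))

# A factor: codes 1 and 2 with the labels female and male attached.
sex_code <- factor(c(1, 2, 2, 1), levels = c(1, 2),
                   labels = c("female", "male"))
print(levels(sex_code))
```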
To assign a value, we need a special operator, a combination of two characters as highlighted there, a less-than sign followed by a hyphen, which is called the assignment operator. You can do all sorts of operations with variables, such as multiplication, addition, et cetera. We can display the contents of a variable by typing its name into the console, as in this example with the variable D, which is just the multiplication of the variables B and C divided by A.
Vectors, as I said before, are just variables, but they are defined as a single entity consisting of a collection of things. For example, a collection of letters, names, or a collection of numbers. So A is a vector as well. We create vectors using the function c, for concatenate. For example, the typical variable age in any dataset that we usually use is a vector. We normally aren't going to be creating vectors ourselves, but it's good for you to know what we are talking about when you see the word vector around. As I said before, there are different types of vectors, such as character or string, numeric, which only hold numbers, or logical, when they are TRUE or FALSE, et cetera.
Now we are moving on to data frames, or the dataset proper. A data frame is a very important data structure in R. To put it simply, it's just a series of rows and columns, with headers indicating variable names and row numbers indicating case numbers. Tibbles, on the other hand, are also a structure for storing tabular data. They are basically data frames, but with some tweaks. As you can see, tibbles are a bit more informative than data frames and they immediately tell us the class of each variable, the type of variable. For instance, the variable country is a character and the variable year is an integer. That is the kind of thing that a data frame doesn't give you. Tibbles also make it easier to read the numbers by highlighting the thousands, as you can see there.
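A minimal sketch of assignment, vectors and data frames as described above. Only `a <- 3` comes from the talk; the values of `b` and `c_val`, and the small `family_demo` table, are invented for illustration. The tibble preview runs only if the tibble package happens to be installed.

```r
# a = 3 comes from the talk; b and c_val are made-up values.
a <- 3
b <- 4
c_val <- 6            # named c_val to avoid confusion with the function c()
d <- b * c_val / a    # multiplication of two variables divided by a
print(d)

# c() concatenates values into a vector.
age     <- c(23, 45, 31)
country <- c("Chile", "Kenya", "Spain")
year    <- c(2001L, 2002L, 2003L)

# A data frame: rows are cases, columns are variables (vectors).
family_demo <- data.frame(country = country, year = year, age = age)
str(family_demo)

# The same data shown as a tibble, if the tibble package is available.
if (requireNamespace("tibble", quietly = TRUE)) {
  print(tibble::as_tibble(family_demo))
}
```

Printing the tibble shows the type of each column under its name, which is the extra information mentioned above.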
But on the other hand, a data frame offers a slightly cleaner view than a tibble, so it's up to you which one you want to use. Both types of dataset are the same and work the same in R, but tibbles are the de facto data frame type used with the tidyverse packages, because they work really, really well with tidyverse. So why am I telling you about these two types of tabular data? Just because data frames are the most well-known data structure in R, and you are probably already familiar with them. However, we will be using tidyverse in the workshop, so when you are importing data you are going to see the word tibble, and you may get confused, and I don't want that to cause confusion. If you want to learn more about tibbles and data frames, you can read chapter 10 of the book R for Data Science; there's a link there. It's a very short chapter, about three pages, and it has a good explanation of the difference between tibbles and data frames.
So once we have set the working directory and have a better understanding of the data types and structures that we have, we can finally start bringing the data into R for some analysis. R can analyze almost any format, but we need to make sure that R will be able to read the data. There are a few packages used for importing data, for instance the package haven, the package foreign, and the package readr, and you can install them separately. Importing data using foreign, for instance, will create a data frame, but if we use the package haven, it will create a tibble. One of the advantages of using haven is that it works really, really well with labels and with datasets in Stata format, and SPSS as well. So you have to use the right function to import the data. In this case, there are some examples using the function read_csv, or read_dta for Stata files, or read_sav for SPSS files.
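As a rough sketch of choosing an import function by file format: the example below writes a tiny, made-up CSV file to `tempdir()` and reads it back. The file name `family_demo.csv` and its contents are hypothetical. Base R's `read.csv` is used so the sketch runs without extra packages; the `readr` call runs only if readr is installed, and the `haven` calls are shown as comments because no Stata or SPSS file is available here.

```r
# Hypothetical CSV file created only for this illustration.
csv_path <- file.path(tempdir(), "family_demo.csv")
writeLines(c("age,sex", "34,female", "51,male", "NA,female"), csv_path)

# Base R: read.csv() returns a plain data frame.
family_df <- read.csv(csv_path)
print(class(family_df))

# readr (part of tidyverse): read_csv() returns a tibble.
if (requireNamespace("readr", quietly = TRUE)) {
  family_tbl <- readr::read_csv(csv_path)
  print(class(family_tbl))
}

# haven: functions for Stata and SPSS files (file names are placeholders).
# family_stata <- haven::read_dta("my_file.dta")
# family_spss  <- haven::read_sav("my_file.sav")
```

In each case the result is assigned to a name, so the imported data is kept in the environment.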
Another important thing is that you have to name your data, give your data a name, otherwise it's not going to be stored in R. So I'm going to... Here you can see what the R session looks like. I've set the working directory to my folder of choice, and you can see on the right of the screen the folders and files I have in my working directory. The data that I want to use are in the data webinar folder here. As I mentioned before, this is the file explorer: you can click on any folder there and explore what you have in it.
So I'm going to show you the easy way to import a dataset into R. This way only works if we have the tidyverse package installed and loaded. In this case, we're going to focus on the file explorer window. I just double-click on the data webinar folder that I showed before, where my data is saved, and I want to import the family composition dataset that you can see there. So I just click on the name of the data, and RStudio is going to give me two options: View File and Import Dataset. I just press Import Dataset, and the data is going to be imported.
This is the screen that we get after choosing to import the data. As you can see, there is a preview of what the data looks like. As it uses the tidyverse packages, it's telling you that it reads the data with the package readr. Here is a set of import options, and we can change the name, which is the name there, or modify anything if we want, but for now we're going to leave it like that. Here there is a code preview, which shows us the code needed to import the data, so that way we can learn how to do it. After that, we just press Import, and the data will get imported.
So here's the data already imported. There is a preview of the data, and now the console. On the right we have the environment window, in which we see that we already have a data frame, or a tibble. It says that it has over 19,000 observations and 11 variables.
Then in the console we can see the code used for importing the dataset. This is something that we can just copy into our script so we can save it for later. The not-so-easy way of doing that is just writing the code, like that. I'm importing the same dataset; I just change the name to family, and I run the code and I get two datasets. Sorry. I wanted to show you that either way of importing data is fine. If you want to write code, do that; if you want something easier and just ask RStudio to do it by clicking, just do that. There is no problem with that. Either way works fine.
R has a particular way of working. To perform any operation on any variable, we need to use the specific name of the data frame that we're using, every time. There is an example here with the family data: I'm using the function class on the variable age of the family dataset. In this code the function is class, which gives us the type of the variable, such as character, numeric or whatever. So we write the function followed by a bracket and then we specify the data frame that we are using, in this case family. Then you can see a dollar sign there, which picks a variable out of a data structure, followed by age, which is the variable that we wanted to use. It's a bit inconvenient, but you get used to it very quickly. After running the function, the result that I get is the class, or the type of variable, that age is, and it says that it is a numeric variable.
Now I want to show you the same but using RStudio directly. This is the RStudio screen. I have my script here, and if you want to create a new script, you just go to New File and then press R Script, and you're going to get a script here. If you want to open an existing script, just open it by searching in your folders. It's already there.
For instance, for the working directory that I showed you before, you go to Session, Set Working Directory and Choose Directory, and then start looking for the folder that you want. Or you can just use code like this one. I'm just selecting this and pressing Run, and the working directory is set. If I want to know where my working directory is, I just use the function getwd. You can just click on the line; you don't need to select it. But if you want to select, just select and press Run, and then it's telling me where my working directory is.
To install packages, you can use the code here. This is the code, install.packages. You can install more than one package at a time, so we can actually do this. But I already have these packages installed, so I'm not going to install them now. Or you can go to Tools, Install Packages, and then this window is going to appear. You look for the package, and then press Install. Once the packages are installed, you have to load them, like that. Run. This is attaching the packages, all the packages related to tidyverse. Then I load the package haven and the package foreign. And now we can use all three of these packages in our analysis.
So this is just an example of how variables work. Here I'm creating a variable. You can see the variable A with the value 3 is there. If I type A here in the console, I'm going to get the value. Here I'm creating more variables. We can do some operations with them, like this one here, and we can save the result in another variable, the variable D. Then we can see what the variable D is. And all of these are stored here in the global environment.
So, loading and importing data. We have this code here that we can use, or I'm going to show you here how to do it by clicking. Here you can see my working directory. We can actually move around here and explore. This is the dataset that I want to import. So I just click on it once. There are the two options, View File and Import Dataset. I choose Import Dataset and I get this screen.
So here, for instance, I can change the name. I'm going to put just data, and it's going to change here. So now I can just import the data. You can see the code is here in the console. You can copy it. Here you can see that we have the new dataset, data, in the global environment, and here we can see a preview of the data. If we use the code, it's basically the same code. I'm going to store the same dataset, family composition, but just by code. And you can see it's the same dataset. Just two ways of importing the same dataset.
So, using functions. For instance class, as I told you before, is a function that returns the type of data or structure that we have. The class of family here is a data frame or a tibble. It has four different attributes. The class of the variable age belonging to the dataset family is numeric, and the variable sex is character. You can see age here, or sex, male, female. So sex is character, and age is just a number, so that's why it is numeric. The class of twin sister is numeric as well.
So what else? Some basic analysis that you can do. Summary is to get the central tendency of any numeric variable. With summary of age we get the result here in the console: we have a median of 21.3 and a mean of 27. We can ask, for instance, for something else. This value here, NA, is telling us that there are some missing values, 283. So we can get the standard deviation, for instance. We use family. RStudio is also good because it gives you hints about what you may want to use. Family, and then we can search for the variable. And then we get NA here. This is something that is very common. We get that because we have missing values. So we need to use another argument here. This is na.rm, which basically means remove all the missing values. And we put TRUE. And then we can run it again, and we have the standard deviation for the variable. For non-numeric variables, like nominal ones, we can have a frequency table using the function table.
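The basic analysis steps just described can be sketched on a small, made-up data frame. The object name `family` mirrors the demonstration, but the five rows below are invented for illustration and are not the webinar's family composition data.

```r
# Made-up stand-in for the family dataset used in the demonstration.
family <- data.frame(
  age = c(21, 35, NA, 8, 62),
  sex = c("female", "male", "female", "female", "male")
)

# The dollar sign picks one variable out of the data frame.
age_class <- class(family$age)
print(age_class)

# summary() reports quartiles, the mean and the number of NAs.
print(summary(family$age))

# With missing values, sd() returns NA unless na.rm = TRUE is given.
sd_with_na    <- sd(family$age)
sd_without_na <- sd(family$age, na.rm = TRUE)
print(sd_with_na)
print(sd_without_na)

# Frequency table for a nominal (character) variable.
sex_counts <- table(family$sex)
print(sex_counts)
```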
We run it and then we have how many women we have in the data. So yes, this is what I wanted to show you with the demonstration.
Now let's recap what we have seen. First of all, when you start with R, don't forget to set your working directory. First thing. Then install and load all the packages that you're going to need. Obviously you can load more packages later if you're doing more analysis and you need them. But remember to always install the packages, otherwise the functions in your analysis are not going to work. Third, you have to import the data. You have to use the right function for the data that you have, depending on the format. Give your data a name so you can then perform any operation with it. Just remember that R is case sensitive, so be careful with capital letters, with spaces and all of those kinds of things that are sometimes problematic. Another piece of advice: when you are naming a dataset, use an easy-to-type name, because you have to type it every time. Remember, you type the name of the dataset and the name of the variable. So choose something short and informative.
So where to go if you're stuck? Well, basically, another thing that I wanted to tell you is that the only way to learn R is trial and error. Try, try and try, and probably a lot of errors. A lot of them, yes. As I said before, R has a huge community online, so there are also a lot of resources online. I've just highlighted some of them here. I put a link to the book R for Data Science, which is completely available online with some exercises and nice explanations. There are other resources like Quick-R or Stack Overflow that are really good, because on Stack Overflow you can ask questions or see other questions that people have asked about R and how to do things.
Yeah, so follow these links, and there's no shame in copying. If you see code online from someone who is doing something similar to you, just copy that code and modify it for your purposes: change the name of the data and the names of the variables, and then you can use it. That's the way to learn.