Hi, I'm John Little and you're watching the Introduction to R instruction series. This series is part of the Rfun learning and resources website sponsored by the Center for Data and Visualization Sciences at Duke University Libraries. In this session I'll give a quick demo of how you can do a data analysis project in R using RStudio. All the topics I cover here have longer videos if you want to dig into more detail. So quickly, let me explain what I'm going to cover. We're going to take a comma-separated values, or CSV, file; move it into a folder on the local operating system; make an RStudio project; and import that data into the project using an R Markdown notebook. We'll introduce libraries, attach some onboard data, do a quick visualization, and demonstrate some other exploratory data analysis. Then we'll do some quick data manipulation demonstrating the dplyr library, which is part of the tidyverse, make an interactive chart, do some quick linear regression, and save the notebook. So let's get started. First, on my local file system, I want to make a new folder. I usually put my projects in my Documents folder, so I'll choose New Folder and make a new folder called Test. Inside Test, I'm going to make another folder called Data, because that's where I want to store my raw data. Now I'm going to take that CSV file and drag it over into the Data folder. The other thing I'll do is copy in a script I use to keep myself on task. So now in my new directory I have a Data folder with a data file in it, and I have a script that I want to use. So I'll launch RStudio, go to New Project, and make that existing folder my RStudio project. Notice you can make a new directory right from here, but I have an existing directory. It's in my Documents folder, and it's called Test.
And I'll click the Create Project button to make that project. The first thing you'll notice is that the quick-start Rmd file I moved in is there, along with the Data folder, and inside the Data folder is that file. So I'll go back to the root of the directory and open up a new R Notebook: File, New File, R Notebook. I'll put my name in the author field and fill in the date. I'm going to get rid of all this boilerplate text, though it's useful information for beginners. First I'll load the libraries, and then import some data. So I'll make a second-level header, and before that I'll put a little explanation. Okay. I'll give my file a name, test one, and save it. Notice that as soon as I saved it, a derivative report file got created. The format of that report file is identified right there, in the output field, as an HTML notebook. All right: load library packages. I'm going to insert an R code chunk and type library(). I'm going to turn the warnings off on this chunk because I know there will be warnings. If you're new to this, I recommend you not do that, but for me it keeps things cleaner. I'm going to use a couple of other libraries. All the libraries available to me have already been installed, and they're listed right here in the Packages tab. If I needed to install a library I didn't have, I could click Install and type in its name. Okay. So: library(tidyverse), for data manipulation and visualization. Library(skimr), which is good for skimming data for EDA, or exploratory data analysis. Library(plotly), which lets me turn my visualizations into interactive visualizations. Library(moderndive), which gives me some convenience functions to look at my regression analysis. And library(broom), which actually does something similar to moderndive; I probably won't use them both. Okay. So I'm going to load my libraries, and then I'm going to attach some data.
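The library-loading chunk described above might be sketched like this; in the notebook, warning suppression is done with chunk options rather than code, so it appears here only as a comment:

```r
# Load the packages used in this demo (all must already be installed).
# In the notebook, this chunk carries the options `message=FALSE, warning=FALSE`
# to suppress the startup messages mentioned in the video.
library(tidyverse)   # data manipulation (dplyr) and visualization (ggplot2)
library(skimr)       # quick exploratory summaries of a data frame
library(plotly)      # interactive versions of ggplot2 charts
library(moderndive)  # convenience functions for inspecting regressions
library(broom)       # tidy model output; overlaps with moderndive
```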
So: import. Remember how you watched me move this data into the Data folder using my file system, my Windows Explorer. I'm going to insert another R code chunk and create an object called favorability. I type the assignment operator, which you can read as "gets value from." On my keyboard I can type Alt+dash, or you can type it by hand: less-than, dash. I'm going to use the data import function read_csv(). Inside double quotes I hit my Tab key so that I can use tab completion and navigate to the Data folder; Tab again, and since there's only one file in there, it completes the filename for me. When I run this bit of code, I get an object up here in my Environment pane that is the import of this data, and now I know that I have a data frame with 14 observations and two variables. Now, it might be interesting for you to know that you don't actually have to have the data on your local file system. If it's accessible on the internet, you can create another object and import the same data just by putting in a URL. The skip argument skips the first 11 lines, because in this case the file begins with information about where the data was originally found: FiveThirtyEight. So I can run that, and now I have two objects in my environment that are actually identical. I also want to attach some onboard data. Because we loaded the tidyverse library, and the tidyverse is really a mega-library that attaches eight other tidyverse packages, one of the packages we're going to use is dplyr, and the starwars data frame ships as part of the dplyr package. So when I execute this code chunk, I have all three. So let's go ahead and look. There's the starwars data and the favorability data, side by side.
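A sketch of the import chunk just described. The object name, the local file name, and the URL are stand-ins, since they aren't legible from the walkthrough; only the pattern (a local read_csv() call, and an equivalent web import with skip = 11) comes from the video:

```r
library(readr)  # part of the tidyverse; provides read_csv()

# Import from the local Data folder (file name is a stand-in)
favorability <- read_csv("data/favorability.csv")

# The same data could be imported straight from the internet
# (URL is hypothetical); skip = 11 drops the explanatory lines
# at the top of the raw file.
# favorability_web <- read_csv("https://example.com/favorability.csv",
#                              skip = 11)
```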
The favorability data is a favorability rating for Star Wars characters; there are 14 of them. In the starwars data there are 87 characters, with various bits of information about each character. Now, a quick visualization. In this case we're going to use ggplot2, which is another one of the tidyverse libraries. I'm going to paste some code in here, and you can get this code from the GitHub repository. A plot object: plot1 gets value from the ggplot() function, where the first argument is the data frame starwars, which we just loaded. We have to identify and map the x-axis to a variable inside of starwars: hair_color. You can see it right there; the aesthetics argument, aes(), maps x to hair_color. And then we visualize that with geom_bar() to make a bar plot. All of that creates an object named plot1 with this function in it, and when we call plot1, that function gets run. If we execute this code, we see this. I'm going to make my screen bigger. Let's make just one improvement. Ctrl+Alt+I is the keystroke to insert a code chunk; you could also type it out by hand. In this one improvement I just want to sort the bars from the most frequent to the least frequent, using another tidyverse package called forcats, which enables you to work with vectors as categories, or factors. So I'm doing that right there, using the forcats function called fct_infreq(), which will order the bars by frequency. When I execute this code chunk, you can see that I now have a slightly better bar chart. It still needs a fair amount of work, but this is a quick overview. Right, moving on with my exploratory data analysis. Let's insert a code chunk and use the skimr library to skim starwars. Execute that, and we get some basic information about this data frame, some of which we already had, like that there are 87 observations and 14 variables, or 87 rows and 14 columns. Go back to the full view and you can see that right there.
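The two plotting chunks and the skim call might look like this; the object name plot1 follows the narration, and the rest is standard ggplot2, forcats, and skimr usage:

```r
library(tidyverse)  # ggplot2 and forcats
library(skimr)

# A basic bar chart: count of characters by hair color
plot1 <- ggplot(starwars, aes(x = hair_color)) +
  geom_bar()
plot1  # calling the object runs the plotting function

# One improvement: order the bars most-frequent first with forcats
ggplot(starwars, aes(x = fct_infreq(hair_color))) +
  geom_bar()

# Quick exploratory summary of the whole data frame
skim(starwars)
```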
So we get some calculations on the 14 variables. It tells me some other information too, broken out by type: character data, list data, and numeric data. It gives me information about the character variables, information about the list variables, and information about the numeric variables, along with some spark graphs down here at the end that help me see the distribution of the numeric data as a histogram. Another way to get a summary: insert a code chunk and just type summary(). I'll use the other data frame, favorability. That is a data frame with only two variables, one with the favorability rating and one called name, and it tells me things that are nice to know: the name variable is a character vector of length 14, the rating is a numeric vector, and it gives me the quartiles, the median, and the mean. All right, so let's join those two data frames together using a dplyr function called left_join(). We start with the starwars data frame and, using a pipe, which we read as "and then," we left_join: the left data frame is starwars, the right data frame is favorability, and we're joining by the variable called name, which exists in both data frames. Then we're just using a select statement and an arrange statement to present the data in an easier-to-read fashion. And what we can see is that we've added the favorability rating to the characters in the starwars data set. All right, let's transform this data further with the five most common dplyr verbs. First, in a code chunk, we'll introduce the select() function and select from starwars. If we look at starwars by itself, it has 14 variables, and we want just a few of them, so we can name them. You could also call them out by position, column one, column two, but let's call them by name; it's easier. Actually, we're going to select three of them here: name, gender, and hair_color. And there we have a three-column data frame, subset by columns.
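A sketch of the join and the column subset just described. The name of the rating column in favorability is an assumption (fav_rating is a stand-in), as is the exact select/arrange cleanup:

```r
library(tidyverse)

# Join favorability ratings onto the starwars characters by name,
# then reorder columns and sort so the result is easy to read.
starwars %>%
  left_join(favorability, by = "name") %>%
  select(name, fav_rating, everything()) %>%  # fav_rating is a stand-in name
  arrange(desc(fav_rating))

# Subset by columns: keep just three variables, called out by name
starwars %>%
  select(name, gender, hair_color)
```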
To subset by rows, we can use the filter() statement. In this case we're going to say gender equals "feminine." We use a double equals sign for equality, and it's a character comparison, so we put the value in quotation marks. Everything under gender is feminine: we get 17 characters, including Leia Organa and Beru Lars. We insert another code chunk, and this time we'll sort the data using the function arrange(). Take starwars and then arrange in descending order by height, and sub-arrange in descending order by name. Arranging by height goes tallest to shortest, and any place there's a tie on the same value, we sub-arrange, in this case in reverse alphabetical order, so a name starting with T comes before a name starting with R. Now we use mutate(), which we can use either to create a new variable or to modify an existing variable. So, Cmd+Option+I (Ctrl+Alt+I on Windows) to insert my code chunk, and I paste in some code with the mutate() function listed right there. I do a little bit of modification first, so the data frame is subset to look like that, and then I create a new variable called big_mass, which gets its value from the function mass times two. Next we introduce the count() function, which is actually a special case of the summarize() function, but I think it's the easiest way to understand how summarize works. So we'll start with count(). Count allows us to create subtotals of variable values, so we can start with the starwars data set and count how many characters are listed under each gender. Here we see that we've got 17 characters listed as feminine, 66 characters listed as masculine, and 4 characters where the gender is not listed. All right, so that is, as I said, a special case of summarize(). Summarize is a way to get, essentially, column totals. So here we're going to take starwars and then drop any NAs from the mass variable.
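The four verbs just demonstrated, sketched in one chunk; the select() before mutate() stands in for the subsetting he does on screen:

```r
library(tidyverse)

# filter: subset rows where gender is feminine (17 characters)
starwars %>% filter(gender == "feminine")

# arrange: tallest to shortest, ties broken in reverse alphabetical order
starwars %>% arrange(desc(height), desc(name))

# mutate: create a new variable from an existing one
starwars %>%
  select(name, mass) %>%
  mutate(big_mass = mass * 2)

# count: subtotals of a variable's values (a special case of summarize)
starwars %>% count(gender)
```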
And then we're going to summarize by summing all of the mass values together. What we get is a column total telling us that all the characters in the data frame that have a mass total 5,741.4 kilograms. As you can see, it seems to work similarly to count, but you can do more with summarize than you can with count, because you can do groupings that work for you. For example, I'm going to put in just a subset of the code at first. If I drop the NAs from height and then group by gender and sub-group by species, I can then summarize with mean_height, a new variable which gets its value from the function mean of the variable height for each gender and species, and another new variable called total, which gets its value by counting the rows that fit those groupings. When I execute that code chunk, I now have two new variables. I have my groupings, gender and then species, and I have the mean height for each one of these gendered species and the total number in each of those categories. Now, if I just add a few more lines of code, I can clean this up and make it easier to look at. I'm going to list species first and gender second, and I can learn that I have 23 masculine humans with a mean height of 182 centimeters and 8 feminine humans with a mean height of 160 centimeters, and you can see the list goes on down. So let's make an interactive plot. This uses the plotly library. You remember that earlier in this broadcast we made a plot object called plot1. We can call that again, and it looks like that, and it's a nice plot. It needs work: it needs an x-axis label, a y-axis label, and a title. It could use some color. It could be improved by flipping the coordinates, swapping x and y, which would make it a lot easier to read, but then we'd need to order it a different way.
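The column total and the grouped summary could be sketched like this; the final arrange() is my reading of the cleanup step that lists species first:

```r
library(tidyverse)

# Column total: sum of mass across all characters that have one
starwars %>%
  drop_na(mass) %>%
  summarise(total_mass = sum(mass))

# Grouped summary: mean height and a row count per gender/species grouping
starwars %>%
  drop_na(height) %>%
  group_by(gender, species) %>%
  summarise(mean_height = mean(height),
            total = n()) %>%        # n() counts the rows in each grouping
  arrange(species, gender)          # list species first for readability
```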
But the point that I really want to show you is that you can make any ggplot object interactive using the plotly library, by wrapping the plot in a function called ggplotly(). And now I have what looks like the exact same bar chart, but this chart, if you include it on a website or in an HTML notebook or something of that nature, is interactive. It has pop-up windows that tell you things about each bar, and you can zoom in on a particular section of the chart; if you double-click, you zoom back out. That works really, really well in dashboards. So let's talk about regression. We're going to do a simple linear regression. In order to do that, we'll create a model object. It gets its value from the function lm(), for linear model. The response variable is mass, and it will be predicted by height, the predictor variable, or the explanatory variable. Notice that we identify the response variable on the left, then the tilde, and then height as the predictor. We also have to identify the data frame, and in this case I'm subsetting the data frame using a standard tidyverse pipe right there, and then a filter: mass is less than 500. That allows me to leave out Jabba the Hutt, which will make my prediction more robust, although technically incorrect. So: model gets its value from the lm() call, and then I call model to display the value of the model. I can also type summary(model). Let's start using some of the moderndive functions. Broom has similar functions; I'm not going to recommend one over the other. I think they're both very good. In this case, we're going to use the get_correlation() function to get the correlation of mass over height, and we can see the correlation value. Then there's the get_regression_table() function, which is similar to the tidy() function in broom.
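A sketch of the interactivity wrapper and the model fit; plot1 is rebuilt here so the chunk stands on its own:

```r
library(tidyverse)
library(plotly)

# Wrap any ggplot object in ggplotly() to make it interactive
plot1 <- ggplot(starwars, aes(x = hair_color)) + geom_bar()
ggplotly(plot1)

# Simple linear regression: mass predicted by height,
# filtering out the one extreme outlier (Jabba the Hutt, mass > 500)
model <- lm(mass ~ height,
            data = starwars %>% filter(mass < 500))
model           # prints the fitted coefficients
summary(model)  # fuller summary: residuals, R-squared, p-values
```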
I can generate a nice table that has the information I need in order to make a useful assessment of this linear model: for each additional unit of height, the average mass increases by 0.62 kilograms. And then I have a whole table that includes the standard error, the statistic, and the p-value, so I can see that in this case it looks like it's significant. Now, there's some other model data that I'd like to get. In broom, you would use glance(); in moderndive, you use get_regression_summaries(). So you can see, for example, the R-squared and the adjusted R-squared, and the statistic and p-value again. One more nice tool that you have in both broom and moderndive: in broom it's called augment(); in moderndive it's get_regression_points(). You get data for each point. We had mass and height; now we also have the predicted mass and the residual difference. Lastly, let's visualize that regression, and we'll do that with the ggplot() function that we learned before. So: starwars, and then filter where mass is less than 500, and then send that to ggplot(), where the x-axis is mapped to height and the y-axis is mapped to mass. Then we use geom_jitter(), which is like geom_point() but allows us to minimize the effects of overplotting. And then we also use geom_smooth(). Let's run just the first layer: there are the plotted results, mass over height. And if we run the whole thing, we also draw a regression line with a band showing the standard error, a confidence interval. One more thing: let's talk about reports. Up in the YAML header, I have the standard default output for an R Notebook: the output is html_notebook. As I'm working on the test one R Markdown file, when I click Preview, it renders the test one notebook, which is a notebook I could attach and share with somebody. We can go back here to the working directory to see it. I could scroll it here, but it's a little hard to see.
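The moderndive inspection calls and the final regression plot, as sketched from the walkthrough; the model is refit here so the chunk is self-contained:

```r
library(tidyverse)
library(moderndive)

model <- lm(mass ~ height, data = starwars %>% filter(mass < 500))

# Correlation of mass and height on the filtered data
starwars %>%
  filter(mass < 500) %>%
  get_correlation(mass ~ height)

get_regression_table(model)      # coefficients table, like broom::tidy()
get_regression_summaries(model)  # R-squared etc., like broom::glance()
get_regression_points(model)     # per-point fits/residuals, like broom::augment()

# Visualize the regression: jittered points plus a linear fit with its
# standard-error band
starwars %>%
  filter(mass < 500) %>%
  ggplot(aes(x = height, y = mass)) +
  geom_jitter() +
  geom_smooth(method = "lm")
```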
I can also scroll here and see the report for the document I just made, which includes, by the way, the interactivity. So I can share this file with somebody just as I would share a Microsoft Word file. Beyond that, let me note that there are many different kinds of rendered reports you can make. There are notebooks like we just made, and you can make a more polished HTML document; these are basically the same thing, one for development and one for production. You can make a Word document; you can make a slide deck with xaringan; you can make dashboards with flexdashboard or Shiny. You can make what you might call a web book with a package called bookdown, websites with a package called blogdown, more simplified websites with a package called distill, or a PDF document. And in fact, there are many other types of documents you can render. The point I want you to take away is that all of that is managed up here in the YAML header. All you have to do is configure your system and follow the directions. So that was a quick rundown. For anything we missed, please feel free to go to the subtopic videos, which drill down deeper and move a little slower on the various aspects we just covered. Thanks for watching.