I'm John Little, and you're watching the Introduction to R instruction series. This series is part of the Rfun Learning Resources website, sponsored by the Center for Data and Visualization Sciences at Duke University Libraries.

So I want to talk to you, as briefly as I can, about reproducibility and RStudio projects. They go hand in hand. I want to give you some super brief descriptions of what I mean by reproducibility and projects. I want to give you a very quick demonstration, and then I'll give you some tips. But first, before I do that, let me talk a little bit about motivation. A lot of students who are new to R come to me doing everything in one place: they have lots of different projects organized into one single RStudio project. They share that file with me and, inevitably, it doesn't run. It's not good for them. It's not good for me. It slows the whole process down.

The concept of reproducibility is the idea that your project can be run again somewhere else easily. This goes towards verification. It leads people to have higher confidence in the work you're doing. And it just turns out that RStudio makes it very simple to turn any folder on your computer into an RStudio project, which keeps everything together: the data, the scripts, the output. That same folder can be turned into a Git version control repository, which can be easily shared on GitHub.

A simple test: take a project that you did six months ago, move it to a different computer, and see whether you can run it simply by opening up the script and running it. Unless you created your project as a reproducible project, you probably can't. And the basic idea is that the most frequent collaborator you're ever going to work with is yourself, just separated by time.

Reproducibility in a nutshell: do everything with a script and avoid point-and-click interfaces. A point-and-click workflow is going to be really difficult to document.
It's going to be really difficult to verify or explain to somebody else. Use relative file paths for everything that you're doing, especially when you're importing and writing out data. Write your code to run in a similar environment. So if I have R and RStudio and you have R and RStudio, you'll be able to run my project repository simply by clicking on it, opening a file, and running the script. There's more about initial steps towards reproducibility by Karl Broman at this URL, which I'll briefly show here.

Let me do a really quick demonstration. By the way, in the process of all this, I'm also going to talk about literate coding. So I have a repository of an RStudio project up on GitHub, which was easy to share on GitHub because I produced it as an RStudio project. There's a project file right here. If I download this whole repository to my computer, unzip it, and then click on this project file on my computer, it'll load and it'll be able to run. Let me show you. I've already downloaded it, so I'm not going to do that again, but in my Documents folder, in Intro to R code, I'm going to scroll down until I find the hexagon icon, Intro to R code, and double-click on that. When I do, RStudio will launch this project. All of these files are the same files you just saw on GitHub.

If you want to create a new project, you can just go to the top right of the RStudio IDE and click on New Project. Now I can choose any one of these options: Version Control is a way to download a project directly from GitHub, or I can create a new project, or I can turn an existing folder into an RStudio project.

In this project, I have all of my data in a data folder. I'm going to sort this by name, so it's a little easier. There's my data folder, and there's all the data that I'm importing. So, for example, if I open up script number two, join skim eda, I can run this whole file simply by choosing Run and then Run All.
And as I do that, you'll notice that my environment starts to fill up with objects, because the script is running as we speak. So all of these visualizations and everything else were just part of a reproducible script.

Again, to start a project, go to New Project in RStudio, and then create a new file. And I recommend that you also use Rmd, or R Markdown, files to create R Notebooks, that is, literate coding notebooks. The idea behind literate coding is that you're separating prose from code, but you're integrating your analysis with your explanation. If I'm making a reproducible Rmd notebook, and I'm integrating my prose and my explanation with my analysis, that means that as my analysis changes, my reports change with it. I don't have to do a whole bunch of copying and pasting: moving data here, shifting to other programs and moving data there, generating visualizations, and then copying those visualizations into a report writer. All of those copy and move steps make my project less likely to be reproducible. So I do this stuff in an integrated fashion, and I develop a quick view of a report that I could share with somebody. As an author, you get to integrate your analysis with your natural language explanation, and that by itself makes it more reproducible.

A couple of other things to bring to light. If you're using an RStudio project, you no longer have to use absolute file paths, which means you no longer have to use setwd(). Using setwd() is not a good idea, because it requires file paths that are idiosyncratic to your computer's file system, which makes your work hard to share with somebody else, which makes it less reproducible. So you can use relative paths. In this case, we already saw that I had some data right there, so I could load more data.
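A minimal sketch of the relative-path idea just described, under some loud assumptions: the project folder, the file name `new_cars.csv`, and its columns are all made up for illustration, and base R's `read.csv()` stands in for the `read_csv()` call used in the video so that the sketch runs without extra packages. The one `setwd()` call only imitates what RStudio does automatically when you open a project file; it is not something you would put in your own script.

```r
# Hypothetical stand-in for a project folder. In practice this is the folder
# holding your .Rproj file, and RStudio makes it the working directory for you.
project_dir <- file.path(tempdir(), "demo-project")
dir.create(file.path(project_dir, "data"), recursive = TRUE, showWarnings = FALSE)

# Write a tiny made-up data file so the example is self-contained.
write.csv(
  data.frame(speed = c(4, 7, 8), dist = c(2, 4, 16)),
  file.path(project_dir, "data", "new_cars.csv"),
  row.names = FALSE
)

# Imitate opening the project (RStudio does this step for you).
old_wd <- setwd(project_dir)

# The reproducible part: a path relative to the project folder. It contains
# nothing specific to one computer, so it works wherever the project is copied.
new_cars <- read.csv("data/new_cars.csv")
new_cars

# Put the working directory back the way we found it.
setwd(old_wd)
```

Because the path starts at the project folder rather than at a drive or a home directory, the same line works unchanged on a collaborator's machine.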
So new_cars, as an object, gets its value from read_csv. I put in double quotes, hit my Tab key, and start typing data, because I've got tab completion, and I scroll through until I find the data file that I want. Then I type new_cars, because that's the name of the object I just created, so that I can see this object. And here I have a data set, much like the one that produced this output, so if I want to create a nice visualization, I start with new_cars and then make a scatter plot.

So avoid using setwd(). It's not reproducible. Avoid using rm(list = ls()). A lot of people will tell you that it clears out your environment. It does a lot of clearing out, but it doesn't clear out everything. An easier thing to do, particularly if you're making your code reproducible with an R Notebook, is to choose Run, then Restart R and Run All Chunks. That will cause the script to run itself and reproduce everything, and then you can update it.

Here's a follow-up document by these authors on reproducible research, and here is an example based on their book. It's a free book. You can find it right here. In their book they recommend a basic outline for how you might want to store your data and what kinds of folders you might want to have. As long as you're using relative paths, that's going to work out really well on every computer, and people will come to expect that there are things like a data folder and an output folder, and then it all just works.

So that, in a nutshell, is an RStudio project and reproducibility. Because I've made it an RStudio project, it's also a Git repository, which makes it easy to share on GitHub or any of the other social coding sites.
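Returning to the earlier point about rm(list = ls()): the short sketch below illustrates why that idiom does not give you a clean session. It uses only base R, the object name is made up, and the `tools` package was picked simply because it ships with R. Treat it as an illustration of the idea rather than a complete list of what survives.

```r
# Create three different kinds of session state.
my_numbers <- 1:10      # an object in the current environment
options(digits = 3)     # a changed global option (R's default is 7)
library(tools)          # an attached package that ships with R

# The "clear everything" idiom only removes objects.
rm(list = ls())

# Check what is left afterwards.
object_gone <- !exists("my_numbers", inherits = FALSE)
option_survived <- getOption("digits")
package_still_attached <- "package:tools" %in% search()

object_gone              # TRUE: the object was removed
option_survived          # 3: the changed option is still in effect
package_still_attached   # TRUE: the package is still attached

# Tidy up the option changed for this demonstration.
options(digits = 7)
```

Restarting R and re-running the whole notebook resets all three kinds of state at once, which is why it is the more trustworthy check.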