If you watched the last episode, or if you happened to watch the long project I ran last year where I wrote an entire paper, you know I have fairly strong opinions about project organization and its value in organizing your code, but also in helping to keep things reproducible, because you know where things are and where things live. Well, that doesn't come free. For most people it is a very different model from what they're used to working with, right? Most people are used to having their R code right next to their input data and right next to their output data. Everything lives together happily in one big jumbled mess of a directory. When you start putting things into separate directories, you have to start thinking about paths and how the locations of those files relate to each other. That's exactly what we're going to talk about in today's episode. I'll talk about some good practices, some bad practices, why those bad practices really get in the way of reproducibility, and how there is a function in R that many, many people use that is destroying the reproducibility of their analysis.

So I am in RStudio. I actually launched it a little differently than I normally do, and I'll get into that in a moment. I have RStudio in my Dock, and I clicked on that R icon to launch it. Looking at this, it might not look that different from what you normally see when you watch my videos, but there are a couple of subtle differences. First, you'll notice in the lower right corner that this is my home directory. This isn't my project directory. This isn't the distances directory that I've been working out of for the past few episodes. The other thing you might notice is that in the upper left corner I have a tilde forward slash, ~/.
On a Mac or Linux operating system, that tilde slash indicates that we are in my home directory, which again is what we see in the bottom right corner. In fact, if I run getwd(), I see that I am in /Users/pschloss. Any time the first character of a path is a forward slash, know that this is an absolute path. It goes back to the root of the computer's file structure. So /Users/pschloss is the expansion of the tilde forward slash. Again, if I run list.files(), which we talked about in the last episode, you can see all of the files and directories that are in my home directory. This mirrors what I have in the Files tab of RStudio.

But I don't want to be here. I want to be in the distances directory on my desktop. How can we get there? Well, we could use a function called setwd(). We could call setwd() and give it the path to my distances directory, so I could say Desktop/distances. Now if we look at the upper left corner of my console, you'll see ~/Desktop/distances. So setwd() is a function that is a bit convenient, right? It allows us to set our working directory. An alternative way to do that is to use the navigation in the Files tab of RStudio. For example, I could go to Documents, then to manuscripts, then to the Schloss rrn analysis directory, which was the paper we wrote last year together, and then choose More, Set As Working Directory. What we see in the console is that setwd() is the syntax RStudio itself uses to get me into the proper working directory.

Let's go back to our home directory and go to Desktop. What I want is the distances directory. One of the things you'll notice I have in here is distances.Rproj. That is not an R file, it is an RStudio file, and it works well with RStudio.
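
The ideas so far can be tried in a few lines of base R. This is only a sketch: the output depends on your machine, and the helper is_absolute() is a made-up name for illustration, not something from the episode. It only covers the macOS and Linux convention described above.

```r
# Where is R currently working? The answer differs from machine to machine.
getwd()

# "~" is shorthand for the home directory; path.expand() shows what it stands for.
path.expand("~")

# Hypothetical helper (not from the episode): on macOS and Linux, a path that
# starts with "/" is absolute, meaning it is anchored at the root of the
# computer's file structure.
is_absolute <- function(path) {
  startsWith(path.expand(path), "/")
}

is_absolute("/Users/pschloss/Desktop/distances")  # TRUE
is_absolute("~/Desktop/distances")                # TRUE on macOS/Linux once "~" expands
is_absolute("Desktop/distances")                  # FALSE: relative to the working directory
```

The last call is the interesting one: a relative path only means something once you know the working directory it is resolved against.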
There are also other packages, and functions in those packages, that look for this .Rproj file to do different types of analyses. If, instead of starting RStudio by clicking on the R in my Dock, I had double-clicked on this .Rproj icon, it would have automatically opened RStudio with the proper working directory. I wouldn't have to worry about navigating around or getting into the right place after starting RStudio. So I'm going to go ahead and launch the project. I could do that by quitting RStudio and double-clicking on distances.Rproj back in the Finder window, or I could click on distances.Rproj in my Files tab and say yes, I want to confirm Open Project. This switches things around, and you see that my working directory is Desktop/distances off of my home directory. I can run getwd() to see the path and list.files() to see the contents of this directory.

I'm going to create a new R script. I'll save it, put it into code, and call it analysis.R. So I now have analysis.R. Again, this lives within the code directory. I don't see it in my project root directory; it is within code, living right next to read_matrix.R. What I'd like to be able to do in analysis.R is load the function that I have in my read_matrix.R script, run that function, read_matrix(), on a distance matrix from data, and then perhaps down the road create a figure that we'll output to results, right? So we've got a few different things going on.

What some people would tell you to do is run setwd() and change to code. Now we are in the code directory. If I run list.files(), and again, if I put the period forward slash in quotes, "./", that is the current directory, I get analysis.R and read_matrix.R. What I could then do is source read_matrix.R.
What source() does is load the R script and run it, right? So now if I look up in my environment, I have read_matrix as a function loaded into my R session. What I could do now is call read_matrix(), and I need my Bray-Curtis distance file. But again, I'm in code, so I need to know the path to get back to my data directory. This is where it pays to know a little bit about how to access different things using a relative path. I mentioned before that an absolute path starts with a forward slash. Well, a relative path doesn't start with a forward slash. The way to go up the directory structure is with two periods, right? If I type period period, that goes up to my project root directory, and then with a forward slash I can go back down into data. Then I can start typing simple, and I see the simple Bray-Curtis dist file. So now I will read that in, and when I run it, everything works, right? I have loaded in the distance matrix.

In this case, what I've done is change my working directory to code so that I can run the source() function on read_matrix.R, use read_matrix(), and get the data from the data directory. Right. So we've kind of saved one thing, having to know the path to read_matrix.R. But it has cost us something, in terms of having this goofy period period notation. We're really expecting someone who reads this script to have a pretty good understanding of the overall directory structure of our project.

One other challenge is what happens if I hadn't used the relative path and had used the absolute path here instead. Let me go ahead and comment this out so we can save it and see what it looked like. Changing this path to an absolute path would have us start with /Users/pschloss/Desktop/distances. We could run that, and it gives us the same output.
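
Here is a rough sketch of the setwd() approach just described, so you can see the dot dot path in action. It builds a throwaway project in a temporary directory rather than touching anything real. The file names (read_matrix.R, simple.dist) and the one-line read_matrix() function are stand-ins I am assuming for illustration; the real function from the earlier episodes does much more.

```r
# Build a throwaway project in a temporary directory so nothing real is touched.
# File names and the tiny read_matrix() below are stand-ins, not the real ones.
project <- file.path(tempdir(), "distances_setwd_demo")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)
writeLines("read_matrix <- function(path) readLines(path)",
           file.path(project, "code", "read_matrix.R"))
writeLines("placeholder distances", file.path(project, "data", "simple.dist"))

# Start in the project root; setwd() returns the previous working directory.
old_wd <- setwd(project)

# The setwd() approach: move into code/ so the script can be sourced by name alone...
setwd("code")
source("read_matrix.R")

# ...but now the data file has to be reached by climbing back up with "..".
via_dotdot <- read_matrix("../data/simple.dist")

# The same relative path stops working as soon as the working directory changes.
setwd("..")
file.exists("../data/simple.dist")  # FALSE from the project root
file.exists("data/simple.dist")     # TRUE from the project root

# Restore the original working directory.
setwd(old_wd)
```

The two file.exists() calls show the cost: whether a relative path is right depends entirely on which setwd() call ran last.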
There are a few problems with this absolute path. First of all, you don't have /Users/pschloss. Unless you're my son, you're not going to have pschloss off of Users, because you've got a different name and a different account ID on your computer. If you're using Windows, then you definitely don't have this structure. Even if I'm doing this up on our high performance computer, it's not going to have this absolute path. That's why a relative path is nice. Even with the limitations of that dot dot, and of expecting people to know how the different directories and files relate to each other when they come into your code, the relative path is superior to the absolute path for reproducibility. So when you're doing data analysis and you want to be reproducible, use relative paths, not absolute paths.

An additional challenge with setwd() comes up if I have a pretty complicated script where I'm moving in and out of different directories. We've done this before with meta-analyses, where perhaps we get data from 10 different projects and we want to analyze the data separately and then pull it together, right? I can imagine using setwd(), I mean, I wouldn't, but I could imagine someone using it to go into each of those 10 different directories, do an analysis, and then pull it all together. So we might be calling setwd() again and again between those 10 directories, as well as a directory for the synthesis. Well, let's say that the script crashes at some point. I then have to figure out exactly where the script was when it gagged, right? If I run setwd() into code, and then run setwd() with dot dot, now I'm in my project root directory, right, Desktop/distances.
If I keep doing this cd, or change directory, into one directory and back out to another, back and forth, it can get really confusing as you try to trace where exactly you are in the project. Again, that is why setwd() is a real problem for reproducibility. For my purposes, the main problem with using setwd() is this issue of trying to figure out where you are as you move through the script. If things gag, you have to figure out where you are. The alternative, I think, is just so much easier. It requires a little bit of cognitive burden up front, but once you learn it, you'll question why you ever wanted to use setwd() in the first place.

So what would we do instead? Well, again, I'm assuming that I'm working in my project directory. I'm here in distances, right? If I want to source, or again, load all the contents of read_matrix.R, I'll use source(). But I'm in my project root directory, and I know that read_matrix.R is in code, so I'll say code/read_matrix.R. I load that, and it runs A-OK. I see that I've got read_matrix loaded over in my Environment tab. I can then call read_matrix(), and I need to give it the relative path to my distance file. It's going to be relative to where I currently am. So where am I currently? I'm in my project root directory. Looking back at this absolute path, or even this relative path, that we had up here on lines four and five, all I need is what I have here: data, forward slash, my simple Bray-Curtis dist file. I can run that, and sure enough, it outputs the contents of the distance matrix that we read in and formatted to be a square matrix with these nice labels and everything, right? So what I do to make things reproducible is keep all my code, all my data, and all my output in the same project directory structure.
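
The project-root version can be sketched the same way. Again, this uses a throwaway project in a temporary directory, and the file names and the one-line read_matrix() are assumed stand-ins. The single setwd() pair at the top and bottom is only scaffolding that plays the role of opening the .Rproj file; the analysis lines in the middle never change directory.

```r
# Throwaway project with stand-in file names (assumptions, not the real files).
project <- file.path(tempdir(), "distances_root_demo")
dir.create(file.path(project, "code"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(project, "data"), showWarnings = FALSE)
writeLines("read_matrix <- function(path) readLines(path)",
           file.path(project, "code", "read_matrix.R"))
writeLines("placeholder distances", file.path(project, "data", "simple.dist"))

# Scaffolding only: this stands in for launching RStudio from the .Rproj file,
# which would put the working directory at the project root for you.
old_wd <- setwd(project)

# The analysis itself: every path is relative to the project root,
# and there is no setwd() and no ".." anywhere.
source("code/read_matrix.R")
distances <- read_matrix("data/simple.dist")
distances

# Scaffolding only: put the working directory back where it was.
setwd(old_wd)
```

Reading just the two middle lines, someone new to the project can tell that the function comes from code and the input comes from data, without having to track where the script has wandered.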
Again, we talked about this in the last episode. That way, everything can be done relative to the root of the project directory. What I define as the root of the project directory is, again, wherever that distances.Rproj file, or whatever you call your .Rproj file, lives. That is the root of my project. So I always assume that every bit of code is going to be run from the same directory as that .Rproj file. This really enhances the reproducibility of your analysis. It doesn't expect someone reading the code to know the structure of your project as intimately as you do. If I'm in the project root directory and I'm reading this code, I see, oh, they're sourcing read_matrix.R from code, the read_matrix function is probably being loaded from that read_matrix.R script, and they're passing it a file from the data directory, right?

So I'm going to go ahead and delete all this, and we can save that. This is a much more reproducible version of the code than using setwd() to move back and forth between different directories, and it just makes things so much easier to read. I do acknowledge there's a bit of cognitive overhead to understanding how these relative paths work. But what I find is that if you are working from one location and branching out to different parts of the tree within your project directory structure, things are a lot easier than being in, say, one directory and needing to worry about how many double dots you need to go up that tree structure to get back to a common directory with some other directory you now need to access. If you do everything from the project root directory, everything just works so much nicer. Hopefully this clarifies how you can use a project organizational structure when running your code to enhance the reproducibility of your analysis.
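
Since everything rests on the assumption that code is run from the directory holding the .Rproj file, one option is to check that assumption at the top of a script. This is a sketch in base R, not something shown in the episode; check_project_root() is a hypothetical helper name, and the demo directory is a throwaway.

```r
# Hypothetical helper (not from the episode): stop early if the directory does
# not look like the project root, judged by the presence of an .Rproj file.
check_project_root <- function(dir = getwd()) {
  rproj <- list.files(dir, pattern = "\\.Rproj$", ignore.case = TRUE)
  if (length(rproj) == 0) {
    stop("No .Rproj file found in ", dir,
         "; open the project or start R from the project root.")
  }
  invisible(rproj)
}

# Demonstration in a throwaway directory with a stand-in project file.
demo_root <- file.path(tempdir(), "root_check_demo")
dir.create(file.path(demo_root, "code"), recursive = TRUE, showWarnings = FALSE)
invisible(file.create(file.path(demo_root, "distances.Rproj")))

# Passes silently: the .Rproj file is here.
check_project_root(demo_root)

# Fails: code/ is inside the project but is not its root.
result <- try(check_project_root(file.path(demo_root, "code")), silent = TRUE)
inherits(result, "try-error")  # TRUE
```

Calling check_project_root() with no arguments at the top of analysis.R would give a clear error message instead of a confusing file-not-found later on.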
Again, I acknowledge the project is very simple at this point, and perhaps the overhead of the organizational structure I have for such a simple project really isn't warranted. But it's really helpful to get these fundamental practices under your belt with simple cases before you scale up to something like a paper or your thesis or, you know, some much larger project. After watching this video, I'd encourage you to go look at your R scripts and see if you have any cases of running setwd(). And honestly, ask yourself, do I really need setwd() here? What would happen if I gave this script to someone else? What would happen if I moved this directory to some other location on my computer? Would it still work? Do I have scripts that are expecting data to be in far-flung locations on my computer? What can you do to bring it all together and run all your code from a common root directory, like we have in this distances directory? Anyway, give that a shot. Let me know how it goes. And we'll see you next time for another episode of Code Club.