Welcome back to the Riffomonas Reproducible Research Tutorial series. I hope you were able to make it through the last tutorial, in which we discussed how to use a single bash script to automate our analyses. It took a little effort, but it was great to see how we could accidentally burn our project to the ground and, with the help of git and bash, rebuild it with minimal further input from us. In that tutorial, we used mothur to process the raw sequence data to generate the files that we'll be using in this and future lessons. If you've never used mothur before, you should definitely check it out. Other tools for analyzing 16S rRNA gene sequence data are out there, but they tend to require a lot of other dependencies. These dependencies can be really difficult to track down, such that the situation is frequently called dependency hell by those trying to reproduce someone else's analysis. If you go to the mothur GitHub repository, you can get any version of mothur you want, and it's a self-contained executable. Similarly, our wiki has all past versions of the SILVA, RDP, and Greengenes databases that work well with mothur. We've really gone out of our way to make mothur a great tool for encouraging reproducible analysis. In addition, in mothur, we have a command called make.sra, which has been a game changer in terms of helping people post their sequence data to NCBI's Sequence Read Archive, also called the SRA. This is the place your sequence data should be deposited before you submit your manuscript. In today's tutorial, we'll be discussing some of the best practices for analyzing your processed data using a scripting language. Here you have various options for tools that you can use. Many people will use R or Python. I'll be discussing R, although, again, I don't expect you to know much about R for this tutorial. Now join me in opening the slides for today's tutorial, which you can find with the Reproducible Research Tutorial series at the riffomonas.org website.
So before we get going and talking about scripting our analyses, let's remember what we talked about in the previous tutorial. Take a couple moments and think to yourself; maybe jot your thoughts down on a piece of paper or type them into a Word document. What are some of the limitations of using tools with a graphical user interface, or GUI, as they're called, for doing reproducible analysis? On the other hand, can you also think of some limitations of using bash scripts for doing reproducible data analysis? A couple thoughts come to mind when thinking about the limitations of both GUIs and bash scripts. For GUIs, it can often be difficult to document all the mouse movements and toggles that need to be checked, as well as what you're entering into the formula bar, things like that. It's also difficult to maintain the work under version control, and so if you wanted to see where a bug was introduced, that's going to be really difficult. As we talked about in a previous tutorial, Git doesn't work very well with file formats like XLSX or DOCX; it really excels at working with text files. What are some of the limitations of using bash scripts? Although I think bash scripts are the way to go, I'm willing to admit that there are certainly limitations to using them. First of all, there's a learning curve. You have to know bash. You have to know your way around the command line. In addition, the code can also be pretty cryptic. Most people know how to use Excel. One of the beauties of Excel, and why it's so popular, is that it's so easy to use that one of my kids who'd never seen a spreadsheet before could pick it up and figure out how it works. If I gave them bash code to read, it would be harder for them to figure out what's going on. And I think the same is true for anybody that's coming to replicate our analyses.
And so again, while the analysis might be reproducible, because we can run bash analysis_driver.bash and recreate our project like we did at the end of the last tutorial, if somebody dug down and looked inside that file, they might be hard pressed to tell you exactly what's going on with every line. There are ways around that: we can think about how we comment our code, we can educate people better, things like that, to improve the transparency of what's going on in those bash scripts. So although there are strengths and weaknesses to both approaches, I'm going to tell you that the best option for reproducibility is using a bash script. Although, hopefully, you also appreciate that one of the limitations of using a bash script for driving our analysis is that it's kind of dumb, right? There's no intelligence baked into it to tell Linux where you're at in the workflow. And so when we re-ran the bash script, it started from the very beginning again, even though some of those reference files had already been downloaded. What we'll talk about in a couple tutorials is a tool called Make, which is similar in idea to our bash script but has built-in intelligence to figure out where we're at in our dependency tree: what things have already been downloaded, what things have already been processed, and so where we're ready to pick up in the analysis. In today's tutorial, we're going to talk more about scripting our analyses. Whereas before we used bash to script pulling together elements from different software tools, in this tutorial we're going to look at a scripting language such as R or Python. We're going to develop a set of best practices that you can use for scripting your data analysis. We're going to define and apply one of those called the DRY principle, which is short for "don't repeat yourself."
We'll apply tools that will make our analyses more reproducible when using random data. We've already seen this a bit when using mothur, but I didn't tell you about it. And we'll also use some tools to protect ourselves and others from possibly encountering weird data, the idea of defensive programming. So to get you thinking about where we're going here, I want you to think about this case study, where you've perhaps generated a visualization of an ordination diagram for your latest microbiome analysis. And you've done it using a point-and-click package like Microsoft Excel or Prism. You've mimicked the boring shape and coloring scheme that was in the original Kozich paper; I think it was black and white circles, pretty nondescript. You show it to your PI, who gets really excited. PIs tend to do this; I know I do. And so the PI starts asking you questions. Can you change the color and the shape of the plotting symbols? Is there a way to plot this in 3D using three axes rather than in 2D with two axes? This is temporal data, so can we plot it in 3D where one of the dimensions is time? Could we perhaps animate this over time and make a movie of how the murine microbiota was changing? These are just a handful of questions that somebody might come up with. And pretty quickly you might think, well, yeah, those would be awesome, but I wouldn't know where to begin. Part of that is because tools like Excel or Prism lock you into doing your analysis their way. The other drawbacks of point-and-click methods like Excel or Prism are that they tend to be expensive. As a trainee, you might not realize this because someone else is paying for it, but these tools are really expensive. Their development is also centralized. There is an Excel way to make a plot. There is a Prism way to make a plot.
And if somebody comes along with a new way to visualize data, or you see something that you want to incorporate into your analysis, well, if the folks at Microsoft don't incorporate that into Excel, you're kind of out of luck. It also can't be automated. How difficult would it be to change the plotting symbols of your figure? In Excel, you're going to have to go back into Excel, click on the points, and then go somewhere to change the settings. It's not super easy, and then you're going to have to export that into a new document. And perhaps you then do some type of staging using Illustrator, which is another really expensive tool. All of that, because it's not automated, lowers the likelihood that you could reproduce it. It's also not flexible; everything's kind of in their format, their way of doing things. In contrast, if you were to use an open source tool like R or, say, Python, these all tend to be open source, which means that they're free and their development is decentralized: anybody can contribute code. You can look under the hood and see how it makes a plot. You can modify that code. You can add code to change how plotting is done. These are things that are not possible with something like Excel or Prism. They're also automatable, right? We can run R or Python from the command line, and so I could change a line of my code and, voila, that will change the plot without me having to really muck around too much with formatting or things like that. It's also very flexible, and you might get frustrated because it's perhaps overly flexible: there might be hundreds of parameter settings, maybe not hundreds, but a dozen parameter settings for things that you can change to alter the way that your plot looks.
It's also extendable: you could take a simple plot in R and turn it into a GIF to add animation, or you could make an MP4 video of a plot. You could incorporate HTML into it so that it becomes interactive. These are all the beauties of open source tools: there are large numbers of data scientists out there, people like you and me, who are contributing code and making packages to make the language better and customizable to our needs. So again, if we wanted to change an aesthetic like the plotting symbol or the plotting color, we can change the option value and rerun the script. We could incorporate this into our analysis_driver.bash file. It's pretty straightforward. We also have access to the entire color palette. I don't know about you, but when I go into Excel or any of the Microsoft software and I'm trying to pick a color, sometimes I can't remember what exact shade of blue I used because I'm trying to pick it out of their fairly categorical color palette. If I can give a specific code for the specific color I want, then that's what I get. And that's possible with a scripting tool like R or Python. But people have taken that simple hexadecimal color palette and riffed off of it. They've done really awesome, fun things. One of those is the wesanderson palettes, inspired by movies like The Royal Tenenbaums, The Darjeeling Limited, or any of the other Wes Anderson movies. My lab's also made one that corresponds to sports teams. So if I make a plot with blue and red, I can use the Chicago Cubs' version of blue and red, or the University of Michigan's maize and blue. It's really adaptable. If you're a Beyoncé fan, there are color palettes inspired by various pictures of Beyoncé and the clothes that she's wearing.
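To make this concrete, here's a minimal sketch of changing aesthetics programmatically. The data and variable names here are made up for illustration; they aren't from the analysis in the videos.

```r
# Hypothetical ordination-style plot; the data are random for illustration.
set.seed(19760620)
axis1 <- rnorm(20)
axis2 <- rnorm(20)

# "dodgerblue" could be any named R color or a hex code like "#00274C";
# pch = 19 gives filled circles, pch = 17 gives filled triangles.
plot(axis1, axis2, col = "dodgerblue", pch = 19,
     xlab = "Axis 1", ylab = "Axis 2")
```

Swap the col or pch value and rerun the script, and the figure regenerates; no clicking through menus required.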
And this, again, seems a bit silly, but it's fun, it shows you good colors that work well together, and it helps make our plots look a little more attractive than basic red, green, blue, and black color schemes. People have also made packages in R to change plotting symbols. You can make a "catterplot" where, instead of using circles or squares or diamonds or what have you, you put cats in different poses into your plots. And people are doing this with other types of symbols too. Again, this is silly; I don't know that you would make a catterplot for your next manuscript, but it shows you the flexibility: you can customize the plotting symbols and the colors to do anything you want. Go try to make a catterplot in Excel. Good luck. It also allows you to think differently. There are numerous packages in R, Python, and other languages for data visualization to do things like interactive plots and analysis, again incorporating HTML and JavaScript to make your plots interactive. These are all generated by developers around the world in myriad fields, from, say, economics to microbiology, who had a need, solved the problem, and then made their solution available for others to use or improve upon. This is the beauty of reproducible research: they made their methods available for others to then riff upon. One of the packages I like is called rgl, which, I don't actually know what rgl stands for, but it allows you to generate 3D visualizations. And if you were to open this in R, you could put your mouse on the image and spin it around any way you want. Some people get excited about making 3D plots in a 2D medium, which I think is totally worthless. But here what I have is a GIF of the visualization from the Kozich data analysis, where I've got the red and blue balls, and I have it animated to spin the ordination around.
And so through this GIF, you now see it in three dimensions, which is pretty cool. We're not at the point where we're ready to submit this type of figure for a manuscript, because most of our manuscripts are too tied to a PDF-type format. But for a presentation, this type of GIF and this type of visualization could be very attractive and allow people to see things in three dimensions. Another example is the gganimate package, which builds upon a series of plotting tools called ggplot2. Here is data from the famous gapminder dataset, looking at the life expectancy of different countries by their GDP per capita. Each dot represents a different country, the color corresponds to the continent they're on, and the size of the point indicates the population. And it's animating over five-year increments. So again, this is a really cool way, perhaps not perfect for every type of analysis, but a really cool way to represent data. Instead of seeing images as static like they are, say, in Excel, we can now animate these images to get a better representation of, say, temporal patterns. This plot has got one, two, three, four, five different variables going on in a fairly simple and attractive way. Now, that's fancy and all, but it's eye candy. That's a nice perk of R or Python, but it's eye candy. Don't lose track of the fact that what makes an analysis more reproducible is that we can programmatically generate these plots. With several lines of code, we're able to generate those very complicated plots, and if we wanted to change something, we could alter a line of code, rerun the code, and get a new version of the plot. That's the advantage of scripting languages. The eye candy is an extra benefit. So there's a difference between what I will call interactive versus programmatic analysis. You can run R and other scripting languages in an interactive mode, where you go into R and write commands at the console.
And this is similar to Excel, where you go into Excel, start manipulating things, and generate a plot. Using R interactively has many of the same limitations as using Excel, because unless you have a way to document the code you're entering at that command prompt within R, it's really no better or different than using Excel. The difference between an interactive analysis and a programmatic one is that a programmatic one uses a script to store the commands for repeated and automated use. The big difference I want to drive home is that if you're using R or Python to make plots, it's critical that you record those commands in some type of script file that we'll then be able to use as part of our analysis driver file. We've already made a couple of scripts using bash and mothur's syntax. Scripts are text-based files for automating analyses. They're a way to tell the computer, so to speak, how to make a paper airplane, or something more useful. There are many formal languages out there for programming. We've already used bash; that's a type of language. I've mentioned R and Python. There are others like Perl, Julia, Go, Java, C, C++, Fortran, and Pascal. These all have varying usages and varying popularity across the field. I don't know all of these. So which language should you choose? I don't know! But I'll help you narrow the list. I would tell you to choose between R and Python, okay? I know R pretty well. I can read Python, but I can't code in Python. I know, maybe that's a disappointment to you. I really like R, and I find it does everything I want. The others are significantly harder to use, not widely used, or not fully developed yet. If you go onto social media, you'll see people getting excited about the latest, greatest programming language; resist the temptation to go run and play with that.
I think it's best to pick one language to learn really well and get the most out of it that you can. Once you know that language pretty well, then go pick up a different one. I would encourage you to start asking questions as you're picking a language to use. What do your lab mates use? The people around you, your community, those other people on campus, what do they use? As a PI with various people in the lab who have different skill sets, it's really important and valuable to me for everybody to know the same language. My lab is an R lab, and everybody in the lab uses R. The advantage is that if you need to solve a problem or you need someone to check your code, you know who to ask. People in my lab, if they run into a problem, know that I know R and that their lab mates know R, so they can go to those people and get help. If you're the only person on your campus using some new programming language, well, where are you going to get help? Who's going to check your code? Who's going to help you grow as a programmer? So I would encourage you to find your community and to nourish it, to find those around you that are doing similar types of analysis, figure out what they're doing, and go with that. That is really the number one critical factor in picking a programming language. The other thing I would tell you is to beware of programmers and so-called language wars. If you do a Google search for, say, R versus Python, you'll get numerous blog posts that are just flame wars about why one language is better than the other. Those are horrible. I hate those blog posts because they're not productive. And usually the points on which they differentiate R and Python are so nuanced that a beginner isn't going to benefit from them. Yeah, I know there are problems with R, but you know what?
I know how to work around them, and I can work around them a lot more easily than, say, learning Python or a different language. So pick a language where you've got a great community around you and go with that. The other thing about programmers is that some of them are bros, typically guys, who are online and very happy to tell you how smart they are and why you're doing it wrong. Be aware of that, but also be aware that that's not everybody in the community. There are programmers like that in the R community and in the Python community, but they're really not the community. I'm sorry if you run into those people, but no, that's not most of us who are using R or Python. So like I said, in my lab and at my institution, most of us are using R. There are people here using Python, and that's great. We learn a lot from them, but for the lingua franca in my lab, we use R. I really like the broader R community: a lot of great people, people that have helped me out as I've grown as a data analyst. I'm not going to try to teach you R here. There are other resources available for doing that. One I would point you to is my minimalR tutorial, which is also at the riffomonas.org website. Pretty much anything I'll say for R is also true for Python. Again, there are nooks and crannies where there are differences, but those frequently aren't super relevant for people getting going in programming. My goal today is to demonstrate how we can programmatically analyze data. The tooling that we use to do that isn't super important. So there are some general principles that I want to go over with you in this tutorial. Some of these are specific to R, but most of them, like I said, are pretty general. We're going to talk about worrying about the outcome, not the path; don't repeat yourself; defensive programming and testing; setting the random number generator seed; package versions; and then some R quirks to avoid.
So the number one thing that I tell people is that the beauty of a language like R is that there are many ways to achieve the same goal. That's also perhaps the biggest problem for many people: there is no one way to do something. So I would tell you, don't worry so much about whether you wrote the most efficient program. You might write an analysis that takes 100 lines and I might write it in one, but if we get the same answer in the end, that's what we're after. You can go back later and improve and tighten up the code. Your 100 lines of code might be a lot better documented; mine might not be so well documented, and it might be too clever by half. At the same time, your 100 lines of code are going to be a lot more difficult to maintain than, say, my one line. If your analysis is taking a long time to run, then worry about efficiency. There are a lot of tools and tricks out there for speeding up your R code. And again, beware of the language wars, where people delight in telling you how much faster their language is than another. What I find is that in a lot of those language-war fights over speed, they're typically cooking the analysis to make their language look a lot better than the other. So beware. The next principle is don't repeat yourself. We call this the DRY principle. You might have lines of code that you use multiple times in a script or across scripts. The DRY principle says that you take those lines that you're repeating and put them together in a function, or use them to define a variable. Then, wherever that code appears, you replace it with a call to the function or to the variable. That way, instead of maintaining, say, five copies of the code, you now only need to maintain one. And this significantly limits the headaches and difficulty of maintaining your code.
Some examples of where I see this come up frequently are in defining the color and symbol schemes in plots; I might like to have simple variables to define my colors or my plotting symbols. Also, having utility functions like this one here for calculating the Shannon index. It's a pretty simple function, I'll admit, but if I had to calculate the Shannon index three or four times across the data analysis, it sure would be nice to only have to worry about one version of it in case bugs crop up in my code. And sure enough, when I use this function, I find an error. So here's my function, calc_shannon. I've defined some OTU counts. I then plug these into calc_shannon, and I get a vector of numbers out, when the Shannon calculation should only give me one number: the diversity. So if I look at my code, can you see where there's a bug? Ah, I see it here. I have negative one times the relative abundance times the sum of the log relative abundance. That relative abundance term is supposed to be inside the summation. It's supposed to be negative one times the sum of the relative abundance times the log of the relative abundance, okay? So again, if I had this repeated five times across my code, I'd have to be sure I updated it in every place. Instead, if I have calc_shannon, I only need to change it in one place. That's what I've done here, and now I have the correct value coming out of calc_shannon. So to correct the bug in the calculation of this index, or to make it more efficient, say, I only need to change these lines of code, not everywhere they're repeated. And I have fallen into the trap of repeating my code and finding that when I go back and fix a bug, I don't always replace every instance of that bug. So really keeping the DRY principle of don't repeat yourself in mind will make your life so much easier.
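The corrected function described above might look something like this. The slide's exact code isn't reproduced in this transcript, so the variable names here are illustrative.

```r
# Sketch of a Shannon diversity utility; names are illustrative.
calc_shannon <- function(otu_counts) {
  rel_abund <- otu_counts / sum(otu_counts)

  # Buggy version: rel_abund sat outside the sum, so a vector came back:
  #   -1 * rel_abund * sum(log(rel_abund))
  # Corrected version: the relative abundance belongs inside the summation.
  -1 * sum(rel_abund * log(rel_abund))
}

otu_counts <- c(10, 20, 30, 40)
calc_shannon(otu_counts)  # a single number, ~1.28
```

Because the calculation lives in one function, fixing the bug once fixes it everywhere the function is called.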
As your analysis grows, assume that you've got several figures from this murine microbiome study and you want to use the same coloring and symbol scheme for each figure. To keep things compartmentalized, you put the code for each figure in a different file. I've already shown that: when we looked at my GitHub repository, I had files in my code directory called generate_figure_1, generate_figure_2, generate_figure_3, and so on, right? So I might have five scripts for generating figures, and three or four of those figures might use the same coloring scheme because I'm looking at, say, early and late time points in different ways. But I'd like to use the same coloring scheme across all my plots so that when a reader looks at them, they quickly know what's an early and what's a late time point. So I would have to define the figure colors and figure plotting characters (pch) in each R script. That is not DRY. Alternatively, I could create a file in my code directory called utilities.R. This utilities file would contain things that I'm going to use across multiple scripts, like the figure colors or figure plotting characters. Then, at the top of each of my five figure-generating scripts, I could put the command source("code/utilities.R"). That way, the figure colors and plotting characters would be brought into each script, and it would have that color scheme. It really is a useful practice, I think, to put these source commands and any library loads at the top of the script. The advantage of this is that your PI might say red and blue are a bit too bright; why don't we use red and dodger blue? And so instead of having to change blue to dodger blue in each of your five R scripts, you change it here, and then all five R scripts that source this file will now be using dodger blue instead of the brighter blue color.
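Such a utilities.R file might look like the following. The variable names and color values are hypothetical stand-ins for whatever is on the slides.

```r
# code/utilities.R -- shared aesthetics sourced by every figure script.
# Variable names and values here are illustrative.
figure_colors <- c(early = "red", late = "dodgerblue")
figure_pch    <- c(early = 19,    late = 17)
```

Each figure script would then begin with source("code/utilities.R"), so changing "dodgerblue" in this one file changes it in every figure at once.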
As you do this and get good at using R to script your analyses, your lab mates might find that they really like some of the functions you've written. They might really like how you've made some plots, or the color schemes you've used, and they want to use them. Collaborators on your papers might really like some of the functions you've written as well. Some unknown person on the other side of the world might really like something you've done, too. Don't laugh, it happens. So what do you do? Well, you can write a package. Things like ggplot2, rgl, gganimate, wesanderson, or the beyonce color palette are packages that other people have developed to share with others. There have also been some grumblings within the #rstats community that a package might be a useful way of thinking about how to disseminate a reproducible data analysis. If you think about it, a package contains data and it contains functions for working with that data. The idea would be that you could encapsulate all the data and code for a project into a package. I tend to find that my projects are a bit bigger and more complicated than a typical package, so this really hasn't worked for me. But if you click on this link about making a package out of your analysis, you might get some good ideas that inspire your own reproducible research practices. The next principle to think about is defensive programming and tests. I am my own worst enemy. I do all sorts of dumb things that I'm telling you not to do here, or I ignore things that I'm telling you to do here, and I introduce all sorts of crazy things. So it's best to be defensive, to anticipate craziness coming at you. In maintaining mothur, Sarah Westcott, who works with me, and I interact with a lot of users who do weird things that we never anticipated, and so we try to anticipate these weird things. For example, what if someone gives your function the wrong type of variable?
Say we had that Shannon function and someone gives it a string of text. Maybe it should complain. Another problem that we frequently run into is dividing by zero. When we calculated the relative abundance in that calc_shannon function, what should the function do if the denominator is zero, right? These are weird things that you might not anticipate happening. It might not be super critical for your own personal use, but as you share your code and others grab it, having defensive programming built in really helps. You can also use an R package called testthat (that's T-E-S-T-T-H-A-T) that does automated testing to make sure that as you change your code, the behavior of your functions continues to pass your tests. So again, as an example, we have our calc_shannon function. If I gave it an OTU vector of data that were all negative numbers, somehow calc_shannon would still give me a value back. That's probably because when I calculate the relative abundance, the negative on top cancels the negative on the bottom and everything becomes positive. But perhaps calc_shannon should be smart enough to know that this is nonsense data; it should know that all of our counts should be positive. So what we might do is run a test asking whether all of the OTU count vector values are greater than or equal to zero. If they are, then we run those lines of code. Otherwise, we send out a warning that says one or more of the values were less than zero, and the function returns a Shannon value of NA. You can see what the output would look like. These are defensive programming practices: testing that you're getting the right input and that your code is doing the right thing. Some tools we have, which I showed in that example with the Shannon calculator, are if blocks: if, else if, and else.
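A defensive version of the hypothetical calc_shannon function from earlier might be sketched like this; the exact checks and messages on the slide may differ.

```r
# Defensive sketch: validate the input before computing anything.
calc_shannon <- function(otu_counts) {
  if (!is.numeric(otu_counts)) {
    warning("otu_counts must be numeric")
    return(NA)
  }
  if (any(otu_counts < 0)) {
    warning("one or more of the values were less than zero")
    return(NA)
  }
  if (sum(otu_counts) == 0) {
    warning("total count is zero; relative abundances are undefined")
    return(NA)
  }
  rel_abund <- otu_counts / sum(otu_counts)
  -1 * sum(rel_abund * log(rel_abund))
}

calc_shannon(c(-10, -20, -30))  # warns and returns NA
calc_shannon(c(10, 20, 30, 40)) # a sensible diversity value, ~1.28
```

Each guard clause catches one kind of nonsense input and fails loudly instead of silently returning a plausible-looking number.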
That final else is there to catch any situations you can't anticipate. Again, users do weird things, and you are your weirdest user. I also find that the length function, as well as the is.* functions (things like is.numeric, is.character, and is.logical), are useful for testing whether your functions are getting the right data. Functions like all and any are great. There's a function called stopifnot, which will stop your code if a certain condition is not met. And as I mentioned, you can test your code with known examples, but it's better to automate those tests using the testthat package. The next thing to think about is setting a random number generator seed. A lot of the analyses we do use nonparametric statistics, where we're using a random number generator to test the significance of our observations. You might remember from our get_shared_otus.batch file that in the top line we had set.dir, and there was a seed option of 19760620. Many of the functions in mothur depend upon a random number generator. When we do our clustering, there's randomness. When we do the NMDS, there's randomness. When we do rarefaction, there's randomness. By setting a seed for the random number generator, we will still get pseudo-random numbers out of it, but they will always come out in the same order. So if we run the analysis repeatedly with the same seed, we'll get the same results over and over again. But if we don't set the seed, then we'll get variation. I tell people to pick a consistent seed. I use my birth date; I was born on June 20th of 1976. You might pick that or some other number. But don't try to hack your data by picking a seed that gives you the right result. That's bad. Pick a number, stick with it, and use it across all your projects. Your results should really be insensitive to the seed that you pick.
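In R, the seed behavior looks like this (19760620 is the June 20, 1976 birth-date seed mentioned above):

```r
runif(5)            # five pseudo-random numbers between 0 and 1
runif(5)            # five different numbers

set.seed(19760620)  # a consistent seed, here a birth date
runif(5)            # a reproducible set of five numbers

set.seed(19760620)
runif(5)            # exactly the same five numbers again
```

Calling set.seed immediately before the random draws is what makes them repeatable; without it, every run produces different values.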
And so again, here's an example in R using the runif command. runif generates random numbers between zero and one, and if you give runif the number five, it'll generate five numbers. So if we run runif(5) three times, we'll get 15 different random numbers. But if I set a seed and then run runif(5), I get five random numbers, and if I set the seed and run it again, I get the same five random numbers. So this makes the analysis much more reproducible. Again, pick a number, it could be your birthday, your anniversary, or one, and stick to it across your analyses. Another area to think about in terms of programming your analyses is keeping track of the package versions that you use. Packages change, either by incorporating or deprecating options or by changing the underlying algorithms. Again, mothur is on version 1.40, and things have changed from version 1.0 to version 1.40. We get emails from people saying, you know, I can't run this certain function. And we say, well, what version are you using? And they say, well, I'm using version 1.10. It's like, well, that was from six or seven years ago; a lot has changed. But if somebody used version 1.10 for their analysis, then yeah, you might want to go back and use version 1.10 so you're using the same software they were using back then. The same happens with any type of code that you use. So we need to know the version numbers of the code being used so that you and others can replicate the analysis and perhaps understand why there are differences between versions. There's a function in R called sessionInfo whose output would ideally go in the general README file. If you run sessionInfo, you'll get output about the version of R you're using, the base packages, and the other packages that are getting loaded. If you want to know the specific version of a single package, you can use packageDescription("packageName")$Version.
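In code, those version-documenting commands look like this; ggplot2 is just an example package and is assumed to be installed:

```r
# Document the R version, platform, and attached packages
sessionInfo()

# Pull the version string for one specific package
packageDescription("ggplot2")$Version
```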
And so again, these are useful tools for documenting the versions you use; calling sessionInfo and copying and pasting its output into your README file would be really valuable. In R, you can get old versions of packages using the devtools package. This line, for example, would allow you to install an old version of ggplot2, and you can get older versions of other packages the same way. Another package that's useful for this is called packrat. packrat installs specific versions of packages in an isolated manner that can be transferred between people. So when you go into a project, instead of updating a package across your entire computer, it updates or uses the specific package version within that single project. Those packages can then be transferred so your analysis can be replicated by others. There are also a bunch of more complex systems that really only demonstrate, I think, how hard software versioning is. There's a thing called Docker, and we've talked a bit about Amazon Machine Images as ways to keep track of the versions being used in a data analysis workflow. There are also some R quirks to appreciate because of their effects on reproducibility. There are a couple of functions in R that you should never use; I think the original R developers regret putting them into R. These are setwd and attach. setwd is the one you'll see more frequently; it's a way of changing directories within R. Again, we want to use paths relative to the root of your project directory, so when you write your R code, assume that you're running it from the root of the project directory. Just as we don't want to use cd to move into different directories in Bash, we don't want to use setwd to change our working directory. There's also a hidden file called the .Rprofile file. That is a file containing code that R will run whenever R starts.
And so you can hide stuff in there. Your project has a .Rprofile file that has some stuff in it. If you give people your R code, you need to be transparent about what's in that .Rprofile file, because any R code that you're running as part of your project is also running what's in that .Rprofile file. You'll see in our repository that the .Rprofile file is part of what we're keeping under version control. Also, when you run R, do not save on quitting. When you're running R in an interactive mode, it will ask if you want to save; you do not want to save, and you do not want to restore when you start R. That's because you might be running an analysis in which you define a variable A, you quit, you save, and then when you run R again, if it restores the data, A is still live. And A might have a value that you don't anticipate. Or if you give me your code, I might not have A, or the version of A that you have. So it's really best not to save the data from your session and not to restore the data from a previous session; saving and restoring really limits the ability to reproduce other people's analyses. There's other R tooling that's useful for helping with reproducibility. There are R packages that allow you to interact with files from other software packages like Excel. I get Excel files from my collaborators, and again, I want to keep my raw data raw; I don't want to muck with those Excel files. Well, there are R packages that allow me to read in data from an Excel file to then work with in my data analysis, so I don't have to touch that Excel file or modify it in any way. There are also tools for checking your metadata formatting, to make sure that if you have a date, it's properly formatted, or that if you have weights of individuals, you don't have negative weights or weights that suggest somebody weighs three tons.
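For instance, reading a collaborator's spreadsheet without ever modifying it might look like this; the file path, sheet number, and variable name here are hypothetical:

```r
library(openxlsx)

# Read the raw Excel file into a data frame; the .xlsx file on disk
# is never touched or modified
metadata <- read.xlsx("data/raw/sample_metadata.xlsx", sheet = 1)

head(metadata)   # inspect the first few rows of what was imported
```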
And there are other tools for plotting, all sorts of amazing stuff that allows you to generate many different types of data visualizations. So again, in thinking about raw data staying raw, we have a variety of read commands in R for importing data from other formats. As I've mentioned, openxlsx allows you to read in Excel files. There's the haven package for reading things in from SAS and other statistics software. There's a googlesheets package that allows you to access and manage Google spreadsheets. And there's a variety of tools and APIs for accessing web content. rOpenSci is a group of R developers who are working with public databases to make them accessible from R. I can use the last one listed here, rentrez, to do searches of GenBank and PubMed from within R. That's a really beautiful way to interact with other databases. For checking metadata formatting, I love this example from Christie Bahlai, whom I've mentioned in an earlier tutorial, where she ran a statistical test and found that "corn" was different from "corn ", because "corn" in the second case somehow had a space added after it, and R thought those were two different values. There's a fun hashtag, #otherpeoplesdata; if you feel like you've got it bad, go check out what people are posting under that hashtag and you'll see all sorts of crazy things that people are accidentally doing. Some tools that we can use for checking metadata include the summary function, to make sure that we only have "corn" and not "corn ", that things are in the right range, and that things are the right data type. You can also use the table function to check the number of different categories and their frequencies. And you can use things like gsub to clean up the text; gsub is a function for finding and replacing text. Again, if you have programmatically listed out your steps for cleaning up the data, then you can rerun that code without touching your raw data.
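A sketch of what those checks and cleanups look like in practice; the data frame and column names here are hypothetical:

```r
# Reveal problem categories: a typo like "corn " shows up as its own level
table(metadata$crop)

# Find and replace: strip any trailing spaces from the crop column
metadata$crop <- gsub(" $", "", metadata$crop)

# Sanity-check a numeric column: spot negative or absurdly large weights
summary(metadata$weight_kg)
```

Because these fixes live in code rather than edits to the spreadsheet, the raw data file stays raw and the cleanup can be rerun at any time.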
So Christie could run this gsub command on her data set, and without changing her raw data, which preserves the typo, her analysis will have corrected for it. And as I mentioned, there's a package called validate that allows you to check that variables meet certain criteria. You can also use Google Forms to control data entry, and then, using some of the googlesheets documentation, you can use R to interact with Google Forms as well as with Google spreadsheets. So great. How do we run R outside of R? Well, from within R we could say, like we did earlier, source("code/plot_nmds.R"); we saw this with running that utilities file. From the command line, we can type R -e, for execute, followed by that line of code in quotes, and we can stitch together multiple R commands with semicolons. So what are we doing again? We've talked a lot about using R and the wonderful things we can do with it, and of course, to be fair, pretty much all of these things can be done with Python and other languages as well. But our goal here is to generate code that is robust to errors that we or others might introduce, so that the analyses can be replicated, and also to make code that's better and more reproducible. By automating our code, we can make our overall analysis much more reproducible. Again, I didn't mean for this to be a tutorial on how to program in R, but I wanted to highlight some of the tooling and approaches that you might use to make your analyses more reproducible and more robust. At long last, we're going to go back into AWS and add some R code as we close out this session. So I'll log into AWS, and hopefully you've gotten accustomed to doing this like second nature. Grab the IP address and we're in. Great. We'll cd into Kozich. I'm going to create a new file in code called plot_nmds.R.
So I'll do nano code/plot_nmds.R, copy this code in here, and save it. To run this, I want to double-check that I know what the dependencies are and what it produces. The dependency is a two-dimensional axes file generated by the nmds command in mothur, so when I run plot_nmds, I need to give it the name of that axes file. It's going to produce a file called results/figures/nmds_figure.png. Great, this will be very useful information as we put this into our analysis_driver.bash file. So I'll save this and come out. I'll go to the bottom of that file, add a comment saying we're constructing the NMDS PNG file, and then write R -e followed by, in single quotes, source("code/plot_nmds.R") and then a call to the function, plot_nmds(), with the name of the NMDS axes file in double quotes inside the parentheses. But I forgot the name of the file. So let's save this, back out, and do ls data/mothur; the file ends in .nmds.axes, and here is the horribly long file name. I'll copy it, go back into my analysis_driver.bash file, come back to the end, and paste that file name inside the quotes. Now I'm going to copy this line, because I don't want to run the entire analysis_driver.bash file, I just want to run this one line. So I'll quit out, save it, and run it. It says there's an unexpected end of input, and I think that's because I have some weird spacing going on from when I copied and pasted. So I'll get rid of that space and run it again. Oh, there's another space up here. These are the problems with copying and pasting. You know what? I'm going to show you a different way to do this. A useful command as you're building your analysis_driver.bash file, if you want to do this copying and pasting, is tail. If you do tail analysis_driver.bash, we get the last ten lines of our file.
And so I can copy this now; I think copying and pasting from nano was introducing some weird line breaks. So I can paste this in, and it runs. It looks like it built some type of plot, and we remember from the header of our plot_nmds.R file that it put the figure into results/figures/nmds_figure.png. If you forget the path, what I'm doing is hitting tab twice, which tells me the directories that are within results. Then I can do results/figures, hit tab twice, and it shows me what's there: nmds_figure.png. But I don't know how to open that. Do you remember how to open that? Right, we can use FileZilla. So let's see if we can use FileZilla to open up this nmds_figure file to see what it looks like. I'm going to come into my applications, and for some reason I don't have this in my dock, but I'm going to go ahead and open FileZilla. I want to connect, and I need to put in my new IP address. To do that, I come over to EC2, highlight the address, copy it, go back to FileZilla, paste it in, and say connect. Okay. There's my Kozich directory; double-click on that. We said it was in results/figures. Here's the moment of truth: how does it look? It looks great. So that's what the PNG file looks like. And again, we can use FileZilla to access the files, move things back and forth, and carry on. I'm going to stop here with FileZilla and AWS. So I'll close all these things, go back to my terminal window, and do exit, exit, exit. And over here I'm going to stop my instance. Actually, you know what? Before I stop my instance, I forgot to do something really important: I forgot to commit my changes. So I'm going to log back in, cd into Kozich, and do git status. I see that I've got some other weird things going on in here now that I didn't anticipate.
There's something like a code/.plot_nmds.R.swp file; I'm going to delete that because it's not under version control and I'm not using it, so I'll rm it. If I do git status again, I'm not sure what that other .R file is, so I'm going to add it to my .gitignore; up here I'll add that entry. Quit that, git status. Now I see I've got four files. So I'll do git add with .gitignore, analysis_driver.bash, code/plot_nmds.R, and results/figures/nmds_figure.png. git status: those are all ready to be committed. I can do git commit with the message "Generate NMDS figure". git status: we're good to go. And I'll go ahead and git push and add my credentials. Now we're ready to quit, so exit, exit. I'll then come back to actions here: Instance state, stop. Yes, stop. So we've already done the first exercise I was planning on asking you to do, which was adding a line to the analysis_driver.bash script to run the R file from the command line, and we committed it. What I'd like you to also do is add some comments, as you did in an earlier tutorial, to the code/plot_nmds.R file, and add perhaps one or two things to make the code more defensive. Perhaps you could confirm the number of columns in the file, or think about what happens if the sample names don't have a "D" in them. Then run the code and commit the new file and changes. I hope you now have a better understanding of why using tools like R or Python can help make your analysis more reproducible than, say, using a tool like Microsoft Excel. Beyond reproducibility, I love using R because there's a community of wonderful data scientists who are constantly striving to make the language better by expanding the tools we have for working with data and for making cool data visualizations. If you're interested in learning more about R, I suggest you check out my minimalR tutorial, which is also available on the riffomonas.org website.
There's an R package that has been a total game changer for my research group: it's called R Markdown. As we'll see in the next tutorial, it's a way to blend written text with R code. Have you ever had to update an analysis and then found that you needed to update all the p-values and summary statistics, or perhaps a table with a bunch of numbers that all have to change as well? That can be really tedious, right? It's also prone to a lot of errors. R Markdown is a way to avoid all that tedium and error. I can't wait to share with you how R Markdown has been so instrumental in improving the reproducibility of the manuscripts coming out of my research group.