Welcome back to the Riffomonas Reproducible Research Tutorial series. Today's tutorial is all about the importance of documentation. I don't know about you, but when I'm working on a project, my first inclination is to write for myself. Also, I generally have a very high regard for myself and my ability to remember things that I did today, tomorrow, next week, six months from now, even next year. The fact is, I couldn't tell you what I had for lunch yesterday, much less what I was doing for analysis on a project six months ago. I also have a bit of a biased recollection of things. If I were to describe to you how to get to my departmental seminar room, I might alter those instructions based on whether or not you're familiar with the University of Michigan. You may have encountered this in a previous tutorial when we were developing instructions on how to fold a paper airplane. I might have assumed that you would know what a paper airplane looks like and how it's supposed to be folded. My instructions may not have been explicit enough to tell you to fold the piece of paper lengthwise. What I was looking for was a paper airplane that looks like this, a fairly classical model of a paper airplane. If you had folded it widthwise on the first step instead of lengthwise, you'd get a paper airplane that looks a little bit different. It has a shorter body. You'll see that the fins back here stick out the back of the plane. It's a bit of an imperfect reproduction of my paper airplane. Regardless, they still fly pretty well. I still have fun with them, but it's an imperfect reproduction because I made a critical assumption about how you would make that first fold and perhaps I wasn't explicit enough in my instructions. Although writing documentation throughout an analysis can be tedious and isn't a whole lot of fun, it's critical for ourselves, whether that's next week or six months from now.
It's even more critical if we expect someone else to build upon our work. Today we'll be discussing various ways that we can provide documentation throughout a project. A downfall of many projects is to see the final manuscript as the documentation for that project. Really, a manuscript is just a signpost along the way towards other projects that hopefully others will continue to build on. Regardless, there are many other steps going from the beginning of a project to a manuscript that need to be documented. Where did you get certain reference files? What did you do at different stages of the analysis? What parameters did you use? All these things need to be documented so that others can build upon your work. And it's even more critical for you if you're going to be picking this project up again, say in six months, or in two months after your reviewers' comments come back and you need to update the manuscript. Join me now in opening the slides for today's tutorial, which you can find within the Reproducible Research Tutorial series at the riffomonas.org website. Before we discuss today's tutorial on documentation, I'd like to revisit some material that we discussed in the second tutorial of this series. You'll hopefully remember this grid that I suggested was a useful framework for thinking about issues in reproducible research. So what I'd like you to do is take 30 seconds or so and see if you can remember what goes in each of these four quadrants. There's a single word that I described going in each of these four areas. So go ahead and hit pause. And once you've come up with a solution, go ahead and press play again. So hopefully you came up with something that looks like this, where if you're using the same methods and the same population as somebody else and you get the same result, we'll say that's reproducible.
If you've got the same population or system, but you're using different methods, perhaps think of this as like triangulation, you would say that the result was robust. If you're using the same methods, but on different systems or different populations, we'll say that the result is replicable. And then finally, if you're using different methods applied to different populations and systems, and you get the same result, we can say that the result then is generalizable. So moving on to today's content, the learning goals for this tutorial are to first demonstrate that every threat to reproducibility is grounded in the ability to document one's work. That every problem that we have with reproducibility is ultimately a problem of documentation and communicating with the people that follow us. Second, we want to identify the various forms of documentation that can appear in a reproducible research project. We will then articulate the importance of scripting to document the processing of raw data to a final product. Next, we'll review the written documentation of a project and provide recommendations. And finally, we'll critique and generate self-documenting directory, function, and variable names in various scripts that we might use. So a guiding principle in thinking about reproducibility is this quote that is popular among many people within the software carpentry community, but nobody really knows who to attribute it to. Your single most important collaborator is you, six months from now, and current you doesn't have email. So how would you go about documenting your project if you knew that you had to drop it for six months, come back, and get going again? Say you have to go to a conference or you're going on maternity leave or you're going to be teaching for a semester. How would you write your documentation? 
How would you go about doing your project today, anticipating something like that, so that when you pick it back up again in a few months or a few weeks, you're ready to go again? I and many other people would agree that if you can do that, if you can document your project such that it's not that difficult to pick it back up again, then anybody else should be able to easily pick it up as well. And so by satisfying you as your most important collaborator, you will then satisfy future collaborators. So to lay out where we're going today, we're going to talk about various types of documentation, specifically written documentation, and then a number of other areas where we can document our project without really writing out pages of documentation. These include the principle of keeping your raw data raw, using data organization as a form of documentation, the idea of scripting everything, the DRY principle (don't repeat yourself), automation, and transparency. We're going to be discussing each of these in this tutorial, but they will all come up again and again in subsequent tutorials. So if they seem a little bit abstract right now, or if I'm talking about commenting code, know that we will be doing more commenting of code later in the series. This is an introduction to these different ways that we can provide documentation. So where have we seen the importance of documentation already? I think this is the fifth tutorial in the series. Well, this is a partial list. One area we might think of is our ORCID ID, which provides a universal way to connect to somebody. It kind of aggregates all of their scholarship. We might think about the data description fields in a figshare dataset. Meadow et al., in that paper we looked at, used R Markdown documents, which combine both code and text. There's a project description in GitHub for each repository.
When we wrote that readme file for the paper airplane exercise, that was another example. And then there's also the project license that you may have seen in a few of those projects. That's documentation about the permissions that the authors give to subsequent researchers for how they can interact with and use the code. And then also, when we were doing the paper airplane exercise, at the bottom of the page we'd write a little text about what we did to update the code. Those are called commit messages, which we'll see in subsequent tutorials when we talk more about version control. So all of these are examples of documentation that don't involve us writing copious lines of text describing how we did something. The documentation is baked into these various aspects of how we're doing our project. And that's a great practice. So one of the really useful tools is what's called a readme file. And again, we've seen these in various GitHub repositories that we've already looked at, including our paper airplane example. So what makes a readme file useful? To help with this, let's look at a few different examples. We'll look at the Meadow et al. microbiome study and two papers from my lab, one from Marc Sze and colleagues in mBio, as well as a paper from Sarah Westcott and myself. As we look at these, the questions I want you to answer are: What do you notice about them? What other information would you like to see? And how would you change the readme you wrote for that paper airplane example in the previous tutorial? And again, I'm showing these examples not to say one is infinitely better than the others. I think they all have flaws, they all have weaknesses, and we can learn from each of these to make better readme files in the future. So I'm going to go ahead and click on these and open them up in separate tabs. And so this is the Meadow et al. readme file.
I guess my link goes straight to the readme, but you'll see if you click on that link that, as we recall, GitHub will render the readme file on the main page of the repository. It's also here in nice bold letters, README.md. And so if you look at this: these scripts detail statistical analysis of bacterial communities found on classroom surfaces. Okay. There's some amount of information here. It's pretty minimalistic, I would say. If I was searching through GitHub and came upon this, I might wonder, well, what's the title of the paper? Where is the paper? How do I get a hold of that? Here is the readme from Marc Sze's obesity paper. And here he's got a title. I believe this is the abstract of the paper. And there's some information about how to analyze the full dataset, how much RAM you might need, and how long it takes to complete. And then there's an overview here of the organization of the project. And so you'll see the project directory. This is the top-level readme. There's a folder for documents, data, code, results, scratch, temporary files, other files, and a Makefile. Again, if we click on this link to get back to the parent, we don't see a link to the actual paper. And so maybe that would be a good thing for Marc and me to go ahead and do, since this is our own paper, to tell people where this paper exists. And yeah, I mean, it's got a lot of good stuff going for it, but at the same time it's missing something. Something you might also notice is that up here there's a directory for submission, but down here there's no directory for submission. It says there are three files: study.Rmd, study.md, and study.html. And those are probably here in this submission directory, where there's a lot of other stuff that doesn't necessarily correspond to what we saw in the readme file. So there's documentation, but it doesn't totally align with the project as it actually exists in the repository.
So speaking from experience, you're so excited to have the paper finished that generally you don't go back and tidy up these little pieces, but it's really important because, again, if somebody wanted to come along and, say, add their own data set to the pipeline, we need to improve the documentation so that it's easier for them to do that. So here's the third one, a paper from Sarah Westcott and me. Again, there's the title of the paper, the abstract, the organization, and the different data files that we used. There's a directory called submission along with the various files in that directory. Down here is a list of the different dependencies you need to run the analysis: what versions of the software we used, where things need to be installed, and then how you would go about building the paper. So if you go to a prompt within the repository and you type make write.paper, it'll run it. And then there are links to the data sets that we used in the study and links to the PubMed versions of those papers. So again, more organization, more documentation, still no link to the paper, which is pretty bad of me. And so we should go back and fix that. Hopefully by the time you're watching this video, I've gone in and updated it to indicate where the paper is. A good place to do that might be right here: paper describing OptiClust method. So let's go ahead and just do that. Up here there's this description, paper describing OptiClust method. You can press edit. And then we'll say paper describing OptiClust method published in mSphere. And then let me go to PubMed and get the PubMed link. So if I search for Schloss, I'll go ahead and grab that link and pop it into this description. Maybe I'll put it in the website field, actually. We'll see how that works. And do save. And so you now see paper describing OptiClust method published in mSphere. And here's a link to the paper.
So that makes things really nice, and it links the repository back to the paper. And of course in the paper, we describe the presence of this repository. Another thing that we've done is create a topic within GitHub called reproducible paper. And so what you can do is, if you click on manage topics, you could type in reproducible-paper and it will pull that up. It'll basically be a tag on your repository. And so if you then click on reproducible paper, this will pull up other reproducible papers that people have been publishing and that they've tagged. So here's somebody that put together some tools for doing reproducible research. Other things from my lab. Here's a reproducible paper from Ben Marwick. And so here again is a really nice README file describing that the repository contains the research compendium of their work from the 1989 excavation of Malakunanja. So it's nice to be able to look at other people's repositories within this reproducible paper tag to get an idea of how we might improve the reproducibility of our own projects. Okay? So again, like I said, none of these files are perfect. They all leave a little bit to be desired. But as you interact with other people's repositories or as you look at what other people have done, critique them and think about what's good, what's limiting, and what you can bring from those to improve your own study. Okay? So I'm going to go ahead and close these tabs out. A big problem with this, like we saw with the Meadow and Sze examples, is that documentation is a very thankless task that no one wants to spend a lot of time on when they could be analyzing data and writing a manuscript. But again, in the long run, if you want others to be able to build upon your work, or if you want to be able to communicate with yourself in six months, it's really critical to have that documentation as a form of communication. It really helps to have a readme in the root of your project.
And that then provides navigation, can provide navigation to anyone coming into your project. A nice feature of GitHub is that it will automatically show the readme file when you open the repository's webpage. It's also useful to provide this directory structure so a newbie coming in knows where everything is or should be, right? Or if your PI comes in and needs to make a figure for a presentation, how difficult is it for them to find the data and the code to build a figure from your previous paper? You can also then specify software dependencies, versions and where they should be installed, and then give instructions on how someone would enter the project and what they would do to run your project. And so this readme that we're describing is in the root of your project. It's at the highest level folder of your project. And so something you might also consider would be putting a readme file in some of these subdirectories. So putting a readme file in your data directory to perhaps explain where did the data come from. Or putting a readme file in your submission directory to perhaps describe where the manuscript has been submitted or what conferences you've presented it at. The next thing I want to talk about is keeping your raw data raw. And this is really critical because by keeping a raw version of your data, you can always start over. You can always go back to a blank page if you will. Whereas if you've got a spreadsheet that you're working with and you start editing things where perhaps people formatted the date differently when they've entered data, if that's the only version of the file you have, then if you screw up or if more data are added or if there's different versions, it quickly becomes a headache. 
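One way to notice that a raw file has been edited by accident is to record a checksum when the data first arrive and compare against it later. What follows is only a minimal sketch in Python, not something from the tutorial itself; the function names `file_checksum` and `raw_data_unchanged` are hypothetical, and where you store the recorded checksum (for example in a README in the data directory) is up to you.

```python
import hashlib
from pathlib import Path


def file_checksum(path):
    """Return the SHA-256 checksum of a file as a hex string."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read in chunks so large raw files do not need to fit in memory.
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def raw_data_unchanged(path, recorded_checksum):
    """Return True if the file still matches the checksum recorded earlier."""
    return file_checksum(path) == recorded_checksum
```

If the comparison fails, that is a signal to go back to the original copy of the raw data rather than continue with a file that may have been altered.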
And so it's a really great practice to keep your raw data as raw as possible and, even if it's got all sorts of formatting issues and different problems, to keep that file somewhere within your directory structure for your project, so that you can always go back to it. As an example, and I've done this frequently: have you ever accidentally sorted a spreadsheet by one column? So one column now is sorted alphabetically, but the rest of the spreadsheet didn't get sorted with it. I think that's a very common problem. And so if you have a raw version of that, that's not such an issue. You can always go back and figure things out. We can also think of data organization as a form of documentation. Have you ever been to a conference or a seminar where you get to see somebody's desktop and you see it's covered in a thousand files? How efficiently do you think they're able to access the information in any one of those files? How easy would it be for them to find the file they want? You might also ask, you know, what's the likelihood that one of those files might get lost or deleted? As you've perhaps seen, I maybe have 20 icons on my desktop, and I know that I probably have too many there because it's very easy to lose track of things. And if I accidentally delete something and I don't notice it for another six months, then that can be a big problem. And then similarly, say somebody can make it work with a thousand files on their desktop, how likely is it that they're going to be able to teach that system to you? So if you've got to come into the system and understand where the code is or where the raw data are or their naming scheme, how difficult is it going to be for you to pick that up? And so this is where it's very valuable to have separate directories for your code, your data, your figures, your tables, and your documents, and to make it very clear where the different types of files should be located.
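To make the idea of separate directories concrete, here is a minimal sketch in Python that sets up a project skeleton. The directory names in `PROJECT_DIRS` and the function `create_project_skeleton` are assumptions chosen for illustration, loosely echoing the layouts in the readme files we looked at; they are not a prescribed structure.

```python
import tempfile
from pathlib import Path

# Hypothetical layout: adjust the names to whatever your group agrees on.
PROJECT_DIRS = [
    "code",
    "data/raw",
    "data/processed",
    "results/figures",
    "results/tables",
    "submission",
]


def create_project_skeleton(root):
    """Create the project directories and a placeholder README in the root.

    Returns the sorted list of directories (relative to root) that exist
    afterwards. Running it a second time changes nothing.
    """
    root = Path(root)
    for relative_dir in PROJECT_DIRS:
        (root / relative_dir).mkdir(parents=True, exist_ok=True)

    readme = root / "README.md"
    if not readme.exists():
        readme.write_text(
            "# Project title\n\n"
            "Describe the project, the directory layout, and how to run it.\n"
        )

    return sorted(
        path.relative_to(root).as_posix()
        for path in root.rglob("*")
        if path.is_dir()
    )


if __name__ == "__main__":
    # Demonstrate in a throwaway directory so nothing real is touched.
    with tempfile.TemporaryDirectory() as scratch:
        for directory in create_project_skeleton(scratch):
            print(directory)
```

The point is less the script than the convention: someone opening the project can guess that raw data live in data/raw and figures in results/figures without asking.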
If you think about the repositories we looked at, in those with better organization it was very clear where the different types of data should be. So next, we want to think about scripting, and that we want to be able to script everything to convert from raw data to processed data. By scripting everything we can make it automated so that I don't have to question what parameters I used. If I'm providing you a script that has explicit instructions that the computer's following to complete the analysis, then that's going to have all the parameters. That's going to have the location of the code. That's going to have the name of the software I used. Your scripts then become that lens that you and others can use to see how you've analyzed your data. And your instructions then become better as you become more explicit. So if your computer knows what you're doing or what you want to do, then there's a good chance that others will know what you're doing too. Also, if you don't allow yourself to manually manipulate the raw data, then you have to be explicit in how you document the processing of your data. Sometimes I've worked with a tool called ARB for working with 16S sequences, and it's got a nice graphical interface that I haven't figured out how to work with from a script. And so if I'm working with that software, I have to be very explicit about what buttons I push, what toggles I flip, and what parameters are set, because if I have to come back and do it again, I'm going to have to remember how to do it. And I maybe go into ARB, I don't know, once or twice a year at most. Also, although the computer might understand what you're saying, you might not understand what you're saying. So your code needs to be interpretable.
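As a small illustration of a script acting as that lens, here is a sketch in Python. The file layout, the `n_reads` column, the `MIN_READS` threshold, and the function `filter_samples` are all hypothetical; the point is only that the parameter is written down in the script and that the raw file is read but never modified.

```python
import csv
from pathlib import Path

# The parameter lives in the script, so it is recorded with the analysis.
# The threshold value and the column name are assumptions for illustration.
MIN_READS = 1000


def filter_samples(raw_path, processed_path, min_reads=MIN_READS):
    """Write the rows with at least min_reads reads to a processed file.

    The raw file is opened read-only and is never changed.
    Returns the number of rows kept.
    """
    raw_path = Path(raw_path)
    processed_path = Path(processed_path)
    processed_path.parent.mkdir(parents=True, exist_ok=True)

    kept = 0
    with raw_path.open(newline="") as raw_file, \
            processed_path.open("w", newline="") as out_file:
        reader = csv.DictReader(raw_file)
        writer = csv.DictWriter(out_file, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Keep only samples that meet the documented threshold.
            if int(row["n_reads"]) >= min_reads:
                writer.writerow(row)
                kept += 1
    return kept
```

Called as `filter_samples("data/raw/samples.csv", "data/processed/samples.filtered.csv")` from the root of a project, it leaves a record of exactly which cutoff was used, which a manual edit in a spreadsheet would not.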
And we'll talk a little bit more about this in a bit, but you need to provide comments for your scripts so that when you come back to look at them in six months, or when others come and look at them, it's possible to understand what's going on. Related to scripting is automation. Automation is helpful because it'll keep track of the ordering and maintenance of dependencies. If you have a really complicated data analysis workflow and you, say, add a data set or add a piece of data or move a piece of data, would you know off the top of your head what steps to repeat to update the overall project? And again, as those projects get more complicated, it gets a lot harder. By automating your workflows, the automation will then keep track of the ordering and the maintenance of your dependencies. And so a question to ask yourself is, if you had to add more data, how difficult would it be for you to update your analysis? And would you remember all of the steps? One of the goals for that paper with Marc Sze was that if another person publishes a paper that has obesity data, some BMI information, and 16S data, we would like to be able to add that data set, run make write.paper, and have it regenerate the plots and regenerate the whole paper including that new data set. I think that's the ideal. It doesn't always work out that well, but again, thinking about automation in those terms, it's very easy to see how that would facilitate greater reproducibility. Again, in these large and complex analyses it becomes very easy to lose track of where you are in the analysis, and by scripting it so that you've got your workflow written down for the computer to read, it becomes much more explicit.
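The core idea behind a tool like make is that a target gets rebuilt when it is missing or when any of its dependencies is newer than it is. Here is a minimal sketch of that rule in Python, using file modification times; the function name `needs_rebuild` is hypothetical, and this is a simplification of the idea rather than a description of how make or the Makefiles in these repositories are implemented.

```python
from pathlib import Path


def needs_rebuild(target, dependencies):
    """Return True if target is missing or older than any dependency.

    target: path to a generated file, such as a figure or processed table.
    dependencies: paths to the files it is built from (data, scripts).
    """
    target = Path(target)
    if not target.exists():
        return True
    target_time = target.stat().st_mtime
    # Any dependency modified after the target means the target is stale.
    return any(
        Path(dependency).stat().st_mtime > target_time
        for dependency in dependencies
    )
```

With a rule like this applied to every step, adding a new data set only triggers the steps downstream of it, and you don't have to remember which ones those are.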
And of course, not only do we have software dependencies, but we also have data dependencies. If we're doing, say, classification of 16S sequences, there are all the upstream steps going from a FASTQ file to the sequence that we're trying to classify, but then there are also dependencies like the reference files that we use, and those types of file dependencies go into our analyses. The next aspect of documentation we want to think about is transparency. How well are you going to document your analysis if you know that someone else might look back at what you've done? This is a question that I realize could sound fairly cynical, but I'm hoping that you think about it in the spirit in which it's intended. Again, if the intent is for somebody else to eventually build upon your work, which is the goal of all science, I hope, then you want to be transparent. You want to help future you as well as future collaborators. And so you might want to write documentation that is explicit about why you've done the things you've done. Why did you pick the RDP training set for classification versus the SILVA or the greengenes training sets? Why did you pick this particular parameter value? By being transparent about that, and being transparent in your documentation, it is then much easier for others to build upon your work. Science will only move forward when others can build off of our workflows. If you're trying to have sharp elbows and keep people away from what you're doing, be opaque, don't tell people why you're doing things. But I can almost guarantee you that nobody's going to build off of or want to work with the data that you're generating. You could put all your data and all your code up on GitHub, but if you don't give me a roadmap, if you don't tell me what you did, then it's kind of worthless. So a tool that we have for doing a lot of this documentation is plain text, and we've already talked about
that a bit as we described and worked with Markdown when we were making the paper airplane repository and writing out that readme file. By using Markdown we put greater emphasis on writing the text rather than the formatting. You know, we use an asterisk to denote a bulleted list, or two asterisks around a word to denote italicization, some emphasis. It's also easier for computers to read plain text than, say, a Word file, and so it's much more portable. It doesn't depend on another person having the software to read it. A text file is a very generic type of file; anybody can use it. We can also use code to generate text, which we'll see in a future tutorial. So when we write this documentation, we've already talked about one file for documentation, which is called README. What does that file name tell you? It tells you to read it. It's very explicit about what's going on in that file. Sometimes you'll see a file called INSTALL, right? And so that's probably going to have instructions on installation. So it's also very important to put documentation into our file names, and that means thinking long and hard about what we name things. There are a few rules that you want to keep in mind. One best practice, one rule, is don't put spaces in your directory or file names. Yeah, it looks prettier to the eye, but it ends up causing big problems for computers, because computers frequently freak out at spaces. And so we really discourage the use of spaces in directory or file names. As you can see here with the example of raw data, you could write that as two words, but the computer would perhaps see that as two directories. If you wanted a directory called raw data, you would use raw_data, or you could blend the two words together to make it rawData. And the same goes for files like build figure one: you could write it with underscores, or use capitalization to denote the boundaries of individual words. It's also helpful to be descriptive, economical, and logical, right? So don't name
your directory stuff. Don't name your directories data_raw or data_mothur. Name a directory data, and then within data have a directory called raw and another directory called mothur, so that you can then use the directory structure as a form of documentation to show you where the different types of data are. And data, you know what that is. If you call something stuff, well, I guess there's stuff in there, but I don't know what the stuff is. As we move to thinking about our code, there's always a goal to make our code what's called self-documenting, where we want to pick meaningful variable and function names in our scripts. Again, we don't want to call a function or a variable stuff. We don't want to call things foo or bar, which are generic variable names that a lot of programmers like to use. We want to use meaningful names. We also want to choose a casing strategy and be consistent about it. We talked about this already in the last slide about directory and file names: we don't want spaces in our variable or function names. Using underscores is what's called snake case, because the text is all lowercase and it looks like a snake. Alternatively, there's camel case, where the first letter of the name is lowercase but then you use capitals to denote the word boundaries, so it looks kind of like a camel. Alternatively, you could also capitalize the first letter. But the key is to be consistent. Pick one type of casing strategy and stick with it. Try not to mix snake case and camel case. I have done this frequently, and as I work with these scripts more and more, I frequently forget: was that variable in snake case or was it in camel case? But if I always write in snake case, which I'm trying to do more and more, then I don't have that question. I don't have to worry about it. Depending on the programming language, there are other symbols that you probably want to avoid because those symbols are going to have some baked-in meaning. In R you generally want to avoid using periods and
hyphens and other symbols in your variable names. The only symbol that I want to use in a variable name is an underscore. Something else that you might try is applying some grammar to your variable names. If you've got a variable, give it a noun as a name. You can think of sequence as a variable name for a sequence; that is a noun. For functions, we might think of giving the names verbs, so cluster_sequences, plot_data, generate_figure_3. If we have a variable that's a logical, think of it as a question. Use the variable name to ask the question, like is_sequence, and then we would expect the answer to be true or false, yes or no. By using grammar, with nouns, verbs, and logicals, we can think about how we can name our variables better to make them a little bit more self-documenting, so that if I see something called cluster_sequences, I know it's not holding a value but it's holding a function. So based on that, I'd like you to look at these names and think about how you would fix them if you think there's a problem. If there's a forward slash at the end of the name, think of that as a directory name. If there's a pair of parentheses, think of that as a function name. And if there's a dot and a letter, like the .R in generate_figure_1.R, think of that as a file name. So look at these and ask, how would you suggest that somebody change the variable, file, directory, and function names to be more consistent and more helpful? Go ahead and pause this, answer this question, and edit these variable, function, file, and directory names to be more descriptive and more helpful. There isn't necessarily a right or wrong answer. Ultimately this is going to relate to what's helpful for you, and again, helpful for you in six months when you're coming back and trying to figure out what this variable means. So as I said, with that last exercise there isn't really a right or wrong answer. There are some peculiarities for how we can name files and directories or
variables or functions that are kind of built into the environment, but for the most part, how we name things is a tool to help us. It's a tool for documentation. And so a lot of people develop what's called a style guide. If you've ever read the instructions to authors for a journal, you know every journal has a different style guide. There's no right or wrong reason for why they want you to cite references a certain way or use different headings or whatever, but they need consistency so that all the papers in that journal look the same. Well, the same is true with our code and with our data analysis tools. We need consistency. We need to develop a style guide. And so I encourage you to come up with a style guide with your research group. You might think, well, I'm working on this project and no one else is working with me, and that happens a lot in academia, I know. But at the same time, think about the potential: if you and your colleagues in the lab realize that you're all generating the same type of plot or doing a similar type of analysis, you could very quickly see the value in sharing code between projects. But if you're all using a different style or a different approach to coding, then that code is perhaps going to be of limited use because you're not going to be able to talk to each other. It's going to be really difficult to interface with each other. Okay? And so again, be consistent within your own project and across projects, and then hopefully across your research group, and think about what you want people to be doing in terms of programming and the approaches that they use. You can also poach ideas from other style guides. Hadley Wickham has one for R, and Google has them for R and Bash. Links are here on that page, and you can click on those to go see the style guides. Something to appreciate is that style guides are pedantic and often very arbitrary. Their goal isn't necessarily to make sense; their goal is to enforce readability and good practices in how we
program. So here is an example from Hadley Wickham's style guide, where he says: place spaces around all infix operators, things like equals, plus, minus, arrow. The same rule applies when using equals in function calls. Always put a space after a comma and never before, just like in regular English. So you might say, well, why do I need a space before and after a division sign or a plus sign? Well, more space makes things easier to read. You might disagree with Hadley; that's fine, but come up with a style and be consistent about it. And again, Hadley's got a lengthy style guide, as does Google for R code.
There are other areas that help you to enforce good coding practices, and we'll come back to these in a later tutorial when we talk more about scripting within R. The first involves a file called the .Rprofile file. So, pop quiz: what does that . mean in front of Rprofile? If you don't remember, I'll let you look that up, but you generally won't see that .Rprofile file. And so there might be some secret sauce in that file, and if you give somebody a copy of your directory or R code from your directory, they may not get that .Rprofile file with your code. If you've got a whole bunch of code in there that your script to generate a figure depends on, then the person who's trying to generate the figure as well is going to be in for a world of hurt. Similarly, don't save your R session on exiting. If you save your R session, then you're going to have variables that may or may not persist to the next session. Again, that's not going to be very portable between users, between environments, or between projects. Don't use attach or setwd in R. Again, we'll come back to why these are problems, but ultimately they're problems of documentation. We're creating an assumption about the structure and about the tools that other people are going to be using, including yourself, again, in six months. So it's really helpful to be able to run everything from the root of your project directory, and it's very
clear where everything starts and ends. There's no mystery about whether or not you were in a certain directory when you ran a script, because you're always in the root of your project when you run every script. So again, these are getting a little bit away from where we want to be in talking about documentation, but they also relate to these issues of transparency in documentation and not making assumptions about future you or future collaborators. We'll come back and discuss these again in a subsequent tutorial.
Another thing to think about is to write code for people to read. We want to pick languages that facilitate this goal. Languages like R and Python are very popular, I think, because they're very easy to read. Another language that isn't so widely used by bioinformaticists is Ruby, which is very easy to read. Other languages like Perl are very powerful and have been very popular, but they're impossible to read. I've heard them called write-once, read-never languages: you can write the code and it will work, but you're not really sure why, and if you had to explain it to somebody, it might be really hard. Related to this, there's also a push among some people to write minimalistic code. There's a game that people like to play called code golf, where they're given an assignment: do this operation in the fewest lines of code or the fewest characters. And so you can imagine that there's this push to generate code that is frankly worthless because it's not readable. It's so cute or sophisticated that nobody knows what it's doing or how it works. So really think about writing code for people to read. If your variable has 10 letters in it, big deal. That might seem really long, but if it's a descriptive variable name, then that's really helpful. It's transparent; it helps others to read what you're doing. One of the things my group does for lab meetings is go through each other's code, and there are people in the
group who know nothing about coding, but because people write their code well and use descriptive and expressive variable names, people who don't code can participate and can read the code and figure out what's going on in the text. That's very powerful, and I think that's a goal that we should all hope for.
Related to making our code more readable is to be sure we're commenting our code. Yes, we want to give expressive variable and function names, but that's really insufficient for documenting your code. Something to think about would be having, at the top of a script file, some type of explanation about what the script does and what the expected inputs and outputs are. Beyond just the top of the file, if you have more sophisticated and complex functions, you should also have better descriptions of what's going on in the code, as well as what the inputs and outputs are. And then use comments liberally to describe what a line of code should be doing and why. Very few people put in too many comments; it's usually a problem that there aren't enough.
So I have an exercise for you now. What I'd like you to do is copy and paste an R script that I've got linked here from this study from Kozich et al. from my research group. We'll see this project in subsequent tutorials. I'd like you to copy it into a new text file and save it, and I'd like you to think about three things that we did well in this code stylistically and three things I could improve upon. Comment the code: add comments to indicate what's going on and where you have potential questions. At the top of the file, add a comment that includes your assessment of how readable the code is and whether there's anything about the code that you might change. And then compare your commented code to somebody else's. So I'm going to go ahead and come out of full screen and open this in a tab, and so here's my R code. As I talked about in the previous tutorial, we can use our terminal, and what I'd like
you to do is to create a file that we'll call plot_nmds.R. If you're using Git Bash, or if you're using Terminal on a Mac or Linux, at the command prompt you can open a text editor that's fairly simple to use called nano. This is, again, a nice text editor. It's not super powerful, but it will get the job done for the purposes of this tutorial series. So go ahead and copy and paste the code in here. I'm going to hit Ctrl-O, which will then ask me for the name of the file to write, and I will say plot_nmds.R. Then you can cursor around in here, and as you've perhaps noticed at the top here, you can make a comment by using the pound sign. So again, hit Enter to write the file, and to get out, hit Ctrl-X. Now we can type nano plot_nmds.R and it pops back open.
So again, what I'd like you to do, with that R script we've now copied and pasted into our text editor, is name three things that we did well stylistically and three things that I could improve upon. Again, just think stylistically. Don't say, oh, you're using base R, you should be using ggplot; we'll save that for another tutorial. But think about my variable names, my function names, my commenting. Comment the code to indicate what's going on, navigate the reader through the code, and perhaps use comments to indicate where you might have questions about what's going on. And then at the top of the file, add a comment that includes your assessment of how readable the code is and whether there's anything bigger about the code that you might change.
Many of the practical aspects of today's tutorial will reappear throughout the rest of the series. We haven't talked about scripting yet, but we'll again see the need for code hygiene in a future tutorial. Later, we'll make use of a tool called make that helps to document the flow of data from a raw file all the way through to a summary statistic that might wind up in your final manuscript. Similarly, in the next tutorial we'll see the need for
README files and structure to separate raw and processed data files. Between now and then, please look back at the directories where you have your most recent project. Do you have any type of README files that will orient someone coming into the project? How well do you comment your code? If your PI were to come along and take a look at your directories, would they be able to find the code and the data needed to generate Figure 3? That's your homework for the next tutorial. Until then, think about the various ways that we can improve the documentation of our projects, and resolve to do a better job of documenting them. I know this is an area that I frequently slack on, but it's one I could do a better job with, too.
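If it helps to see the naming and commenting advice from this tutorial in one place, here is a minimal sketch. It is written in Python rather than R, simply as one of the readable languages mentioned earlier, and the same ideas carry over to R. Every name in it (MIN_LENGTH, is_sequence, filter_sequences, calculate_mean_length) and the length threshold are hypothetical choices made for illustration; none of this comes from the Kozich et al. script.

```python
"""Summarize the lengths of a set of DNA sequences.

Input:  a list of candidate DNA sequences (strings)
Output: the mean length of the candidates that pass a simple filter
"""

# Hypothetical threshold chosen for this sketch: shorter reads are dropped.
MIN_LENGTH = 5


def is_sequence(text):
    """Logical named as a question: is this text a plain DNA sequence?"""
    # Only the four unambiguous bases are accepted, so a read containing
    # an ambiguous base such as N answers the question with False.
    return len(text) > 0 and set(text.upper()) <= set("ACGT")


def filter_sequences(candidates, min_length=MIN_LENGTH):
    """Function named as a verb: keep valid sequences that are long enough."""
    return [c for c in candidates if is_sequence(c) and len(c) >= min_length]


def calculate_mean_length(sequences):
    """Function named as a verb: average length, or 0.0 for an empty list."""
    if not sequences:
        return 0.0
    return sum(len(s) for s in sequences) / len(sequences)


# Variables named as nouns, with spaces around operators and after commas.
sequences = ["ACGTAC", "acgtacgt", "ACGTNN", "ACG"]
good_sequences = filter_sequences(sequences)
mean_length = calculate_mean_length(good_sequences)
print(mean_length)
```

Notice that the header says what goes in and what comes out, the comments explain why a line is there rather than restating it, and a reader can tell a noun (a value) from a verb (a function) from a question (a logical) by the name alone.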