Welcome back to the Riffomonas Reproducible Research Tutorial series. Today's tutorial is called First Steps for the Reproducible Research Novice. The approach is really inspired by my interactions with, and the writings of, Christie Bahlai and Karl Broman, two scientists that I really respect for their practical and down-to-earth advice when it comes to improving the reproducibility of data analysis workflows. In today's tutorial, we'll go over some first practical steps to improving the reproducibility of our data analysis. As you hopefully recall from the previous tutorials, we read an editorial called All Hail Reproducibility by Jacques Ravel and Eric Wommack. In that, they mentioned several tools. In this tutorial, we're going to use several of these tools by setting up ORCID and GitHub accounts, exploring FigShare, and learning more about Markdown. Finally, we'll do a simple exercise using GitHub and Markdown that will be a useful analogy for thinking about the reproducibility of larger projects. Join me in opening the slides for this tutorial, which you can find as the First Steps Towards Reproducibility tutorial within the Reproducible Research Tutorial series at riffomonas.org. So let's dig into this tutorial, the first steps for the reproducible research novice. The learning goals for this tutorial include reaching the conclusion that reproducible research is a continual process with no end. We'll evaluate the initial steps that you can take to make your research more reproducible. We'll explore different interfaces that you can use to make your research more reproducible. Then we'll learn a very useful tool called Markdown that allows us to write simple or even more complex documents. And finally, we'll use a tool called version control without even knowing it.
As I mentioned in the introduction, a lot of my thinking and how we've structured these tutorials comes from Christie Bahlai and Karl Broman, two data analysis people that I really respect for their clear thinking and practical advice. And so if you click on the links in the bottom left corner, those links will take you to different blog posts that they've written about reproducible research and the idea of the first steps for the reproducible research novice. So the first point I want you to keep in mind is don't feel like you need to make everything 100% reproducible in your first go. That's a huge goal. And we might like to be smug and think all research should be reproducible. But as we've already discussed in the previous tutorials, that's a high bar to try to reach. Second, you'll quickly realize, as I've already mentioned, that that's a fool's errand. Nothing, nothing will ever be fully reproducible. We can do our best, but over the long arc of time, things change, right? We've mentioned this in the previous tutorials. Databases change, software changes. There are differences between different operating systems. And so nothing will ever be fully 100% reproducible. What I encourage people to do is, with each project, attempt to add one small element to make your work more reproducible. So with this project, identify something that you can do. Perhaps it's setting up a FigShare account and making sure that all of your derivative data that goes into making a plot ends up on FigShare. Perhaps it's being really diligent about putting your sequence data and your metadata into the Sequence Read Archive. So with each project, you layer on new skills, new elements to make your research more reproducible. So, some first steps that I've already mentioned. You could get an ORCID account. Many journals nowadays require the corresponding author, at least, to have an ORCID account. You can post the data behind your figures into a tool like FigShare.
You can make your code publicly available. And then also, I didn't list it here because I kind of assume it in people, but make sure that your sequence data is publicly available in a database. And the appropriate database for sequence data is the Sequence Read Archive, the SRA. So the first tool that we're going to talk about is ORCID. And so to motivate this, ask yourself a question. If you publish a paper and your email address is on that manuscript, or you put out some tool or some resource that's attached to your email address, do you think your current email address will be any good in 10 years? Right? I'd like to think that my email address here at the University of Michigan will still be valid in 10 years. But I know that 10 years ago, I did not have a umich.edu email address. I was at the University of Massachusetts thinking, well, I'll be there for 10 years. Right? So things change. Sure, there's Google to get a hold of people. Pat Schloss is kind of a unique name. But if your name is Joe Smith or Mary Smith, you know, those are kind of hard names to Google. And so it's difficult to link back to people. And so the ORCID ID is a tool that allows you to connect your scholarly work with your identity. Right? And so it helps to deal with changing email addresses, changing contact information, as well as the problem that some of us have names that are kind of common. Right? And so the other advantage is that it ties into other services. So if you're using a service like Impactstory, LinkedIn, or ResearchGate, they'll allow you to populate information there using your ORCID ID. It's also free, which is really nice. And unlike these other services like LinkedIn, ORCID doesn't send you spam. At least I haven't noticed any great amounts of spam. So let's go ahead and click on this link for ORCID ID. And this brings up the ORCID ID homepage.
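As a side note on what an ORCID ID actually looks like: it is a 16-character identifier written in four hyphenated groups, and the last character is a check digit computed from the first fifteen (ISO 7064 MOD 11-2). The Python sketch below is only an illustration of that check, not anything you need for registering. The function names are made up for this example, and the identifier 0000-0002-1825-0097 is the one ORCID uses for its fictitious sample researcher.

```python
import re


def orcid_check_digit(base_digits: str) -> str:
    """Compute the ISO 7064 MOD 11-2 check character for 15 digits."""
    total = 0
    for ch in base_digits:
        total = (total + int(ch)) * 2
    result = (12 - total % 11) % 11
    return "X" if result == 10 else str(result)


def is_valid_orcid(orcid: str) -> bool:
    """Return True if the hyphenated identifier has a correct check character."""
    compact = orcid.replace("-", "").upper()
    # 15 digits followed by a digit or X (illustrative format check only)
    if not re.fullmatch(r"[0-9]{15}[0-9X]", compact):
        return False
    return orcid_check_digit(compact[:15]) == compact[15]


if __name__ == "__main__":
    # ORCID's documented sample identifier, not a real researcher
    print(is_valid_orcid("0000-0002-1825-0097"))
    # Same identifier with the last digit changed
    print(is_valid_orcid("0000-0002-1825-0098"))
```

The check digit only catches typing mistakes. It says nothing about whether the identifier has been registered or who it belongs to.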
And so as they say, it allows you to distinguish yourself in three easy steps, and the ORCID ID provides this persistent digital identifier that allows you to distinguish yourself from every other researcher. Okay? And it's being used in manuscript and grant submission. It helps support automated linkages between you and your professional activities, making sure that your work is recognized. So what I'd like you to do is go ahead and register to get your own unique ORCID identifier. So we'll go ahead and click on Register Now. I already have an account, but go ahead and plug in your first name, your last name, your primary email, your additional email, and then go ahead and create your ORCID password. You can tell them how often you want to be bugged, and then go ahead and register. So I'm going to go ahead and sign in to my account. I've been on ORCID for a while, and this is something that might require a fair amount of curation depending on what you want to provide. I know we certainly get asked in a lot of places for information that makes up something like an electronic CV. I think ORCID is probably something worth investing a fair amount of time in, and you can see that it has a number of my publications in here. I noticed there's a few missing that I should probably go ahead and add. It also has information here asking about education, employment, funding sources, keywords, various bits of information. And so again, this is a site that someone could come to to learn about me. To illustrate this, let's go to PubMed and look for one of my papers. Here's an editorial that I wrote in the journal mBio that we'll go ahead and click on, and this will take me to the mBio website. And if you look down here under ORCID on the right side of the screen, you'll see there's a profile for Schloss PD and a profile for Casadevall A, Arturo Casadevall. So let's go ahead and see what Arturo already has in here.
And so he doesn't have any information that's publicly available, which is fine. If we come back and we look at my profile, you'll see that this is my public profile. So this is what you would see about me if you came and looked at my information. And so you could see all the papers that I've written. And I can imagine that, like I said, this would be useful for people that have a name that's fairly common, that's not unique within science, for people that perhaps move around a lot, for people who change their name. I know several women who have changed their name over the course of their scientific career. But again, it's a useful tool for keeping track of your identity with your scholarly work. And you can see Arturo's had very little information in it. Mine has a bit of information. It links you to my works. And I could add more information if I wanted to. Ultimately, the value of ORCID comes from what you want to get out of it. And so the first thing I'd encourage you to do is to start populating this with information. It is useful. I also know that we get inundated by a lot of requests to create things like this. Another useful tool is your Google Scholar profile. But there again, that's one more thing to do. The advantage of ORCID is that it's nonprofit. You control it. The advantage of something like Google Scholar is that everybody uses Google. But at the same time, it's commercial and you're not quite sure what's going on with the information. And so ORCID is, again, a tool that many journals are requiring. At least one of the authors, typically the corresponding author, has to have an ORCID ID. So if for no other reason, it makes sense to have an ORCID ID. Great. So returning to the slides, now I want to talk briefly about posting the data behind your figures. This is something really simple that you can do to make your work more reproducible and more transparent. So one question might be, well, what's wrong with posting the data on a private server?
My lab has a website. Why don't I just put all the data for my lab on that website? So think to yourself, well, what could be the problem with that? And so one problem would be that, again, you might move. You might leave science. The PI that you work for now might move. My PI, while I was a postdoc, moved institutions. My graduate advisor retired in the last few years. So if we had put data on their servers, it would have been difficult to track down that data. In addition, another problem is that there are no real standards being enforced on these private servers. One of the advantages of posting sequence data to the Sequence Read Archive is that they have a standard called MIMARKS that specifies the type, the minimum amount of information that must be provided to describe your sequence data. So I could post data on my own website. In full disclosure, I have done that. It's not something I'm proud of. And we did post MIMARKS data. But at the same time, there's nothing requiring us to do that. And so, yeah, you can post your sequence data to your lab website, but there are no standards. There's no enforcement as to the quality of the data. And so if you post to a third-party website, like the Sequence Read Archive, there are certain requirements and certain rubrics that you need to follow so that your data can be posted there. Another option for, say, non-sequence data would be to post your data to a stable website. So one of the problems with posting to your lab server is that the lab server might move, the address might change, the PI might leave, you might leave. And so posting to a stable third party like FigShare is a useful way of making those data publicly available. And so the advantage of something like FigShare is that it provides a DOI, which is a digital object identifier, for each set of data. And so this is a persistent identifier that has a web link, a web address that goes directly to your data site.
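To make the idea of a DOI with a web address a little more concrete, here is a minimal Python sketch that turns a DOI into a link through the doi.org resolver. It is an illustration only: the function name is made up, and the DOI in the example is a placeholder in the FigShare style, not a real accession.

```python
from urllib.parse import quote


def doi_to_url(doi: str) -> str:
    """Build a resolver link from a DOI written in a few common ways."""
    doi = doi.strip()
    # Strip prefixes people commonly paste along with the DOI itself
    for prefix in ("https://doi.org/", "http://dx.doi.org/", "doi:"):
        if doi.lower().startswith(prefix):
            doi = doi[len(prefix):]
    # Every DOI starts with "10." and has a slash after the registrant prefix
    if not doi.startswith("10.") or "/" not in doi:
        raise ValueError(f"does not look like a DOI: {doi!r}")
    return "https://doi.org/" + quote(doi, safe="/")


if __name__ == "__main__":
    # Placeholder DOI for illustration; it does not point to real data
    print(doi_to_url("doi:10.6084/m9.figshare.0000000"))
```

The point is that the link is built from the identifier rather than from a server address, so it keeps working even if the data moves to a different machine.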
They also prohibit users from deleting data, right? So once you put it onto FigShare and you make it public, it's public, it cannot be deleted. And then within limits of file sizes, it's also free. So something you might be saying in the back of your head is, well, why don't I just submit this data as supplementary material for my publication? And you could do that, right? And again, I've done that as well. The challenge with that is that oftentimes when a journal publishes supplementary material, they convert it to weird formats, right? So I know people that have gotten sequence data out of supplementary material that's in a PDF. That's not going to be useful for any downstream analysis. Whereas if something's available in FigShare, you can post it as a FASTA-formatted file. Also, it would be great if all data was available from all published papers or if all data got published. But the fact is that not all data gets published, and we might like to have a way to access data from unpublished studies. There might be studies that are never published. And so using a tool like FigShare makes sure that our data gets out there, and that enables other people to use it. And again, because it has a DOI, people can then cite, and should cite, where they got our data from, okay? So an activity that we're going to do now is we're going to go into FigShare and we're going to try to find the data that was described in the Ravel and Wommack editorial from James Meadow and colleagues, okay? So I'm going to try to find the Meadow et al. article in PubMed. And I will then enter Meadow J[au]. And that was published in the journal Microbiome. And so this should get us pretty close. And yes, there's two articles that Meadow wrote in the journal Microbiome. And this is the one we want. And I'll go ahead and click on the Microbiome journal version. And what I'm looking for now is mention of the FigShare data. And so hopefully that's going to be provided in the methods section here.
So I'm kind of scanning this. I'm going to get rid of this pop-up. Go away. Aha. And so you see sequence files and metadata for all samples used in the study have been deposited in FigShare, okay? So this was one of the things we critiqued their paper for: yes, they make their sequence files and metadata publicly available, but it's in FigShare. It's not in the Sequence Read Archive. And that's great that the data are available. We'd like it to be in the Sequence Read Archive, again, because the SRA is going to impose certain standards on what type of metadata are provided. So we can go ahead and click on this link. And this then brings us to the Lillis classroom surfaces page at FigShare. We can see information about how to cite it. We can download it. It's about 200 megabytes. And it's tagged with different keywords. It's got a description of what's going on. They tell you how to cite it and where it's published. We see that there's a swabs.seqs.fna, as well as the metadata file. So one of the things I notice is that this is an FNA file, which tells me that this is the As, Ts, Gs, and Cs. It's not going to include for me the quality scores from the MiSeq sequence data. And so that's, again, one of the problems with posting it to FigShare rather than the SRA: the SRA probably would have required them to also post the quality score data. And that quality score data is now useful four or five years later for people that want to come back and re-analyze the data using more modern methods of denoising sequences. Also, I see that these probably aren't the raw data because it's a single FASTA file rather than the pairs. If they did paired-end sequencing of the 16S fragments, there should be two files, but there's only one here. So they've given us the 16S data, but it's not exactly the raw data.
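To show what is lost when only an FNA (FASTA) file is posted, here is a small Python sketch comparing a FASTA record with a FASTQ record for the same read. The reads are invented for illustration, the function names are made up, and the quality decoding assumes the common Phred+33 encoding. It is a sketch of the file formats, not of how any particular pipeline reads them.

```python
# Invented records for illustration only; they are not from the Meadow et al. data.
FASTA_RECORD = ">read1\nACGTTGCA\n"
FASTQ_RECORD = "@read1\nACGTTGCA\n+\nIIIIH@5!\n"


def parse_fasta(record: str):
    """Return (name, sequence) from a single FASTA record."""
    header, *seq_lines = record.strip().split("\n")
    if not header.startswith(">"):
        raise ValueError("FASTA header must start with '>'")
    return header[1:], "".join(seq_lines)


def parse_fastq(record: str):
    """Return (name, sequence, quality scores) from a four-line FASTQ record."""
    header, seq, plus, qual = record.strip().split("\n")
    if not header.startswith("@") or not plus.startswith("+"):
        raise ValueError("not a four-line FASTQ record")
    if len(seq) != len(qual):
        raise ValueError("sequence and quality lines differ in length")
    # Assumes Phred+33: each character's ASCII code minus 33 is the quality score
    scores = [ord(ch) - 33 for ch in qual]
    return header[1:], seq, scores


if __name__ == "__main__":
    print(parse_fasta(FASTA_RECORD))   # bases only
    print(parse_fastq(FASTQ_RECORD))   # bases plus one quality score per base
```

The FASTA record carries the bases and nothing else, which is why a FASTA-only deposit cannot be re-analyzed with methods that depend on per-base quality.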
So, again, don't get me wrong, it's great that the data are available on FigShare, but it would have been ideal for them to have been posted on the Sequence Read Archive. To absolve them a little bit, back when they submitted this, the Sequence Read Archive was probably really difficult to work with in getting data into the website. Through tools like those available in the mothur software package, it's now considerably easier to post your data into the Sequence Read Archive, and that's really what you should be doing. Okay, but, again, this is illustrative of the type of data that you can post to FigShare. You might think of any figure that you have in your paper and the data that you're plotting. You could make a file, you could make an accession like this, and you could post the data that's in that figure onto FigShare. So, say you've got a plot showing the diversity of different samples in people with healthy colons versus those with adenomas or carcinomas. You could post the spreadsheet that was used to make the plot to FigShare so that others could then take the data and do whatever they want with it. And again, this is another somewhat simple thing that you could do to increase the reproducibility of your analysis. So not only make your sequence data available at the SRA, but you could also post the data behind your figures. And as Meadow did, be sure that you also provide a link, or the DOI, to the FigShare accession in your paper. Typically that would go in the Methods section. So the next thing that you can do to improve the reproducibility of your research is to make your code publicly available. Many of us have no doubt had that problem where you read a paper and you see "custom scripts were used, available upon request from the PI." Well, as we've already mentioned, people move, people get hard to get hold of.
Sometimes that code isn't in really good shape. And so a tool that we can use to make our code publicly available, and it could be code or methods descriptions or anything like that, is a website called github.com. And so what I'd like to do now is go with you into GitHub, and I'm going to log out of my account. And so you should get a website, a web page that looks like this, unless you already have a GitHub account. And so it's a pretty attractive site. They kind of tell you a bit about GitHub. But what I want you to do right now is to go ahead and pick a username, enter your email address, as well as a password, and go ahead and click Sign Up for GitHub. One thing that some people try to do is that it's easy to get many usernames here. So you might have a username for your email, you might have a username for your Twitter account, you might have a username for GitHub. Sometimes it's nice to have a common username across sites, across platforms. That's not always possible, but try. So once you sign up for GitHub, it allows you to then sign in. I'm going to go ahead and get rid of that. I'm going to go ahead and sign in to my GitHub account. And this is a profile from my GitHub account, which if you're just signing up, yours will look much different than this. And so this is kind of the profile of somebody that's fairly active and using GitHub for a lot of their research. I'm going to go ahead and click in the upper right corner here to view your profile. And so you can add a picture. This is our lovely family pig, my name. You'll see I've put my ORCID ID into GitHub, my affiliation, contact information. And if you're just starting in GitHub, you won't have any repositories, you won't have any contributions. But over the course of this tutorial series, you'll start to create repositories and you'll start to have more information here. So we're going to go ahead now and create a repository. 
So we're going to click up here on this plus sign and we're going to click New Repository. And the repository name we're going to use is rr_practice. So, reproducible research practice. I think if you wanted to use a space, it's going to tell you that it's going to create it as rr-practice. So spaces are bad news when it comes to doing a lot of bioinformatics research. So I'm going to call mine rr_practice. For a description, which is optional, I'm going to say "a repository to test out my efforts to make my research more reproducible." You can put whatever description you want; for whatever reason, that's what I came up with. I'm going to make mine public. You should too. GitHub allows you to have unlimited public repositories for free. But you'd have to pay if you want to use a private repository. That being said, if you have an academic account, you can use GitHub with an academic account and get permission from them, or a setting on your account, to have as many private repositories as you want. Just be sure that by the time you submit your manuscript you switch your repository from being private to public. But for this exercise and for the purposes of this tutorial series, everything should be done in public. We're going to then also go ahead and click "Initialize this repository with a README." And for now, I'm not going to add a .gitignore file, and I'm going to add a license, the MIT license. So it should look something like this. And we're going to go ahead and click this wonderful green button. It's very exciting creating your first repository. And wow, here we go, your first repository. You might recall what the Meadow et al. repository looked like, and it looked vaguely like this. They had a few more files and I think a directory or two in there. And so now you have a license and you have a file called README.md. So go ahead and click on this blue README link. And this brings you to the README file as it's been rendered.
It's got a title, rr_practice, and "a repository to test out my efforts to make my research more reproducible." And so go ahead and click on the pencil, which gives you a balloon that says "Edit this file," and we see that this now looks like a text file. I can click in here and I get a flashing cursor. So you'll see this file is called README.md. The md is a shorthand that tells you that this is a Markdown-formatted file. And so that's different than, say, a .docx file, which would be a Word file. A .docx Word file is a binary-formatted file that's somewhat proprietary. An md file is a text file that has no built-in formatting, nothing, no special sauce. It's a plain text file. And so what you perhaps noticed was that previously the title of this file within the document was rr_practice. So we can go ahead and click Preview changes and we see that there's some formatting created here. So we can toggle back and forth and we can see that some formatting is applied, and you'll see a pound sign here to create that heading. And there are various tutorials that you can see online for formatting Markdown. It's meant to be pretty simple. I'm going to go over a handful of simple things with you. Type along with me. I'm going to put "this is a heading" after two pound signs. If I do a third one, "this is another heading." And here I'm going to put "here is some text to describe the overall goal of my data analysis plan." And if I come down I can say "here is some more text describing how to use Markdown." So we've created various headings, and if we look at them here we now see that we get different-sized headings depending on how many pound signs we use. These green bars in the margin are GitHub's way of telling us that we've added this information to the file. If we come back here, we've seen that we can change heading sizes with the number of pound signs. You can look at this text file and get a sense of the organization.
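To make the pound-sign rule concrete, here is a small Python sketch that reads a README like the one just typed and reports the heading level of each heading line. It is a toy under simple assumptions (one to six pound signs followed by a space counts as a heading), not GitHub's actual renderer, and the function names are made up for the example.

```python
README_TEXT = """# rr_practice
A repository to test out my efforts to make my research more reproducible

## This is a heading
### This is another heading
Here is some text to describe the overall goal of my data analysis plan.
"""


def heading_level(line: str) -> int:
    """Return 1-6 for a heading line, or 0 for ordinary text."""
    hashes = len(line) - len(line.lstrip("#"))
    # A heading needs 1-6 pound signs followed by a space
    if 1 <= hashes <= 6 and line[hashes:hashes + 1] == " ":
        return hashes
    return 0


def outline(text: str):
    """List (level, title) pairs for every heading in the text."""
    return [
        (heading_level(line), line.lstrip("# ").strip())
        for line in text.splitlines()
        if heading_level(line) > 0
    ]


if __name__ == "__main__":
    for level, title in outline(README_TEXT):
        # Indent by level so the document's structure is visible at a glance
        print("  " * (level - 1) + title)
```

More pound signs means a deeper, smaller heading, which is the same organization you can see by eye in the plain text file.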
I could add a third heading, or let's make it look a bit more like a paper. If I say this is Introduction, you can come down and say Results, Discussion, Methods. And so perhaps I'll delete this and clean this up a bit and say, under Methods, let's say, Experimental design. Down here I might say DNA extraction and amplification. And you could keep going. And in here I could say "we obtained 200 samples from various soils across the state of Michigan. We used a bead-beating DNA extraction kit. We amplified the 16S rRNA gene." Perhaps we say "we amplified the V4 region of the 16S rRNA gene." So we added some text. This looks vaguely like a paper or a lab report. We can then preview the changes and see that this is starting to take shape. So those are headings. They allow you to structure the information. Something else we can do, and this isn't typically done in a paper, is add different types of emphasis. So let's put a single star. We add a star before "repository" and we see that GitHub has nicely started to format "repository" for us to be italicized. And if you add another star at the end, we see the rest of the text goes back to being a normal font. So if we look at this, we now see "repository" has been emphasized, has been made italic. So let's put "my efforts" between two stars. And so you'll see already that GitHub in their interface here has made that bold. And so here again, GitHub is doing some other magic behind the scenes, where the strike-through is showing that it's deleting "my efforts" and adding "my efforts" in bold. And so down here maybe I'll do three stars. And as we can see from the magic sauce that GitHub is using, "describe" now is going to be bolded and italicized. And sure enough, "describe" is bolded and italicized. Another useful thing that I like to do when outlining is to perhaps make a bulleted list.
And so here in my Results section I'm going to say "describe the sites that I obtained samples from." And so you'll see that I'm using an asterisk at the start of the line to create a bulleted list. I'll then say "differences in richness between types of sites." And I'll say "differences in beta diversity between types of sites." This is not a super exciting paper, but I just want to illustrate the point that you can use these stars. Here in the text mode we can visually see that's a bulleted list, and we can also then preview the changes and see that, on the website at least, GitHub will format that to be a bulleted list. Okay? So this will get you far, using the stars and the pound signs and the bulleted lists, but there are other things that you might want to do, like adding hyperlinks or pictures or block quotes and things like that. So how do we figure out what GitHub uses and what Markdown uses for those features? So let's go to our friend Google, and I'm going to search for GitHub Flavored Markdown. And this link is going to be the helpful one, Mastering Markdown from GitHub Guides, and here you'll see a page where they have different examples. We've already talked about different types of headers, emphasis, types of lists. There are also images, links, things like that. Okay? What I'd like you to do next is, within this README file, go ahead and delete everything and create a title that says "How to fold a paper airplane." Okay? And what I want you to do is to take maybe the next 5 or 10 minutes, make a list, and write down the materials and methods section for a paper on how to fold a paper airplane. Okay? So maybe give yourself 5 minutes, and what I want you to do is go ahead and populate this file with instructions on how to fold a paper airplane. Great. Hopefully you now have instructions on how to fold a paper airplane.
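Before moving on, here is a rough Python sketch of the star syntax covered above: one star for italics, two for bold, three for both, and a star plus a space at the start of a line for a bullet. It converts a few lines to HTML with regular expressions. This is a toy for building intuition, assuming well-formed input. It is not how GitHub renders Markdown, and the function names are made up.

```python
import re


def render_inline(text: str) -> str:
    """Convert star-delimited emphasis to HTML, longest marker first."""
    text = re.sub(r"\*\*\*(.+?)\*\*\*", r"<strong><em>\1</em></strong>", text)
    text = re.sub(r"\*\*(.+?)\*\*", r"<strong>\1</strong>", text)
    text = re.sub(r"\*(.+?)\*", r"<em>\1</em>", text)
    return text


def render_line(line: str) -> str:
    """Treat a leading '* ' as a bullet, then handle emphasis in the rest."""
    if line.startswith("* "):
        return "<li>" + render_inline(line[2:]) + "</li>"
    return render_inline(line)


if __name__ == "__main__":
    lines = [
        "A *repository* to test out **my efforts**",
        "Here is some text to ***describe*** the overall goal",
        "* Differences in richness between types of sites",
        "* Differences in beta diversity between types of sites",
    ]
    for line in lines:
        print(render_line(line))
```

The order of the substitutions matters: the three-star pattern has to run first, otherwise the two-star and one-star patterns would consume parts of it.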
Now if we go to the bottom of the page you'll see a rectangle that says Commit changes. In soft gray text it says "Update README.md." So I'm going to type something in here that's a little bit more informative than "Update README.md." I'm going to say "write instructions to fold a paper airplane." That's it. I'm not going to put anything in this other box, and I'm going to hit Commit changes. Okay? And so now we see the result is "How to fold a paper airplane" and, at least, my instructions. Okay? So no cheating. Don't look at my instructions. Hopefully you have your own instructions at this point. Okay? So what I'd like you to do then is take these instructions, or take the link to these instructions from up here, say, and email that to a friend and have them go through your instructions on how to fold a paper airplane. If they're in the lab with you, have them show you the plane, or if they're somewhere else, have them take a picture and send you the plane. And then ask yourself, does what they folded look like what you had in mind from your instructions? Okay? And what you could then do is go ahead, come back, click Edit this file, and in here you could then modify your text to improve the reproducibility of your instructions. You see what I did there? Ha! So based on what they got right or what they got wrong or what they were confused about, modify your instructions to improve them. And I'm going to go ahead and say "obtain an 8.5 x 11 white piece of paper," and maybe down here I'll add a step 9. I'll say "use crayons or sharpies to decorate your plane." Okay? So I've added some changes. I can then come back down to Commit changes and then I can say "added instructions to improve the decoration," I'll say the design, the aesthetics of the paper airplane. Okay? So they encourage me to use fewer than 50 characters. I'm not going to worry about that at this point. I'll go ahead and hit Commit changes. Okay?
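What those two commits record can be sketched in a few lines of Python: two versions of the instructions, the lines that changed between them, and a check of the commit summary against the suggested 50 characters. This uses Python's standard difflib module as a stand-in. It is not Git and not GitHub's own comparison, and the folding step and step numbers are invented for illustration.

```python
import difflib

# Two versions of the README; step 2 and the numbering are hypothetical.
VERSION_1 = [
    "1. Obtain a piece of paper",
    "2. Fold the paper in half lengthwise",
]
VERSION_2 = [
    "1. Obtain an 8.5 x 11 white piece of paper",
    "2. Fold the paper in half lengthwise",
    "3. Use crayons or sharpies to decorate your plane",
]


def changed_lines(old, new):
    """Return (removed, added) lines between two versions of a file."""
    diff = list(difflib.unified_diff(old, new, lineterm=""))
    # Skip the '---' and '+++' file header lines that start the diff
    removed = [d[1:] for d in diff if d.startswith("-") and not d.startswith("---")]
    added = [d[1:] for d in diff if d.startswith("+") and not d.startswith("+++")]
    return removed, added


def summary_is_short(message: str, limit: int = 50) -> bool:
    """Check a commit summary against the suggested character limit."""
    return len(message) < limit


if __name__ == "__main__":
    removed, added = changed_lines(VERSION_1, VERSION_2)
    print("removed:", removed)
    print("added:", added)
    print(summary_is_short("write instructions to fold a paper airplane"))
```

Each commit is essentially a message plus a set of changes like this, which is what makes it possible to look back later and see how the instructions evolved.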
So now I have an improved version of my instructions. And the cool thing about this is, as you saw by sharing this with your friend, anybody can see it. These instructions are publicly accessible to anybody. The other thing we've seen is that we can constantly modify this. We can modify and improve our instructions over time. So I'll click on this History button and see how our document has changed over time. And so this, again, is very powerful in allowing us to disseminate, modify, and improve our documentation. So I'm going to go ahead back to my rr_practice. The nice thing about GitHub is that if there's a file called README.md, it gets shown on the home page for that repository. Great. So hopefully this was useful in thinking about reproducibility. Hopefully, you know, folding a paper airplane is something that at least a lot of Americans are familiar with. You might think about folding paper airplanes with people from other cultures and see if that's something they're used to. So let's go ahead and come back to our slide deck and debrief a bit from that activity. What were some of the challenges to reproducibly making a paper airplane? Perhaps you found you had to be specific about the shape of the paper. Perhaps you had to think about how you describe the pointy end or the shape of the airplane at different stages. You might have a mental image of what the plane looks like, whereas others don't have that same image. I think a lot of us have folded paper airplanes in our childhood or playing with kids, and maybe others haven't. And so if you've got that mental image, then you kind of assume things about people's prior knowledge. What could improve the reproducibility? Well, you could embed pictures in your README file, right? So if you provided more of a pictorial description of the method, then that would overcome a lot of problems. So think about how a tool like GitHub makes your method more reproducible. Well, as we described, your friend can use it, it's
publicly accessible, others could use it. And something that you might also think about: well, say somebody sent you this repository, what would you do? What could you do with it? Perhaps you could say, well, I'd like to get a bunch of different instructions for making different types of airplanes. I was always jealous of the kids that had really cool airplanes that were different than my kind of generic airplane. Perhaps you could compile a bunch of instructions to make an encyclopedia of paper airplanes. Or perhaps you could take my boring paper airplane design and you could bling it up a bit. You could give it fins, or you could adjust the weighting, or talk about adding a paperclip. So it's now open. It's something that you can now riff off of, as we've been saying in these tutorials. And so was this easy or difficult? I think you'll find that it's perhaps a little bit more difficult than you thought it was. You know, everyone knows what a paper airplane is. Everyone knows what PCR is. Everyone knows how to run a gel. Well, maybe you have certain assumptions about what other people know how to do, and those assumptions might not actually be correct. And so where we're going again with this tutorial series is to making things as reproducible and automated as possible. And so I found this video on YouTube that I think is pretty cool, where somebody programmatically engineered a way to fold the paper airplane, right? So you can imagine that if you want a reproducible airplane, this thing is going to give it to you. I'm sure they could churn out 100 paper airplanes in a minute and they'd all be about the same. And so that's really where we want to go with our data analysis. So what I'd like you to do, turning back to data analysis for microbiome data, is think about a figure from the last paper you wrote or that was in your favorite paper. What would you need to tell someone else how to create that figure? So instead of making
instructions for a paper airplane, now what instructions would be needed to make that last figure? What would be the benefits of creating a script to tell someone how to generate that figure, and what would be the challenges? So again, these are topics that we're going to be going over in the rest of the tutorial series, but it's good to start to formulate ideas about your answers to these questions and how you would overcome those challenges, and how not being able to answer them or not being able to overcome them perhaps limits our reproducibility and limits the ability of others to build off of our work. So finally, I have a little exercise for you, where I've got on the left six different tools and on the right six different applications. I'm not going to do this as part of the video. The answers are there if you hit the P key, but I'd like you to match the different tools with the different applications. Well, I really hope that you enjoyed thinking more about some simple first steps that you can take to improve the reproducibility of your data analysis. As I mentioned in the last tutorial, if you make it all the way through the tutorial series, there will be an opportunity for you to receive a virtual badge indicating that you completed the material. To receive that badge, a later activity will ask you to demonstrate that you have done the activities in each of the tutorials, things like getting an ORCID and GitHub account and posting the repository that contains the instructions for folding your paper airplane. Today's motivation was thinking about incremental steps that you can take to improve the reproducibility of each analysis you work on. That being said, in the remaining sessions the topics are going to get a bit more technical and a bit more involved, but they should still be quite accessible even to the novice. By the end of this tutorial series, and with much practice, you'll be what I call a full stack reproducible research data scientist.