Hi there and welcome back for the final lesson in the Reproducible Research Tutorial series. I hope you enjoyed the last two lessons on how to use the tools we've been discussing in this series to improve how you collaborate with yourself and with others. If you haven't already filed your pull request, I'd love to see yours so that we can add you to the Reproducible Research Honor Roll. As we've discussed throughout this series, documentation is a critical tool for maintaining the reproducibility of our analyses. We saw this in discussing project organization, the use of version control to map the history and development of your project, the use of README files to provide information to people interested in our project, and literate programming tools that allow us to blend code with narrative documentation. The theme of transparency has been right there along the way in our discussions of documentation. Sure, you can provide documentation for yourself and for current and future members of your research group, but what about people outside your collaborative network who want to see what you've done? Today, as we finish this series, we'll talk about the importance of openness in doing science. This theme of transparency affects how widely our research, and the work that went into our paper, spreads. If we operate in a closed manner, the work is unlikely to spread as far as if we operate in an open manner. This idea of openness touches on everything, from making our data and code accessible to publishing in journals that anyone can access. Do you want to clutch your data like someone might clutch their pearls? Probably not. Or do you want to release your data and code for others to incorporate into their own analyses or riff upon? Hopefully by now you can anticipate which position I would take.
Join me now in opening the slides for today's tutorial, which you can find within the Reproducible Research Tutorial series on the riffomonas.org website. Before we start talking about open science, we're going to do one of our pop quizzes. Recalling the last lesson, can you remember why it's called a pull request rather than a push request? Why is it even called a pull request at all? As you might remember, what we're doing is asking another developer to bring our code into their code, and so we're asking them to pull from our repository into theirs. That's why it's called a pull request. And remember, the nice thing about GitHub is that we can work in the open, we can develop, and we can solicit feedback from other people, but we can limit who can directly contribute to our repository. If somebody doesn't have permission to contribute to our repository, then they need to file a pull request. We might also work in a way where we don't let ourselves push directly to the repository either. We might require ourselves to file a pull request, and in some research groups they might require another pair of eyes to look at the code before that pull request is accepted. Second question: we're at the end of the series, and we're not going to be going back into our Kozich analysis or logging into the Amazon image today. So what do we need to do now? Well, if you recall, when we're done with the day's work, we use stop to stop the instance, and stop maintains the hard drive, so to speak. All the files are still there when we stop. If we terminate, that deletes the entire instance. That's perhaps where we're at now, and we could of course use FileZilla to download all of our files from Amazon to our local hard drive. There are also other services at Amazon, things like S3 or Glacier, for storing data. You might decide to transfer your data to one of those services.
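To make those wrap-up steps concrete, here's a rough sketch of the final backup push at the command line. Since a worked example can't reach github.com, a local bare repository stands in for your GitHub remote, and the file names are made up:

```shell
set -e
# Sketch of the final backup push before terminating an instance.
# A local bare repository stands in for GitHub; in practice, "origin"
# would already point at your repository on github.com.
scratch=$(mktemp -d)
git init --quiet --bare "$scratch/github.git"
git clone --quiet "$scratch/github.git" "$scratch/project"
cd "$scratch/project"
git config user.email "you@example.com"
git config user.name "Your Name"
echo "final results" > analysis.txt      # stand-in for your project files
git add analysis.txt
git commit --quiet -m "Back up final state before terminating the instance"
branch=$(git symbolic-ref --short HEAD)
git push --quiet origin "$branch"        # the final push up to "GitHub"
backed_up=$(git ls-remote --heads origin)
echo "$backed_up"
```

Once that push succeeds, terminating the instance only destroys things you've already backed up.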
And once you've done that, be sure to do a final push of all your repositories up to GitHub so that your code is backed up. Then you can literally burn the whole thing down, as we've been saying, using the Terminate Instance option in the EC2 dashboard. So for today, we're going to talk about the value of doing open science, and we're going to identify the steps you can take to maximize the discoverability, and ultimately the riffability, so to speak, of your work. As a case study, if you saw the Google Doodle from the title slide of this talk, there's a case study involving Antonie van Leeuwenhoek and his paper concerning "little animals." This is a really seminal paper in microbiology, because he developed a microscope, and precise methods of grinding the lenses, to see things that nobody had seen before. But his description of these quote-unquote little animals aroused a lot of suspicion. People didn't totally buy the idea of these animalcules. And why was this? Well, he wrote in Low Dutch, which had to be translated to English, and it was heavily edited by someone else. So it's kind of like the game of telephone, where information is lost along the way. He also refused to give any description of his methods. He didn't have very good methods sections, and he didn't want to teach people his methods of grinding lenses. And so no one learned them. You can imagine it would be hard to reproduce his work if the author isn't providing information about how the work was done. So after he submitted this, Hooke, Huygens, and others were unable to reproduce his work; it would be considered not reproducible. Eventually Hooke was able to validate the results using a compound microscope. And because he made his methods open, he then popularized the compound microscope, even though it was probably not as good as Leeuwenhoek's microscopes. The resolution and the quality of the lenses were inferior to what Leeuwenhoek made.
But again, because Leeuwenhoek didn't make things open, his methods were forgotten, because no one could replicate what he did. No one could reproduce what he had done. Here are images of the two microscopes: on the left is the Leeuwenhoek microscope, and on the right is Hooke's compound microscope, which lives with us even today. I think we want to ask ourselves, do we want to be on Team Leeuwenhoek or Team Hooke? Both of them were fantastic scientists, leading minds in scientific history, really. Both of them did rigorous and strong science. But Leeuwenhoek's just wasn't reproducible, not because it was wrong, but because he wasn't transparent, because he wasn't open, whereas Hooke was. Hooke made his methods accessible, and so his method took over, even if his microscope perhaps wasn't as good as Leeuwenhoek's. So the punchline here is that science done in the open has a bigger impact. There are studies showing that open papers are better cited. They're easier for others to build upon. And openness builds transparency, which we've said is a critical component of reproducible research. So we have a variety of tools to maximize the openness of our work: things like licenses and copyrights. Sometimes we think of licenses and copyrights as restricting things rather than opening them, but we'll talk about how they open things. We can use public versus private repositories on GitHub. We can use files in our repository, like a CITATION or a CONTRIBUTING file. We can make our repositories more discoverable by advertising them in our papers or by giving them tags that allow other people on GitHub to find them. We can post our manuscripts as preprints. And we can publish in open access journals. So I'm going to go through each of these bullets and briefly discuss how we can use these tools to maximize the openness, and hopefully the reproducibility, of our work. So, licenses and copyrights. This oftentimes gets very confusing. I am not a lawyer.
I am not trying to provide legal advice. I'm trying to distill what I've read from a number of different sources and what my own research group does. If a license or copyright is not provided with code or text, it is presumed to be closed source, with all rights remaining with the author. So it is not open. If you don't see a license in somebody's repository, that code is closed source. It is not open for others to use, even if it sits in a public repository. So it's important to have a license and to state your copyright provisions so that what you intend is explicit. Don't write a license from scratch unless you're a lawyer; but even then, why bother? There are numerous options available. Admittedly, this can get very confusing. These are some of the resources and links that I've consulted as I think about licensing and what I'm trying to do with our academic contributions. So, some legalese to get some definitions in here. A copyright declares and proves who owns the intellectual property, whether it's the code, the text, whatever. And this exists without you actually asserting it. If you write a blog post or you write a paper, the copyright is yours. Usually, when you write a paper and lose the copyright, it's because you signed those copyright privileges over to, say, the journal or to someone else. But if you produce something, you own the copyright. You don't have to assert it, but it's still good to assert it explicitly. Licensing, on the other hand, describes the terms under which people are allowed to use the copyrighted material, and you have to grant it to others. If you haven't granted a license to others, then they don't have one; the work is closed, and it's yours. Some important points: if you use other people's code in your code, and it has the GNU General Public License, or GPL, which is very common, then your code must be licensed under the GPL as well. That's one of the provisions of the GPL.
Oftentimes people say, well, I should retain all my rights, and I shouldn't put a permissive license on my repository because I could make money off of it. That is wrong. You are highly unlikely to make any money off of your code or most of your academic products. I'm sorry to be the bearer of bad news. mothur is a highly cited, highly used software package, but even then, I think it would have had significantly fewer citations if I had charged people for it. And I've gotten far more in grant funding to support mothur than I would have gotten in commercial income. That being said, a closed or restrictive license is more likely to reduce the number of citations than a more open license. The more freedom and flexibility you give people, the more they're going to use your work. A common license that people use for text and other creative materials is the Creative Commons license. Sometimes you'll see it as CC-BY. This does not cover source code. But if you look at manuscripts or papers, a lot of papers that are open access, like you might find in the journal mBio, are licensed under CC-BY. That means it's a Creative Commons license, and the BY means attribution: people can use the material to do whatever they want, but they have to say that they got it from you. They have to provide attribution. But again, it does not cover source code. So we could not license mothur or the code for our last paper under CC-BY, because it doesn't cover source code. For code, we generally use, or try to use, the MIT license. It requires that all copies of the licensed software include a copy of the license and the copyright notice. We still retain the copyright, but it's fairly permissive and allows people to do whatever they want with the code, as long as they include my copyright information in anything they do. For papers and preprints, we'll generally use the CC-BY license.
These require attribution, and they allow others to do whatever they want with the material. They could take a figure from your last paper and put it into their paper or their book, or they could make a big poster and sell copies of it and make money. Those are all allowed by the CC-BY license. There are more stringent or restrictive Creative Commons licenses that, say, forbid commercial use, or things like that. But in general, I find that it's far more rewarding and beneficial to the original author to be as permissive as possible. You will still receive attribution if you put your work under one of these licenses. Oftentimes people think "all rights reserved" is the preferred approach, but that limits what other people can do with your work, because it basically says, no, you can't use my work to make derivative products, say, to take a plot and refashion or repackage it. It restricts what people can ultimately do. So one of the CC-BY-type licenses is really the way to go for papers, preprints, and other textual research products you might make. MIT or, say, GPL are the licenses to generally use for your code. If you look at our repository for the Kozich analysis paper, you'll notice that there's a LICENSE.md file in it, and this is what the text says. This is the MIT license, and this is the entire license. It says, "The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software." If you wanted to edit the license to, say, add your PI, add the names of co-authors, or change the copyright year, you could click on the pencil to do that. You can also change the actual license. Again, this is a Markdown file, so you can of course edit it. We could have done this in our Amazon instance as well, but GitHub makes it nice, because you can also choose a different license template by clicking on this button.
Doing that brings you to a list of a large number of different licenses you might want. GitHub also has information about what types of licenses to use. Again, the ones I mentioned were the GNU General Public License version 3 and the MIT license. You could pick one of these licenses, and it will update the license in your repository. And if you change it there, be sure you do a git pull to bring the most recent version of the license down to your local copy of the repository. So, on the issue of repositories, we've been working in public. If you are an academic user and you contact GitHub through this link, GitHub will give you an unlimited number of private repositories. Private repositories allow you to restrict who can see your work; you could even restrict it so only you can see it. It's OK to work with a private repository, but you need to eventually make it public before you submit. These practices might vary by research group, and even within a research group. I'm totally fine with my repositories all being public from the day I start to the day I publish the paper. Other people in my research group want theirs to be private until we submit. I'm fine with that; that's their choice. Also, all the mothur development is on GitHub, and that's all worked on in public too. You can get the bleeding edge of mothur by going to the repository, downloading it, and compiling it yourself. You can see all the changes we're making in real time. I generally think people have better things to do with their time than to spy or snoop or be entertained by watching my projects develop on GitHub. So I think the odds of anyone scooping you if you work in public are very low. It's probably more likely that you'll get unsolicited help. I was working on a paper once where I got a pull request to edit some of my manuscript.
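Here's a small sketch of that git pull step. Again, since an example can't use the real github.com, a local bare repository stands in for GitHub, and a second clone stands in for the commit that the web editor makes on your behalf:

```shell
set -e
scratch=$(mktemp -d)
git init --quiet --bare "$scratch/github.git"   # stands in for GitHub

# Your local copy of the repository, with a LICENSE.md already in it.
git clone --quiet "$scratch/github.git" "$scratch/local"
cd "$scratch/local"
git config user.email "you@example.com"
git config user.name "Your Name"
echo "MIT License" > LICENSE.md
git add LICENSE.md
git commit --quiet -m "Add MIT license"
branch=$(git symbolic-ref --short HEAD)
git push --quiet origin "$branch"

# Simulate editing LICENSE.md through the GitHub web interface, which
# commits directly on the remote's copy of the repository.
git clone --quiet "$scratch/github.git" "$scratch/web"
cd "$scratch/web"
git config user.email "you@example.com"
git config user.name "Your Name"
echo "Copyright (c) 2018 Your Name" >> LICENSE.md
git commit --quiet -am "Add copyright holder via the web editor"
git push --quiet origin "$branch"

# Back in your local copy, pull the updated license down.
cd "$scratch/local"
git pull --quiet origin "$branch"
updated=$(tail -n 1 LICENSE.md)
echo "$updated"
```

The same `git pull origin <branch>` is all you'd run in your real repository after editing the license on github.com.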
I was kind of blown away that anyone cared, but it was also kind of funny, because I wasn't even through the first draft of the manuscript. So it was cool that somebody did that. And that's fine, I think, to get that unsolicited help. I just told them thanks for the input, that I was still working through it, but that I'd be sure to incorporate their comments as I went. Another thing we can use is a CITATION file that we put in the root of our GitHub repository. This tells people how to cite our work. For example, for the Kozich repository, we could say: to reference this paper in publications, please cite the following. This would be one way to cite it, and then this would be the BibTeX format that we talked about when we covered literate programming tools. People could copy and paste it into their references.bib file and then cite Kozich 2013 in their paper. It's another way to tell people how they can cite your material. You're being proactive in telling people how you want your work to be recognized. We talked about this a bit in a previous tutorial, but you can also put a CONTRIBUTING file into your GitHub repository. This puts a link on your issues page, as well as on pull requests, telling people the standards you expect them to follow in terms of behavior and coding. It also signals that you're looking for people to contribute to your code. If you don't want people to contribute to your code, get rid of this file, or put something in there that says, hey, I'm excited that people would want to contribute to this, but for the time being, let me finish the project, and if you're interested, email me before you go through the effort of making a contribution. But again, it's a proactive way of telling people you're interested in engaging the community to make your work better. This is not required.
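Coming back to the citation file for a moment: the CITATION.md for the Kozich repository might contain something along these lines. I've reconstructed the BibTeX entry from the published reference, so double-check the details against the journal's page before reusing it:

```bibtex
@article{Kozich2013,
  author  = {Kozich, James J. and Westcott, Sarah L. and Baxter, Nielson T.
             and Highlander, Sarah K. and Schloss, Patrick D.},
  title   = {Development of a dual-index sequencing strategy and curation
             pipeline for analyzing amplicon sequence data on the {MiSeq}
             {Illumina} sequencing platform},
  journal = {Applied and Environmental Microbiology},
  year    = {2013},
  volume  = {79},
  pages   = {5112--5120},
  doi     = {10.1128/AEM.01043-13}
}
```

With an entry like that in place, anyone can paste it into their own .bib file and cite Kozich2013 from their manuscript.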
Like I said, with our own work, we rarely get any feedback from anybody before we submit the paper, even though we tend to work in the open, or at least I tend to work in the open. But again, it's a signal. It's like the Bat-Signal: it says, hey, we're looking for help, we'd love to have you contribute to what we're doing here. You can also go out of your way to make your repository discoverable. In your manuscript, you can put the URL to the GitHub page, and this will point people to the actual reproducible version of your manuscript. You can obtain a DOI, a digital object identifier, for your repository. Zenodo is a service that will do this; it interacts with GitHub to create a permanent link to your repository. This limits the risk of link rot, as we talked about in one of the first tutorials. You can version your DOIs so people can link to a specific instance of your repository. Say you're working on a project over time and I want to cite your work: I could cite one version of the repository with a specific DOI, even though you continue to work on it, putting out new versions of the repository or the project under different DOIs. So then the repository, and not just the paper, becomes citable, and those citations can be tracked. We can see the repository as a kind of continuation of the paper, as well as the source of the paper, right? We have this living project that continues on. And again, having that CITATION file will help people know how you want to be cited. Do you want us to cite the paper? Do you want us to cite the repository? Do you want us to cite both? Finally, a way we also talked about in an earlier lesson for making your repository discoverable is to tag your repository. To do this, if you look at the top of your page, you can click Add Topics and type in reproducible-paper, and this will create a tag that is then linked to your repository.
And you could add other tags. I could add a tag for MiSeq, I could add a tag for 16S rRNA gene sequences, whatever. It's a way that people can now click on reproducible-paper and find other repositories where people are trying to be reproducible in their methods, using methods a lot like what we've been talking about in this tutorial series. So, moving from the code to talking about how we disseminate our resources: one tool that's come online in the last few years and has really been growing in popularity is the preprint. In the traditional peer review process, your work is seen by maybe two reviewers and an editor. They might spend a couple hours each on your manuscript, and the process tends to be more adversarial than helpful. They're gonna tell you a laundry list of things that are wrong with your paper and very few things that are good about it. It also tends to be quite secretive. You generally don't know who your reviewers are, and sometimes you don't really know who the editor is unless your paper is finally accepted and published. So it's very adversarial, very secretive, and it can be quite slow, right? You submit a paper to a journal, and even though they tell reviewers they have two weeks to review it, it might take two months to get the reviews back. Meanwhile, you've got students or postdocs who wanna get on with their careers; they're trying to find jobs, and they need these papers to indicate that they've been productive while they've been in a research group. So it's kinda slow. Alternatively, preprints provide a supplement to peer review. They're open for anyone to see, and they foster high engagement. If I post my preprint, it's out there and anyone can see it. I can advertise it on Twitter, on Facebook, wherever. I can email it to friends or people in my scientific network. And I can engage people.
There's a discussion forum for each paper where you can leave comments, and then the authors can come back and rebut or follow up on your comments. I've left comments on a number of papers, and although I always fear that the authors are gonna be mad that I've commented on their manuscript, in general they're all very happy to see that someone cared enough to provide a review. At least from the feedback I've gotten, I've helped make their papers better. This becomes collaborative: instead of an adversarial process where we're trying to tear things down, I'm really trying to help their paper. And because I'm doing it in a public way, perhaps the level of snark or vindictiveness that goes into a typical peer review is much less, because it's gonna be public. You can see all my reviews on bioRxiv. It's also immediate. I post a paper to bioRxiv and it's up within 24 hours. You post a comment, and I'm gonna see it within 24 hours. There is some curation to make sure that people aren't posting garbage or poor reviews, but it's a very immediate process. And I really have to emphasize the collaborative and much more forward-looking nature of preprints compared to the traditional approach. In the traditional approach, you're responding to what people said so that you can get the paper accepted, whereas with preprints it's much more forward-looking, and you're trying to make the papers better. When my research group does journal clubs, we've occasionally picked a preprint; we'll assemble all our comments, and then someone takes the job of making sure the comments flow together into a good review. I'll generally work with that person to help them learn how peer review works, and then I'll post the comments under my name, in case people are worried about retribution.
And again, I have not seen any cases of retribution for comments left on preprints. I think that's really one of the benefits of preprints: people are posting their science in a preliminary form, perhaps weeks before, or maybe at the same time as, submitting it to a journal, and we can give substantial and meaningful help and feedback to those authors about what we thought. It's always exciting to see new preprints come out, because it's very much the bleeding edge of science. In 2017, I wrote a paper published in mBio called "Preprinting Microbiology," where I went through and looked at the types of papers that had been submitted and posted to bioRxiv. You can see that the rate here is growing; in the last year, it's grown even more, so I need to update this slide a bit. If we compare the Altmetric scores for papers published in mBio to those posted on bioRxiv, we see that papers posted to bioRxiv are having a higher Altmetric score than those, in red here, for mBio. The Altmetric score quantifies non-citation-related metrics: how many people have blogged about a paper, how many times it has shown up on social media, whether it has shown up on Wikipedia, whether it's being picked up by the media, these types of things. So things on bioRxiv, you could argue, at least by this score, are making a bigger splash than things published in mBio over a similar period of time. We also looked at the papers from 2014 and 2015 that were posted to bioRxiv and then published. There were about 155 papers posted to bioRxiv and then published in 2014 and 2015, compared to the 851 published in mBio, and the bioRxiv papers actually get a few more citations on average than the mBio papers, which is intriguing.
I mean, I wouldn't read too much into that, but I would hope these data indicate that the preprints on bioRxiv can really hold their own, and that these are solid pieces of scholarship that are contributing to and having an impact on the field. I have a colleague here at Michigan who has a preprint that's been cited several times, and they're still struggling to get the damn thing published, right? So if you feel like there's something wrong with our publication model, that it's closed, that it's adversarial, that it's slow, well, think about preprints. Again, it's a way to improve your transparency, to put your work out there in a still somewhat preliminary state, before it's gone through peer review, to get feedback from people on what works, what doesn't work, what could be improved, and ideas for other analyses that people might add. This is very powerful. Finally, we can publish in open access journals. It's been shown that papers published under an open access model have higher accessibility, more citations, and a greater likelihood of being built upon by others. If your paper is in a journal that I can't get access to, well, I'm sorry, but I'm not going to cite it, because I can't read it. Not everybody has a library that's flush with funding. Many libraries are really feeling the squeeze as their budgets are cut, and so they're cutting subscriptions to these very expensive journals. The model is then flipped under open access: traditionally, we ask the readers to pay for the paper; under the open access model, we ask the authors to pay for the publication. If you receive NIH funding, or funding from a variety of other sources, there must be an open access version posted. You do that either by publishing in an open access journal or, with NIH, by depositing to PubMed Central, where they keep a PDF version of either the original manuscript you submitted or the published version of the manuscript.
There's also a variety of models that journals are using to make their work open access. There are journals that are entirely open access: journals in microbiology like mBio or mSphere are open access through and through, and everything they publish is open access. The American Society for Microbiology also has journals, like Applied and Environmental Microbiology or Infection and Immunity, that are on a subscription model, but you can pay extra to have your paper published as open access. Under this open access model, you retain the copyright; the journal does not own it. But keep in mind this can also be quite expensive. If you publish open access in an ASM journal, it's gonna cost you about $2,300 if you're a member. If you wanna publish in Nature Communications, it can be $5,200, right? As an illustration of how silly the non-open-access model has gotten, here is the Nature page for the classic Watson and Crick paper describing the structure of DNA. Published in 1953, if you wanna read it, you're gonna have to pay 20 bucks, OK? This is a paper that's 65 years old, and they're still charging people 20 bucks to read it. Under an open access model, this would not be a thing, right? Watson and Crick would have paid some amount of money back in 1953, and people in perpetuity could read their paper. Now, this is obviously a topic and a paper that had earth-shattering ramifications. But if you have a paper that you want people to read, and you're afraid they're not gonna read it, well, making them pay for it is not going to encourage them. Getting it out there under an open access model as much as possible is really going to help the spread of your ideas, and again, the transparency and rigor with which you've done your work. So finally, I have a series of questions and exercises that I'd like you to think about.
I'm not gonna provide answers, so you're gonna have to come up with them on your own, and some of the questions don't really have perfect answers either. Within your research group, look at your last three papers: who owned the copyright? Go find the papers. What was the copyright? Is it you? Is it the publisher? Was it CC-BY? Was it some other copyright that the journal owns? I want you to go out and find the licenses used for R, mothur, and QIIME, and compare and contrast those licenses. It'd be great to have a discussion with your PI and the rest of your lab about which software license you think best fits your projects. Should you be using the MIT license? Should you be using the GPL? Is there a reason why you think you shouldn't have a license? Do your PI and your research group have a preference for whether to publish your research under an open access, open source license, or under a closed license? What is the reasoning behind that? Is it purely economic, or is there something deeper, something else? I'd also like you to read and discuss the ideas put forth in my paper, "Preprinting Microbiology," at one of your research group's lab meetings. What would it take to submit your next paper as a preprint, if your group isn't already doing that? And if you've already posted a preprint, what was your experience like? How can you improve that experience? How could you get more people engaged in reading and commenting on your preprint? Throughout this tutorial series, we've talked a lot about principles and tools we can rely upon to make our research more reproducible. Today's material is far more about setting a tone in our scientific culture that fosters openness. I know it's easy for many people who advocate for open science to take on a bit of a self-righteous tone, and that tone can be very off-putting.
We need to keep in mind that it's a very real fact that publishing in open access journals is more expensive than publishing in journals that depend on subscription fees for their revenue. Also, a lot of the concepts I've discussed, things like data and code release and preprints, require researchers to break from a traditional model where our data and our code were our property and we had to protect ourselves and the data from being scooped. What I'm encouraging you to do instead is to be transparent so that others can stand on your shoulders to see further. In my experience, I have benefited far more from being open than I have from being protective or paranoid. Of course, you might be working in someone else's lab, and you don't get to set the policies for the research group. The PI or your collaborators may not be open to the level of transparency I'm advocating for. Unfortunately, that's just something you're gonna have to navigate. As I said in the first lessons of this series, don't feel like the material in this series is an all-or-nothing proposition. It's okay to go slow and to add one or two elements that we've talked about with each study. That will hopefully be a strategy that your PI and collaborators can also support. And that's the message I want to leave you with. There's been a lot baked into this series of lessons. Don't feel like you have to do it all at once. Pick off a few things that you can do on the project you're working on right now. Then, with the next project, pick off a few more. Before you know it, you'll be right where you wanna be. Finally, I really appreciate you sticking with this series all the way to the end. There's been a lot of content here. I've really had a lot of fun developing these materials and rolling them out to you as videos and slide decks. Feel free to let me know what you liked and what you could use more of.
I hope to create a few other video series that show how I use these ideas, and perhaps spotlight new tools we can use to implement the ideas I've covered in this series. Tools facilitating reproducibility are coming at us rather quickly, and they keep striving to make it easier to foster reproducibility in our work. Be sure to get me those pull requests, and I'll talk to you soon.