Hi, my name is Pat Schloss. I'm a professor in the Department of Microbiology and Immunology here at the University of Michigan. I'm really excited about a new project that I've been working on that deals with reproducibility and how we can use various tools to improve the reproducibility of our analyses. These tutorials will be available at our website as a series of slide decks. Over the next few weeks, I'll be releasing videos of me talking you through the slides and doing demos of the tools and practices I discuss. The tutorials that I'll be presenting I've previously taught at a variety of workshops. These workshops have gone pretty well, and I've used the feedback from those sessions to remove and add content to make them better. Like I said, I'm really excited to share them with you. If you have any comments, please don't hesitate to contact me or to leave a comment on YouTube. Perhaps we can circle back to some of these questions and comments in a future tutorial. The title of this project is the Riffomonas Project. The name comes from the practice in music where one musician takes a theme from themselves or others to vary it, layer things on top of it, and perhaps present it in a different context. Similarly, I hope to show that one of the benefits of focusing on making your research more reproducible is that if others can reproduce your work, they will be able to use your methods on their data, or perhaps their methods on your data, to help move science forward: scientific riffing, if you will. My hope is that we can approach reproducibility from a positive perspective rather than a negative one. Instead of seeing research as being in a state of crisis where it is perceived that people are doing garbage science, perhaps we could see reproducibility in a positive way.
We can appreciate the effort that scientists go through to make their research more reproducible and fodder for our own scientific riffing. The concepts and tools that I'm going to be talking about are pretty general and can be used in many types of data analysis. Because I'm a microbiologist who's interested in the role that bacteria play in shaping human health and disease, I'm going to use examples throughout the series that are taken from microbial ecology and the human microbiome literature in particular, but I'll also use a series of goofy examples, including folding paper airplanes and predicting people's ages based on their names. There should be something here for everyone interested in making their data analyses more reproducible. Come along with me and let me show you the project's website, where you can find the slides that we'll be using in this series, and let me begin to introduce you to the content of the tutorial series that we'll be doing over the next few weeks. Wonderful. So the first thing I want to introduce you to about the Riffomonas project is our website, and you can get to it by going to riffomonas.org. This will be a launch pad for disseminating instructional materials and information about how we can improve the reproducibility of our research in microbiology, and eventually perhaps other fields as well. If you look up here at the top navigation bar and click on Training Modules, you'll see there are currently two training modules in here.
The first is a module called Minimal R. This is an R tutorial that I regularly teach from to help people get up to speed with R. It assumes they know nothing as they start, but it gets them going without overwhelming them with a lot of features and jargon off the bat; it really is the minimal R. People find that over the course of doing several of the modules within the Minimal R series, they very quickly get up to speed on a diverse array of features within the R programming language. But what we're here to talk about is the reproducible research tutorial series, and as it says here in the intro, this is a series of tutorials on improving the reproducibility of data analysis for those doing microbial ecology research. Now, the data set that we're going to be working with a lot through this series is from human microbiome research, but that's really not that relevant. What's important is that we're working with microbiome data, microbial ecology data, sequence data, any type of data, but data in particular that needs to be analyzed through a complex series of steps, and we're going to think about how we can make these analyses more reproducible. Again, because my group develops the mothur software package, we're going to be using mothur. It's not a requirement that you know mothur or R, but it would certainly help as you move through these tutorials; I'm not going to be teaching you R or mothur in this series. What we are going to learn are a series of practical tools, but also concepts for thinking about reproducibility and the factors that impede our ability to carry out reproducible research. And so we'll use tools like the bash command line, we'll use high-performance computing clusters, we'll talk about scripting with mothur and R, and we'll use a tool called version control, specifically Git and GitHub.
We'll talk about automation using a tool called Make, and a concept that has really been a big contributor to my research group, which is literate programming, using R Markdown. Again, these are things that my lab is using; I created these tutorials initially to onboard people coming into my research group. And so now my goal is to give them to you to help onboard you, so to speak, into the area of microbiome research and making it reproducible. Before we launch into the initial tutorial, I want to call your attention down here to these dependencies. A big pain in making analyses reproducible is knowing what software we are using and whether we have the right versions. We'll talk about all of these things as we go along, but as we get into the computational aspects of this series, we will primarily be using Amazon Web Services. The cost will be fairly minimal, but you might also want to try this on the high-performance computing facility at your institution. Here at Michigan, we've got one called Flux; you might have one at your institution, and generally they're a bit cheaper at your home institution than they are on Amazon. Alternatively, you could also run everything on your laptop without going to one of these clusters, but I think in the long run it's going to be worth your effort to learn how to use computing resources like Amazon or your local computer cluster. Regardless of where you run things, you're going to need certain software. If you're going to work locally, or on your local high-performance computing cluster, you're going to need tools like R, Make, Git, wget, and Atom or Nano installed. Part of my justification for using Amazon is that all of these tools are generally already installed, so if you're trying to bring this up on your own laptop or your own computer, there's a bit of frustration in getting going.
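If you do decide to work on your own machine or cluster, one quick way to see which of those command-line tools are already present is to loop over them with the standard `command -v` check. This is just a sketch; Atom is a graphical editor, so only the command-line tools from the list above are checked here:

```shell
# Check which of the tools mentioned above are already installed.
# Prints "found" or "MISSING" for each; adjust the list as needed.
for tool in R make git wget nano; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```

On an Amazon instance these should generally all report "found"; anything missing on your own machine can usually be installed with your system's package manager.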
Okay, so let's go ahead up here to the tutorials and click on this first link for the introduction. This is the format of the slides I will be using in this tutorial. These are HTML-based slides, and if you want, you can hit the F key to make them full screen. Also, as it says down here in the lower left corner, you can press the H key to open the help menu, which shows you how to navigate around the slides. One shortcut that you might find useful is the P key: hitting P brings you to this presenter view, and hitting Escape gets you out of it. You might be familiar with this from a tool like PowerPoint, where on the left side are the slides I'm going to talk about and on the right side are the notes. If you want to follow along, you can see the notes that I'm using here. Okay, but I'm going to get out of the presenter view and go ahead and open this up. All right, so let's get going. The goal of this introductory session is to be relatively light and to help orient you in where we're going to go. I'm going to summarize the motivations for the series, help you understand where the tutorial series is going and what we're going to get out of it, and then I'm going to give you some preliminary readings. There are a number of them, five or six papers, but none of them are super long, and they're all meant to provoke you into thinking about reproducibility and your own practices. I'd like you to read these before the next tutorial. So, this whole tutorial series started as an April Fools' joke for the mothur software package. The idea was that I wanted to release a function called write.paper, because I've gotten lots of emails from people basically saying, write my paper for me: I should be able to give you an SFF or FASTQ file and you should be able to pop out an analysis that does everything I need it to do.
And I think we agree that that's kind of silly. What's really silly is that years after I posted this as an April Fools' joke, I still got emails from people asking why write.paper doesn't work. But this was a motivation for me, and sure, it was a joke at the time. Wrapped up in it, though, was the question: could I write a command that would take raw data, maybe even go out to the web and pull down raw data, process it automatically, and then spit out a paper? As much as we joked about this and people said, aha, good one Pat, that's really the goal of this workshop: to help you write your own function, your own code, so that you can go to the command line, say write.paper, and it will pull down your raw data, manipulate it, analyze it, and pop out a Word file or a PDF that you can then submit to a journal, give to your PI, or do whatever with. So again, as much as this started as an April Fools' joke, it really evolved into thinking about reproducible research. I'm not going to say it's as easy as pounding out write.paper. You still have to tell the computer exactly what to do, but by telling the computer what to do, you're really telling your colleagues and potential collaborators around the world how you analyzed your data and what they can do to analyze your data. As we go through this series of tutorials, there are going to be two recurring themes. The first is thinking about your collaborators and who they are. Mr. DiCaprio up here is telling you that you are the most important collaborator, and it's a special version of you: the version of you that exists six months from now, and current you won't have email access to that person. I think we can all relate to that. We've all had a project that we put to the side for a few weeks or a few months, maybe even a few days, and when we come back to it we're just like, where was I? Where was I going?
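To make that idea concrete, here is a purely hypothetical sketch of the shape a write.paper-style driver script could take. None of these stage names come from the tutorial itself; they just illustrate the pattern of raw data in, finished document out, triggered by a single command:

```shell
#!/usr/bin/env sh
# Hypothetical write.paper pipeline: each function is a placeholder
# for a real stage (downloading data, processing sequences, rendering
# a report) that later tutorials build up with tools like wget,
# mothur, and R Markdown.
set -e   # stop the whole pipeline if any stage fails

get_data() { echo "stage 1: pulling down raw data"; }
process_data() { echo "stage 2: processing sequences"; }
render_paper() { echo "stage 3: rendering the manuscript"; }

get_data
process_data
render_paper
echo "write.paper: done"
```

Running it prints each stage in order. The point isn't the placeholders; it's that the entire analysis hangs off one command, which is exactly what the rest of the series works toward.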
The second most important collaborator is your PI, your boss. They need to know how you have been analyzing and working with your data. Someday you may leave the lab: you might graduate, you might get a job, you might go on to greener pastures, and your PI is going to be stuck answering all the emails that come in. They need to understand what's going on, how you've organized things, how you've done your analysis, and why you did different things in your analysis, so you need to make these things as transparent to your PI as possible. And of course, the third most important collaborators are your future collaborators: people who will read your paper and, as I said earlier, will want to riff off of your work to expand it for their own data sets or their own questions. The second theme that I want to focus on is that reproducible research is a positive thing. This is not meant to be gotcha science. Too much of the discussion about reproducibility has turned into gotcha science: aha, you know, the power poses, it's not reproducible, it's not real, it's garbage science, it's junk. Instead, think about reproducible research practices as preventative medicine. If we can make sure that our code is reproducible and that we're using good practices, then we won't necessarily avoid all future problems, but we will prevent a lot of problems down the road, because it will be easier to track down those problems if they occur. We'll be more transparent, so others can help improve the work we're doing, and we'll have a better idea of what we're looking at months from now when we come back, scratch our heads, and say, how did I make that figure again? The tools that you're going to get out of this series of tutorials will help you do that, and in that way reproducible research methods are considered by many to be preventative medicine.
So as I said earlier when we were looking at the website, we'll cover a number of tools throughout this tutorial series, and none of these are really specific to microbiome data. I have used these tools in a wide variety of projects I've worked on that have had nothing to do with microbiome data. These five areas really focus on some of the bigger themes that we're going to be diving into as we think about reproducibility. The first is documentation, where we're going to use tools like Markdown, R Markdown, Make, and Git. Next is organization. Organization is very important, because if you can't find something, then it's really hard to reproduce it. So we'll use tools like bash, we'll talk about using high-performance computing clusters, and we'll also talk about using Git version control as a tool for helping us with organization. We'll also talk about automation. I don't know about you, but whenever I get involved in manually curating steps, all sorts of weird things start happening that I perhaps can't reproduce. If we can automate things, then we can overcome a lot of these problems, and for that we'll talk about using bash, R, and Make. Transparency is also a big issue when it comes to reproducible research: we need to be transparent with our external collaborators, with our PI, with ourselves, and with those out in the world about how we did our work. Tools like ORCID iDs, figshare and other databases, Git and GitHub, and open source licensing are all going to be really important as we think about improving the transparency of our data analysis to hopefully make it more collaborative. And that final issue of collaboration is really critical, because I don't want to do research that begins and ends with me. I want to have an impact on others, right? I want people to think of my results as something that they can build off of.
But I also want all the hard work I went through to process and analyze my data to be valuable to others, so that they can take the tools I've developed to analyze my data and use them to analyze their data. There aren't many feelings better than getting an email from somebody out of the blue that says, hey, thanks for making your code available online. I've been struggling with this problem, and I see that you solved it or found a way to deal with it, and now I've used it in my code and that's really helped me out. I mean, that's what science is all about, right? Taking work from other people, riffing on it, and expanding it to advance our knowledge of the world around us. As I mentioned earlier, we'll use a couple of tools that we're not going to go super deep on. They're important for scripting and analyzing data, but I cover teaching these tools in other places. I already talked about R and our Minimal R tutorial; if you're not familiar with R, I'd really encourage you to go through those materials. You don't need to be an R expert to make it through this series of tutorials, but it'll certainly help. The second is mothur. mothur is a software package that my lab has created and maintains for analyzing 16S rRNA gene sequences. Again, it's not critical that you know mothur. We're going to do some copying and pasting of mothur commands from the mothur MiSeq SOP, and it might be worth your while to at least once go through that SOP documentation using the data provided in that tutorial. Okay, so again, it's useful to know R and mothur, but it's not critical. As I mentioned earlier, there are five papers that I'd like you to take a look at. They're not very long, typically maybe two or three pages. The first is by Francis Collins and Lawrence Tabak from NIH.
Francis Collins is the director of NIH and Tabak is his deputy, and their paper describes the agency's plans to enhance reproducibility. Second is an editorial by Arturo Casadevall and others, which reports a framework for improving the quality of research in the biological sciences. This is an outgrowth of an American Academy of Microbiology report dealing with reproducibility in microbiology, and so it's a very relevant document for those of us in the field, because it tells us what other microbiologists, generally senior microbiologists, are thinking about reproducibility. Next is an editorial by Jacques Ravel and Eric Wommack, published in Microbiome several years ago, titled "All Hail Reproducibility in Microbiome Research". Here they lay out several recently published papers that were making use of reproducibility tools, so it's a very useful paper for thinking, again, about people within our field and seeing how they approach reproducibility. The fourth paper is a more nuts-and-bolts, practical paper by Noble, "A Quick Guide to Organizing Computational Biology Projects". It's perhaps the most boring title ever, and it sounds like the worst paper ever, but when you read it, it's really rich and really gets you thinking about how we organize our projects. These are concepts that we'll come back to later in the workshop. Finally, a paper that I think is just awesome for its humility is by Garijo et al. out of the Bourne lab. Bourne is a leader in bioinformatics; he was, I think, some type of director at NIH. His group published a paper looking at the tuberculosis drugome several years ago, and then Bourne issued a challenge, saying he'd love for people to come back and try to reproduce the work they did. This manuscript, Garijo et al., is a report of what happened when people went back and tried to reproduce that work.
I think a lot of us cringe at the prospect of somebody coming back and reproducing our work, but again, I tip my hat to Bourne and his colleagues for doing the experiment and seeing where the bottlenecks were. We'll come back and talk about all of these papers in future tutorials, so it would be great, before the next tutorial, to read them to get a sense of where we are in science in general and microbiology in particular. Okay, so here's some homework for you to think about, and I think these would be great discussion items for your next lab meeting, for anyone interested in the area of research ethics, or just for your own research. These are questions that we all grapple with, I think. First: what about your current data analysis pipelines makes you feel uncomfortable? Perhaps me talking here has dusted off a few recesses of your memory that make you a little bit stressed out. Are there areas of how you do data analysis, or the reproducibility of that analysis, that make you a little nervous? The second question: if you left your project for six months, how difficult would it be to restart? I have a lot of friends in ecology who commonly go off for fieldwork for months at a time in remote areas where they can't work on a computer, and then they come back. How do they get going again? How difficult would that be for you? How difficult would it be if you went to a conference for a week and came back and needed to get going again? And finally, how easily is your PI able to interact with your analysis? Are they truly a collaborator, or are they a receiver of the analysis and the text you give them? Is it possible for them to interact with you? Maybe you don't want them to, but the question still needs to be asked. And there are no right or wrong answers to these questions.
They're questions to motivate your thinking and get you to start grappling with the issues that we're going to encounter as we work with data and think about how we can make our analyses more reproducible and hopefully more robust. Well, thanks again for hanging out with me as I introduced the Riffomonas project and this reproducible research tutorial series. Like I said, it's a project I'm really excited to share with you, and I'm excited to show you how my lab and I strive to make our data analyses more reproducible. It will seem like a lot of content over the next few weeks. What I want to impress upon you, however, is that it's important, and okay, to take incremental steps toward making your work gradually more reproducible. I think it's unreasonable to expect you to incorporate everything we talk about over this series of tutorials into your first project, or to somehow go back and retrofit current or previous projects. Instead, I'll be giving you tools that you can gradually bring to your projects to improve your overall practices over multiple projects. The next tutorial will define reproducibility and replicability and explore the factors that impede the reproducibility and replicability of research. The content of the next tutorial won't be too technical. My goal is to ease us into the more sophisticated and technical aspects of how we make our research more reproducible. A lot of the problems we face with reproducibility, frankly, are human problems, where our own behaviors and practices get in the way of our progress. I think you and the rest of your lab, including your PI, will get a lot out of the next tutorial, even if they aren't actively doing data analysis. So long, and I hope you'll join me next time as we dive deeper into this series of tutorials on making our data analyses more reproducible.