 Reproducibility is important in science for so many reasons. One of the reasons that I think is totally undersold is the ability to reproduce our work across projects. What do I mean by this? Well, stay tuned for the rest of today's Code Club and I'll show you what I mean. Hey folks, I'm Pat Schloss. Reproducibility can mean many things to many people. On one end, if we take the same code or the same methods and apply it to the same system, we should get the same result, right? That's a basic idea of reproducibility. Another approach to reproducibility that I'm really fond of is the ability to take the same code and apply it to different systems to not get the same result, but to get the same process, the same processing of the data, the same visualization of the data, the same analysis of the data across those different data sets. To me, that's a win that really improves our ability to replicate our analyses and do the same type of analyses in multiple different settings. Also, it is the basis of the name of this channel, which is Rifomonis. The idea of Rifomonis comes from the idea of riffing, where we might take a theme, say in music, and replicate it across different settings with slight different perturbations. If you're a fan of Hamilton, the musical, perhaps you're familiar with the counting sequence from one to 10. That counting sequence is used maybe three or four times across the musical each time with a subtle change, and that if you think about it in kind of the context, it tells a different story each time it's used. Well, I'm not making Hamilton here. I'm sorry. I'm not Lin-Manuel, but I can make my life easier by taking the analysis I apply for one project and using it into another. Obviously, I think this also says a lot about how we can learn to program and that I can learn to program by taking a code chunk that somebody else has developed. I can adapt it to my use and then modify it to make it better or to highlight the unique aspects of what I'm doing. That is how I learned how to program and that is how I try to teach people to learn how to program themselves. So if you've been following along with recent episodes of Code Club, you know that we've been building an ordination using different packages, different approaches to a data set that was developed by a former graduate student in my lab who's now Dr. Alex Schubert. We've taken that. We've improved it. We made it look nicer. And what I'd like to do now is see if we can take that same coding approach and apply it to another data set that was made by another former student in my lab, Dr. Neil Baxter. The ordination we've been working with looks at the variation in the gut microbiota of people without diarrhea, those with diarrhea and those with diarrhea and who are also positive for Clostridioides difficile. The new data set I want to work with from Neil Baxter's papers looks at the variation in the gut microbiota of people with normal colons, those with adenomas in their colons, and those with carcinomas in their colons. My goal for today's episode is to take what we've done for the Schubert data and modify it for the Baxter data. Looking in my raw data directory, you'll see that I have the same files for the Baxter data set that I had for the Schubert data set. Also, I have Schubert adonis.r and Schubert NMDS.r. I'm going to start by renaming these to be Baxter adonis.r and Baxter NMDS.r. And I will go ahead and open this up in our studio. And if I go ahead and open up these scripts, one thing that I see right off the bat is that I'm sourcing in Baxter NMDS.r Schubert adonis.r. So let's go ahead and change that to Baxter adonis.r. I'm going to go ahead and source the whole thing. Great. So this created Schubert NMDS.tiff. That is the visual that we expected to see. It didn't fail on anything when we ran the adonis r script. You might recall that we had a stop if not in there if any of the comparisons were not significant. So that looks good. So if we start in Baxter adonis.r, I'm going to go ahead and change Schubert to Baxter throughout. I think there's only a couple of places where I used Schubert. I guess I can go ahead and do a search on Schubert. And I think I got them all. Okay. Something that occurs to me is that not only are the file names different, but the column names in my metadata file are probably going to be different than what I had in Schubert. And certainly the diagnosis or the treatment groups are going to be called different things between the two different data sets. So I'm going to want to update this in adonis.r as well as in NMDS.r. Let me start by running these first 11 lines of code. And I can then look at metadata to see what the columns are called. Let me make this a little bit bigger so you can see. And the column I'm going to want is this dx column, which is short for diagnosis. And the other thing is that I've got sample here and sample instead of sample ID. So that's good. And again, wherever I have sample ID, I'm going to go ahead and find that. And I'm going to replace it with sample. And so I can replace all those. It's a little bit dangerous to do find all replace all, but when we'll run it, we'll let us know if we screwed something up and kind of scrolling down through here as well. I see disease stat. That's what we had before. And we want to replace that with dx. So if I do disease stat, I can then replace that with dx. And I'll replace all of those. And there's 10 of those kind of scrolling down through here. We get down into the different diagnosis groups. And we're going to have normal adenoma and cancer. And so I guess wherever I have case, I'm going to call that cancer. So I'll replace all those. And then wherever I have non diaryl control, I'll put normal. And then wherever I have diaryl control, I'll put adenoma. It's kind of the same stepwise progression. So I found those four. Let's go ahead and run all this and see what happens, see if anything blows up. So that went through well. Let me look at meta distance. Again, what we're going to see in meta distance is that we've got our meta data. And then we've got our different patient IDs, sample IDs. I can then run all distance. So I have an error here that it can't subset the columns that don't exist. Those columns do exist. I think the problem, though, is that when I read in the distances, it read in those columns, the column of sample IDs as a double. And so again, if I look at the default, it read in as double. And if I look at distance, I see that this first column, which is my sample ID is a double, I'll go ahead and modify this. So I'll add call type types calls. And I will do so that first one is x one is going to be call character. And that needs to be a function. And then I'll do that default equals call double. And that should work. And so now if I look at distance, distance, not distances, I see that my first column is of type call character. And now if I run these, so I'm getting an error that I can't join these two data frames, because sample in metadata is of type double and in in the distance matrix is now type character. So those need to be the same. Read Excel doesn't have the same interface for defining column types. So what I'm going to do instead is to do a mutate. I'm going to mutate sample to be as dot character on sample. And that so then that should make metadata. That first column now should be of type character, which it is. And we can now do the inner join in life is good. So again, as we're using different data sets, we know that we have different structures to the data. And so it does take a little bit of hand holding of the data to get it to where we need it to be. And so now if we do all distance, that works. No error messages. That's good. All test is running. And if we then look at all test, and here I'll make this window a little bit bigger so we can see it, is that we do have a significant p value point zero one. And so now we are okay to proceed on to the next step. And again, if we look at all p, we should get all p, our p value. We now can also look at our different diagnosis groups and the counts there. I don't really know why I have that line there. I'll go ahead and remove that. I need to define my pairwise p. And now what I'm doing is renaming things. So this shouldn't be cancer diarrhea, it should be cancer adenoma. And what we're doing here is we're filtering the data frame to get those rows. Someone emailed me to ask if we need to make sure for adonis that our rows in our distance matrix rows and columns in our distance matrix are the same as in the metadata. Yes, absolutely we do. And so in this step, we are doing that. And that's why we did the join of the metadata and the distance matrix so that we have everything in the correct order. And then we're refiltering things and then pulling things right back out of a part again. So I need to go ahead and change that diarrhea to adenoma here as well. And we are sampling that out. And this again will be adenoma. So update these run that. And then this again will be adenoma. And that should be good. And so now if we will look at pairwise p, we see that it's 0.05. So it's probably not going to be significant when we correct for multiple comparisons. All right, so we have cancer health. I'm going to go ahead and change health to be normal. A fun discussion item for a lab meeting or discuss broader journal club might be what's the difference between healthy and normal? Are they the same thing? No, they're not. All right, so we'll run that. And then we look at cancer normal test. And then pairwise p cancer normal. And so here we now see that the p value for cancer adenoma is 0.05 cancer normal, it's 0.01. And now we want to look at these. And I'm going to go ahead and change diarrhea to adenoma. Hopefully this doesn't screw all sorts of things up. Think we should be okay. And run the test. And now we look at pairwise p. And we see that adenoma to normal is not significant. Cancer to adenoma is 0.05 cancer to normal is 0.01. And we can then recall that we did this p adjust on all of the values of pairwise p run that. And we see that sure enough, cancer to normal is significantly different, but cancer to adenoma is not significant. And so this will yell at us, right? It will stop. So I'm going to now modify this to be a I'll create a variable called significant. And I'll do that p adjust less than 0.05. And I will then do so significant will then be false true false. And I'll do a stop if not. Let's see. Let's do stop if not cancer normal. Or it'll be significant. And then square brace, quote, significant normal, that. So that's good. And then stop if not significant cancer adenoma. And it'll also then be adenoma normal. And I'm going to do not significant, not significant, I'll slap an ampersand in the middle there. Because what not significant cancer adenoma will give me is true. And not significant adenoma normal will give me is true. And so then both of those with the and will be true if they're both true. Okay. And so then we can do stop if not. And now we're good. So what this analysis has told us is that the difference between cancer and normal is significantly different. But otherwise, normal adenoma aren't different. Normal adenoma and adenoma and cancer are also not different. So we'll go ahead and save that. And let's go ahead and return to Baxter and mds.r. And I will run that script, just so we know that it all works. What we're seeing as we work through that Baxter adonis.r script is again, as we use different data, perhaps different metadata configurations, we need to update for different columns, column names, data types, values in those columns, right? So we don't have non-diarrheal control, diarrheal control in case we now have normal adenoma and cancer. And so we have to update that. And that works well. All right. So now we're ready to come in and start working on our plot. And so again, wherever I have Schubert in here, I'm going to replace that with Baxter. So I'll replace all of those. And again, this will allow us to read all this in. Something that I'm going to do, because I know I had to do it earlier, is the call types. And this will be calls group equals call character. And then otherwise, the default will be call double. Okay. And then metadata, again, I'm going to do a mutate, they put in some more white space here, there's no need for it all be jammed together, a mutate sample. And that will be as character on sample. Okay. And then, so that's good. And then our inner join will be sample on group. And then we need to mutate DX. And that's going to be a factor DX and our levels will be normal adenoma and cancer. And again, what this levels is doing is putting things in the right order for the for the disease progression. So that way, if we were to, if we were to say plot things in a, an order, we'd want it to be in the order of the disease progression, I actually don't know that we really need that for this visual, but you never know. So now we come to my legend, this was a chunk of code that created labels for each of the different clouds of points in the ordination. I'm going to comment this out for now. I'm not sure that I want to incorporate that I might leave that for you to play with later on. This GM rich text interacted with that as well. So we need to set the color disease stat to be DX. Something else that occurs to me is that instead of copying and replacing finding and replacing disease stat and changing it to DX, up ahead, I could have renamed my DX column to be disease stat that that perhaps could have made life a little bit easier as we went along here. So this all looks pretty good. I'm going to leave the titles where they are for now. I am going to change my breaks to be normal adenoma and cancer. And I'm going to use I'll stick with gray, blue, red, and I'll put normal adenoma and carcinoma. All right. And so then I'll use these same breaks down here. This is the fill that I'm filming annual that I'm modifying here. And I liked the lighter colors to give it kind of a more muted background color for those ellipses. And I think for the most part, this looks pretty good. So I'm going to go ahead now and run my GG plot line, as well as saving the visual. And so this gives me this ordination diagram. Healthy individuals have a different microbiota from those with diarrhea. This is normal diarrhea cancer. And what we saw was that the cancer was significantly different from healthy and diarrhea. I'm not totally seeing it here. But hey, you never know. Perhaps the red is shifted over a little bit. Again, remember that the ordination is a visualization of the distance matrix. And it removes a lot of information that is in the distance matrix. And so I would believe analysis on a distance matrix, such as what's done with adonis, over what was done in this visual using the ordination. And we can say individuals with normal colons. And maybe I'll do put the individuals with outside of all that normal colons have a similar microbiota as those with adenomas. But those with carcinomas, but different from those with carcinomas. So if we run that, what do we get? We see individuals with normal colons have a similar microbiota to those with adenomas, but different from those with carcinomas. So I've got to kind of play with the spacing a little bit. So I'll put it between similar microbiota as those with adenomas. So I need to remove that. So again, you've probably seen me futz with this stuff before. It's a little bit tedious, but is the difference between making things look good and making things just look kind of half-assed, if you will. So adenomas, let's put it after the butt, but different from those with carcinomas. And so I'll put the carcinomas back up a line. So individuals with normal colons have a similar microbiota as those with adenomas, but different from those with carcinomas. Cool. And then I'll say I need to update my caption down here below. So I said all pairwise comparisons were significantly different using adenomas. So that's not true. I will say the pairwise comparison between normal and between people with normal colons and those with carcinomas were significantly different using ba-ba-ba, but other comparisons were not. Okay. And so again, if we run this, great. So we have our caption is updated. I went ahead and just put in the line breaks where I needed them. It took a little bit of experimentation, but you kind of get the idea. Again, I hope this gives you a sense of what we can do to modify our code from one application to a new application. Some things to think about as we do this, however, I've got the same color scheme for this Baxter data as I did for the Schubert data, right? So things are going to start looking derivative. We kind of see that when people start using common packages. So for example, I can look at a plot made by someone using ggplot and know that it's made by ggplot because perhaps people didn't change the default color scheme or the default way the axes look. They're just these telltale signs. And so if you see the same color scheme propagated throughout a group's figures or across a bunch of papers, then we can kind of see that this is derivative. Something else to think about is that as we went through, you know, there were different things we could do to perhaps make it easier to generalize easier to reproduce across data sets, right? So we talked about, you know, we could we could standardize that that first column is the sample ID and that it's call character. We could also change the column we're interested in to have always having the same name. So we had disease stat and DX perhaps we could standardize that. And as you do that, then perhaps within a group group, it's useful to share your code. And if you do a lot of that, then it becomes useful to turn those into functions and then from there into a package that then becomes more useful to other people and also then becomes ultimately more reproducible because you're not kind of futzing with the code as you work through it like like I did a lot in today's episode. So that's actually happened. So the vegan package actually has a fair amount of data visualization and analysis baked into it. If you do that though, if you follow the vegan approach, then you're stuck for better or worse with the design decisions that the developers of vegan have implemented, right? Just like if you use GG plot, you're stuck with the defaults that Hadley Wickham gives you because he was the primary developer of GG plot. Those are good decisions in some places and other places you might say, you know, I really don't like that. I don't like his gray backgrounds and grid lines that he has as the default color scheme, right? So anyway, know that there's trade offs and pluses and minus minuses in whatever you end up doing. So what I've done in today's episode isn't that far removed from what I've been asking you to do with these episodes. I love it. Absolutely love it. If you took the code that we've been working on in these episodes and applied it to your own data set to see if you can customize the code for your own purposes, a few episodes back, somebody commented on that in the episode saying that they got it to work for their data and that just made me feel so good, right? So if I can reproduce it for my own data sets or other data sets from my lab and you can do it, then I think we're really winning. And again, we see the value of reproducibility. Something else to note about this Baxter data set, we are going to go back and work more with that Schubert data set. But the Baxter data set is the data set that I use all throughout a tutorial series that I have called minimal are that's available at the rifimonis.org website, I think it's rifimonis.org slash minimal are I also use that data set as the basis of a set of three day workshops that I teach would love to have you participate in the next one if you're interested. So please, please, please take this code that we've been working on, apply it to your own data. Something else I hope that this emphasizes is the value of posting your code online to make it openly available for others to use. If you're getting benefit out of the code I'm giving you, just think about the benefit that other people will get out of seeing your code. Again, a fundamental part of reproducibility is that we should be able to take the same code, apply to the same data and get the same result. But that that's only so far that gets us only so far with reproducibility. The true value of reproducibility in my mind is that you could take my code that I've generated with this to generate this plot and apply it to your data to get a new plot and that you can then modify it to do something further. If anything, I hope this also shows you the value of making your code publicly accessible. You're hopefully getting value out of the code I'm sharing with you and walking through with you. Just imagine how much value people will get out of seeing the code that you're using to do your analysis. I can't tell you how good it feels that somebody has dug through my code for papers we've written and we've posted the code and they've gone on to use that code in their analysis and they write me to tell me about it. That's just that's just a win, right? I feel like that is the true impact of focusing on reproducibility in science. Anyway, keep practicing with these concepts. Please spread the word and tell your friends about Code Club. I really appreciate the time you spend sitting with me and working through this data and working through these different types of visualizations. We'll see you next time for another episode of Code Club.