Have you ever worked on a project for an embarrassingly long time and thought to yourself, you know, I've been working on this so long, I wonder if there's a new version of the software, or the databases, or maybe even the underlying data that I'm working with? Well, over the last week I've been editing my manuscript, doing many, many rounds of revisions, and I thought, you know, I'm using the rrnDB. I wonder if they've had an update since I started this project back in June or July. Sure enough, they have, and wouldn't you know, they increased the number of sequences in the database, as well as the number of species, by about a third. So I thought this was a great opportunity to test all those tricks we've been using for reproducibility and see how hard it would be to rerun the analysis with the new data. We'll do just that in today's episode of Code Club. Stay tuned, and I'll show you how we get it done. There are many reasons why an analysis may or may not be reproducible. It might not be reproducible because your instructions are vague, or because you don't document well what you've done, which makes it hard for somebody else to come along and repeat it. It might not be reproducible because the data someone uses in the future is different from the data we have today. And similarly, you could be vague about which versions of databases or software you're using, things like that. What we've been trying to do over the past 60-some episodes, as we've developed this analysis, is to make our documentation clear and to make things programmatically reproducible. We should be able to type make submission/manuscript.pdf at the keyboard and, voila, it generates all the files as well as the final rendered PDF of the manuscript we're working on. And it just works. So today, we're going to put that to the test.
I'm going to remove all of the old files that we downloaded from the previous version of the rrnDB, and we're going to use the new version and see how well everything works with it. Do I think the results are going to change? I have no idea. I really hope they don't, because I'm this close to submitting the manuscript. But with the latest, greatest database and more data, we'll have a more robust answer, one that should be more resistant to future changes in the database. That's why I think it's really important, before we upload and submit the manuscript, to go ahead and update the database. This version of the rrnDB was released January 18th, 2021, just about a month ago. If I go to the About page, the font is a little small, but if I zoom in you can see that what we had been using was 5.6, released in October of 2019, so it's been over a year; I was kind of expecting this to happen. What we see is that we previously had 15,486 bacterial records from 4,568 bacterial species, and now we're up to 20,255 bacterial records from 5,716 species. So again, it's gone up by almost a third, maybe 30% or so. I want to see whether the results change, but also, with more data, we'll perhaps have a more hardened result. The other thing that occurred to me was that in Figure 1, I showed on the x-axis the number of operons in a genome, and the y-axis was the threshold needed to get all of the operons from a single genome to collapse down to one OTU. I limited my x-axis to those operon counts where we had at least 100 representatives, so we only went out to nine, because once we got to 10 there were fewer than 100 species with genomes that had 10 or more operons. Maybe with more data we'll actually be able to get out to 10 or 11 on that x-axis.
Anyway, that is the motivation, and I'm going to share how I'll update the analysis, including some things I could have done better last time, and how we can hopefully change this easily. Now I'm in my project's root directory, and I'm going to open up the project in Atom. So here we are. One of the things I see is that I've got these files named rrnDB-5.6. As you'll recall from my makefiles, the stuff on the left side of the colon is the target, the thing I'm trying to build, and targets often become the prerequisites for later rules; the things on the right side of the colon are the prerequisites to build that target. We see that the prerequisite here is a shell script that pulls down the data from the web and then generates this file, and we basically have the same rule to build each of the four files that come out of the rrnDB. If we copy and paste that file name, we see it's used down below to make the aligned file. Now, one thing I did that I wish I hadn't was leaving the version number of the database in the file names, so I have -5.6 everywhere, and what I need to do now is update that to 5.7. If I copy that and do Shift-Command-F in Atom, it will search my whole project. I can put in rrnDB-5.6, and if I do Find All, it shows me every place rrnDB-5.6 is used. I should then be able to change everything to rrnDB-5.7 and replace everything. Yep, and I want to replace that 32 times across four files. The better practice from the beginning would have been to name these files something like rrndb_16s instead of baking in the version number, because if I had done that, the only thing I would have needed to change would have been up here in my get_rrndb_files script in code.
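Abstracted, the pair of rules I'm describing looks something like the sketch below. The file and script names here are illustrative stand-ins, not copied from the repository:

```make
# Sketch of the download rule: the target on the left of the colon is the
# file we want; the prerequisite on the right is the script that builds it.
# All names here are hypothetical examples.
data/raw/rrnDB-5.7_16S_rRNA.fasta : code/get_rrndb_files.sh
	bash code/get_rrndb_files.sh $@

# A later rule then uses that download as its own prerequisite:
data/raw/rrnDB-5.7_16S_rRNA.align : data/raw/rrnDB-5.7_16S_rRNA.fasta
	bash code/align_sequences.sh $<
```

Because the version number appears in the target names, a new release means editing every rule that mentions them, which is exactly the 32 replacements I'm about to make.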
So if we look at that now, we see get_rrndb_files. Where are you? Ah, for some reason that didn't change, and I guess that's right, because it takes the target and downloads the zipped version of that file. Anyway, I think this is fine, and hopefully it will work when I run it. Okay, the other thing that occurs to me is the set of thresholds I picked many episodes ago. Again, this is the x-axis on Figure 2, where we look at the fraction of genomes that are split apart at each threshold. I had thought initially that we would need really fine thresholds, because things would fall off really quickly towards the left side, closer to the definition of an amplicon sequence variant. It turned out we needed to use quite a bit larger thresholds, so of course we went back and added 0.010, 0.015, 0.020, and so forth. What I'm going to do now is replace these and use finer increments. Again, this says 002, but that's really 0.002, then 0.003, and so forth, out to 0.1. What I'd like instead is quarter-of-a-percent increments, if that makes sense: 0.0025, 0.0050, 0.0075, and so on, which brings us up to 0.1. Maybe I'll do some copying and pasting to make this a little easier. There are going to be quite a few values here, so let me count them off: 1, 2, 3, 4, 5, 6, let's see, 7, 8, 9. And I don't know that we need this one. Then we'll increment here as well. Again, what we're doing is incrementing the thresholds that we're going to test when we build OTUs. We're going to have quite a few more values here, and for this last column I need to increment it as well. Very good. That should get us from 0, which is, again, the ASVs, up to a distance of 0.1. And so let's now concatenate this together as one long string.
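Rather than typing all 41 values out by hand, you could generate them. Here's a quick sketch using awk; the integer loop counter avoids floating-point drift that can bite naive step-by-0.0025 loops:

```shell
# Generate 41 thresholds from 0 to 0.1 in steps of 0.0025 (a quarter of a percent).
# Counting with an integer and multiplying avoids floating-point accumulation error.
thresholds=$(awk 'BEGIN { for (i = 0; i <= 40; i++) printf "%.4f\n", i * 0.0025 }')
printf '%s\n' "$thresholds" | head -n 3   # 0.0000, 0.0025, 0.0050
printf '%s\n' "$thresholds" | wc -l       # 41 values
```

The same list could then be joined into the one long string the makefile wants, for example with paste -sd' '.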
And because I've changed the thresholds I'm going to use, I need to be sure I clear out the old output. If I do ls on data, I see v4; let's see what's in there that we want to remove. We want to remove all these count tibble files. If I look at data/v4, I see all those files, and if I change that v4 to a star, I see the files for all the regions. Okay, I'm going to go ahead, and this could be dangerous, but I'll remove those. If I double-check, I don't have anything there anymore. The other thing I want to remove is in my raw directory, where you'll see I have these rrnDB files. So I'll do rm on data/raw/rrnDB* to get rid of those; they'll be replaced by the 5.7 versions when we rerun everything. And we should be in good shape. Now, if I do make -n on a target, it will tell us everything that needs to be run to generate that target, without actually running it. So if I do make -n submission/manuscript.pdf, it should tell me all the commands make will run to generate my PDF. I run that and I see all of these commands, and there are a lot, right? It's going to start by downloading everything and then getting the ASVs for all these different regions. It will also generate the distance matrices somewhere around here, as well as the counts; I guess that's done as it builds these count tables. It's going to build the distance matrices, cluster those sequences into OTUs, and then extract that into a nice format we can work with, which we'll pool with the amplicon sequence variants. That will all feed into building my figures and then building the manuscript. So it should just work. Now, if it doesn't work, that doesn't discredit everything I've done; if something fails, that's great, because it gives me an opportunity to improve the reproducibility. Heaven forbid this whole process of peer review takes another year and they release version 5.8.
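If you haven't used it, make -n (a "dry run") just prints the recipes it would execute without running them. Here's a tiny self-contained demonstration; the mini-makefile is hypothetical, purely for illustration:

```shell
# Build a throwaway makefile with one rule (recipe lines need a leading tab),
# then ask make what it *would* run to build the target.
printf 'out.txt: in.txt\n\tcp in.txt out.txt\n' > Makefile.demo
touch in.txt
make -n -f Makefile.demo out.txt   # prints the recipe; out.txt is never created
```

Against a real project makefile, the same flag prints the whole chain of commands from raw downloads to the final PDF, which is a nice sanity check before committing to a day-long run.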
Well, if I run it on 5.8, it should work just as well. And if I do have to run 5.8, I'll wish I had changed those file names to drop the version number, but like I said, I'm not going to worry about that. One other thing I'm going to do, which I should have done earlier, is check my git status. I see the only things I've changed are that makefile along with those scripts. One of the cool things we can use once we're done running this is the git diff command: I can compare what my output file looks like now against the previously committed version of the file. So I can see places where a figure changed, or where the code I have embedded in the text produced different values because of the new data. We expect things to change a little, but not wildly. If things change wildly, then we might scratch our heads and wonder whether there's something more fundamentally wrong with our analysis, or with our assumptions about the database and where the data are coming from. Okay. I don't know how long this is going to take; I think it's going to take a long time. So I'll do make submission/manuscript.pdf, let it rip, and see how it goes. Hopefully we don't get any error messages. One of the other things I did, and you can maybe notice a small difference in my terminal in Atom, is that I updated to Big Sur, and I also have the newest version of R installed. I don't think those things will cause major changes to the analysis or overall problems, but we'll let this run, and if it has problems, we'll revisit the code and make it better. Well, that unfortunately didn't take very long to crash. It looks like it crashed in code/get_genome_id_taxonomy.R. I had some built-in checks in the code, so let me double-check what that looked like. That was in here.
Yeah, so it's failing on this check where I think I was walking up the GenBank/NCBI taxonomy. Before I dig too much into this, something I realize I probably should have done is rebuild the NCBI lookup files. So why don't I go ahead and do that? If I look at data/references, I see I've got all these NCBI files. I'm going to remove those, so I'll rm the NCBI files and the related lookup files from data/references; I can just remove a bunch of these things. I'll leave the SILVA stuff in here, but I'll also get rid of the taxonomy stuff, and I think that should do it. So let's see what this looks like. That readme actually came with the download, and I don't think I actually use this lookup file, so I'm going to remove that too, just in case. Then I'll try to rebuild it, and we'll see whether we get any further along this time. It looks like it got through the get_genome_id_taxonomy.R file without any errors this time, so I think we're in good shape. I'm going to let this run, and I'll let you know if I run into any other problems along the way. Once we get past that initial stage of getting the rrnDB files and linking them up with the taxonomy, I think we should be good. All right, it went through with no more errors. So there were two things we had to do to trigger rebuilding the whole manuscript.pdf file. The first was clearing out all of those old count tibble files. Now, I didn't have to do that, but I wanted to because I changed the thresholds I was using. You'll recall I had used really fine thresholds close to zero because I thought the ASVs would work much better than they did. So when I reran it this time, I took the opportunity to use quarter-percent increments between 0.25 percent and 10 percent.
So anyway, that ended up taking about a day to run. Generating those count tibble files was pretty slow. If I were doing this for a bigger project, or redesigning things from the beginning, I might think of a way to parallelize that analysis better. I basically had 41 thresholds and four regions, so that's 164 different combinations. I could have put each region, or each threshold, onto a different processor on our university supercomputer and gotten done pretty quickly. Anyway, I had a bunch of meetings yesterday afternoon, I needed to sleep, and I had stuff to do, so I started this before lunch, let it run, babysat it a little to make sure nothing went wrong, and came back this morning, and everything looked really good. The other thing you'll recall I had to change was re-downloading the NCBI files, because I had gotten those back in the fall and no doubt they've been updated; the new version of the rrnDB was using taxonomy node IDs that weren't in the previous version of those files but are in the current version. Anyway, like I said, it all worked great. If I run git status on my current version of the repository, I see a fair number of files have been modified. Of course, the makefile and these code files were changed because we needed to change the version number of the rrnDB. Again, if I had it to do over, I probably wouldn't bake the version number into the file names, but who cares? The odds that I'm going to run this again are very small, and the odds that anyone else will are also very small. Although, it would be really cool if you tried doing a git clone of my repository and then, at your prompt, ran make submission/manuscript.pdf. I'd be really curious to see whether you run into any problems, because if you can do that without any problems, that would mean that I won, right?
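Since each count tibble target is independent of the others, one low-effort way to parallelize, assuming the makefile's rules really are independent, would have been make's own job server via the -j flag. A toy demonstration with a hypothetical mini-makefile:

```shell
# Two independent targets; make -j 2 is free to build them at the same time.
# (Recipe lines need a leading tab, hence printf rather than echo.)
printf 'all: a.txt b.txt\na.txt:\n\ttouch a.txt\nb.txt:\n\ttouch b.txt\n' > Makefile.par
make -j 2 -f Makefile.par all
ls a.txt b.txt
```

On a cluster you'd more likely submit one job per threshold or region, but for a multicore workstation, make -j gets you a long way for free, provided no two rules write the same file.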
That I beat the reproducibility goblins and created a product that you could run. And then you could take it and modify it to do your own type of analysis. Perhaps you think I missed the boat on something, or you have a question like, what if Pat had done this? Well, give it a shot. Or maybe you're watching this in five years, we have five newer versions of the rrnDB, and you want to ask, has anything changed in the last five years? Hopefully it would work. Anyway, it worked today, which is an improvement over a couple of days ago, because we used a newer version of the database that I didn't even know existed until a couple of days ago, and the fact that it worked with that newer version is a testament to the system we created. It's also a testament to the folks at the rrnDB, and my hat's off to them, because they put out a database that was robust to changes in the underlying data; they didn't change anything that broke my code. That's always a risk, right? When we depend on data from someone else, bringing in a new version could break our code. That didn't happen. The one little thing that did break the code, like I said, was the NCBI files. It might be nice if they had version numbers in their files, or some more documentation about the versions they were using. Anyway, not a big deal; I think this all worked really nicely. So, did anything change? Let's see. One subtle wrinkle I noticed is that I don't have a markdown file in here for my manuscript, but I do have the .tex file. So we're going to use that, and we can use the git diff command to see whether anything changed from the previously committed version to the current one. You'll notice that before I ran the new make command, my repository was fully committed.
And so I can compare the current version of the manuscript's .tex file, built against the new version of the database, with the older version in the repository. To do that, I can do git diff submission/manuscript.tex, and this gives us the output. Now, one of the challenges with this output is that it's line by line. In a way it does look nice, because what's in red is what was deleted and what's in green is what was added; basically, we take out this line and replace it with that line. And as you can see, the number of genomes went up to 20,427, which we knew. So that's nice. I can scan through here and see that not much changed. One thing I noticed, though, is that I have 5.2 where what I probably meant was 5.25. So I should go into my Rmd file, where I'm outputting a percentage; you'll recall I kept one significant digit to the right of the decimal point. I think I need to change that to two, because I now have greater resolution in the thresholds I'm using. But again, as I look through here, I'm not seeing a whole lot of changes. 0.6 became 0.58; again, not a big change, which is nice to see. The other thing you'll see in this output is the colon in the bottom-left corner, which means there's more to see. If I hit the down arrow, I see more: there are about 91 more Mycobacterium tuberculosis genomes, but there's still only one rrn operon per genome, and there are now 17 ASVs across those genomes rather than 11. So I feel pretty good about this; not a whole lot has changed. We see more variants, more ASVs, across all the E. coli genomes than we had before, but of course we now have more genomes. Coming down through here, we see subtle changes in the thresholds; again, that 5.2 versus 5.25.
The other thing I'm noticing, which I don't know how I feel about, is this line where I'm talking about what threshold you would need to use, or, if you used a 3% threshold, up to what number of operons per genome you could be confident that everything collapses down to a single OTU. Before, I excluded 10 copies because I didn't have 100 genomes in the database with 10 copies of the operon. Now I've got more than 100 genomes with 10 copies, and I think what we'll see, if we look at the curve, is that it falls off: the line goes up and then drops back down at 10. So I might modify the language in here, because I think the statement is still true, but at the same time, what I saw for 7, 8, and 9 copies was that you actually needed a larger threshold to collapse those down to one OTU. So I think I'll leave this, or modify it so I can keep the previous result, but I need to add language to make clear what I'm doing. And of course, I'm going to show the figure, so I'm not pulling a fast one on anybody. Anyway, keep scrolling down, and we again see small changes in the values, which tells me this is really robust to a change in the database, and that if they release another version next year, the results shouldn't change a whole lot either. That's awesome. So again, this output was generated using git diff. Another way to get similar output is to use git diff with the --word-diff flag and again give it submission/manuscript.tex. Instead of showing which lines changed, this chunks the diff by words. You'll recall that I had this stuff boldfaced to indicate what was being generated by our code, and so you can see side by side what was deleted and what was added.
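To see the difference between the two modes for yourself, here's a throwaway-repository sketch; the repository name and file contents are made up for the demo:

```shell
# Set up a tiny repo with one committed line, then change a number in it.
rm -rf demo-repo && mkdir demo-repo
git -C demo-repo init -q
git -C demo-repo config user.email demo@example.com
git -C demo-repo config user.name demo
echo "We analyzed 15,486 records from the database." > demo-repo/manuscript.tex
git -C demo-repo add manuscript.tex
git -C demo-repo commit -qm "first draft"
echo "We analyzed 20,255 records from the database." > demo-repo/manuscript.tex

git -C demo-repo diff manuscript.tex              # whole line removed, whole line added
git -C demo-repo diff --word-diff manuscript.tex  # shows [-15,486-]{+20,255+} inline
```

For a manuscript where each paragraph is one long line, the word-level view makes a small numeric change far easier to spot than the default line-level view.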
And again, you scroll down and you see much the same as before. So git diff shows you, line by line, what changed, and git diff --word-diff shows you what changed at the word level. A third approach for showing what changed between runs is to make a track-changes document. No, I don't like using track changes for editing, but I do like using it to show what has changed overall. In this context, I'd like to see, when I change the database, what changes in the text. Also, once we submit this, hopefully we get somewhat positive feedback from the editor and they'll want some revisions; when we upload the new version, we'll have to show the editor and the reviewers what we've changed, so there we'll also need a track-changes version of the document. So how can we generate that using TeX? Well, the first thing we need is an old version of that manuscript .tex file. To get it, we can use a command called git show. If I do git log, I get the commit history of everything I've done, and this hexadecimal number here is the commit hash. I can hit q to get out of the pager, then do git show followed by the commit hash, a colon, and submission/manuscript.tex. That will output the version of manuscript.tex in the submission directory corresponding to that commit, and I can redirect it to manuscript_prev.tex, or whatever you want to call it; just remember what you've called it. I can then use latexdiff to compare the old version to the new version: latexdiff manuscript_prev.tex submission/manuscript.tex, with the output going to diff.tex. By the way, I think it's pronounced "tech," not "teks."
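The git show step, sketched in a throwaway repository; in a real project you'd substitute the hash you copied out of git log for HEAD, and your own path for manuscript.tex:

```shell
# Commit one version of a file, change it, then recover the committed version.
rm -rf show-demo && mkdir show-demo
git -C show-demo init -q
git -C show-demo config user.email demo@example.com
git -C show-demo config user.name demo
echo "old abstract" > show-demo/manuscript.tex
git -C show-demo add manuscript.tex
git -C show-demo commit -qm "v1"
echo "new abstract" > show-demo/manuscript.tex

# git show <commit>:<path> prints the file as it existed at that commit;
# the path is relative to the repository root.
git -C show-demo show HEAD:manuscript.tex > manuscript_prev.tex
cat manuscript_prev.tex   # old abstract
```

The working copy keeps the new text, while manuscript_prev.tex holds the old text, which is exactly the pair of inputs latexdiff wants.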
So if you know better than me, and it wouldn't take much to know better than me, put it down in the comments below and let me know what you think it is. Finally, we now have this diff.tex file, and we can run pdflatex diff.tex to generate the PDF. We get an error message saying the file ../figures/copy_number_threshold_plot.pdf was not found. This is because of some of the screwiness that comes from using R Markdown: we had to use the here package to tell R Markdown where everything was, and that isn't playing so nicely with LaTeX. So what I need to do is go into that diff.tex file and remove the ../figures/ prefixes so it can render. I'll quit out of that question-mark prompt. If I open up my diff.tex file and scroll to the bottom, I should see these ../figures/ lines; I need to get rid of the ../figures/ part and save the file. If I were putting this into the makefile, I might want to make it less manual to fix, but I'm fine with this. It outputs diff.pdf, and if I open my diff.pdf file, I get my nice PDF, which doesn't look too different at first, but voila, the old text is struck through in red and the new text is underlined in blue. It's kind of funny how it chunks up the output here; I think the commas are throwing it off. But this gives you a track-changes version of what's changed in the manuscript. We can use the same approach later, when, hopefully, we get invited to resubmit the manuscript with some edits to make it better. You can see what has changed here in the manuscript, and everything looks good. And of course, at the bottom, we do get to see our figures. Excellent. Very good.
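If you wanted to script that manual edit instead of doing it by hand in the editor, a sed one-liner can strip the prefix; the file name below is a stand-in for demonstration:

```shell
# A stand-in diff.tex line with the problematic relative path.
printf '\\includegraphics{../figures/copy_number_plot.pdf}\n' > diff_demo.tex

# Strip the ../figures/ prefix so pdflatex, run alongside the figures, finds them.
# Using | as the sed delimiter avoids escaping the slashes in the path.
sed 's|\.\./figures/||g' diff_demo.tex > diff_fixed.tex
cat diff_fixed.tex   # \includegraphics{copy_number_plot.pdf}
```

Writing to a new file rather than editing in place also sidesteps the difference between GNU sed's -i and the BSD sed -i '' you'd need on a Mac.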
So I'm going to remove all those diff files, as well as the previous version, because I don't want those stored in the repository. I'll rm the diff files as well as manuscript_prev.tex, and if I do git status, I see everything there. The other thing I notice, if I do an ls, is that I've got all these mothur log files, so I'll do rm mothur*logfile. Those didn't show up in my git status output because I have a .gitignore file ignoring them. All right. So again, git status looks right, and ls is nice and clean. One last thing I'd like to do while you're here is go into my Rmd file and put in the abstract, as well as the title, of my manuscript. I'll do that by going to submission/manuscript.Rmd. I'll go ahead and highlight all this stuff to bring into my Rmd file; it's probably not all necessary, but why not? The title is here, so I'll copy that into my title field. That's that. And then the date; that's good. Then I need, of course, to include all this code that I had, as well as the abstract. One thing you'll notice is that during my editing, because I have R code in my abstract, I needed to put all those code chunks ahead of everything else in the document. So I'm going to copy that into my Rmd file. Where's the top? Here we go. I'll remove this, plop that in there, and that looks good. I'm going to close this diff.tex tab; I don't need to save it because I've already deleted the file. I can then do make README.md, and hopefully that works fine. If I look at README.md, it has the numbers embedded in there, and that looks great. So I've got a few bits of cleanup still to do on the manuscript. I need to change the version number of the rrnDB, and I also need to update the increments that I used; I think the current version of the manuscript says I used half-percent increments.
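The .gitignore entry keeping those logs out of git status would look something like the line below; the exact pattern depends on how mothur names its log files, so treat this as an assumption:

```gitignore
# keep mothur's timestamped run logs out of git status
mothur.*.logfile
```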
I need to update that to quarter-percent increments. I also need to change the output so I include two digits to the right of the decimal for those percentage values. And then I need to give it one or two more looks, and we'll be ready to submit in the next episode. I hope you found this exciting and intriguing. Again, the ability to do this so easily, by running make submission/manuscript.pdf, came from having everything under the control of make from the beginning. I was then able to compare the output to the previous version because I've got everything under version control with git. Those are the bells and whistles that are why you should use git and make; over the course of the project, using those tools has really helped foster reproducibility and given me better control of the overall process. So I hope this is another selling point to get you thinking about what tools you could use in your own reproducible workflow. One of the other things you could do is a git clone of the project from GitHub, and then, at the prompt, you should be able to do make submission/manuscript.pdf. The only reason it might not work is if you don't have the same dependencies that I have. I could give that another shot on my computer, but why burn more carbon than I already have on this project? I'm excited to get ready in the next episode to submit the manuscript to mSphere, and we'll also post a preprint version so we can get feedback from people who, for whatever reason, haven't been watching all of the past 70 or so videos. Anyway, play around with make and git, and watch those old videos where we set up the project, to see whether this is something you could work into your own project and workflow.
Please tell your friends about what we've been doing here on Code Club. Something else that would be really helpful to me: on the community tab of the Riffomonas channel, I put up a question asking what you would like to learn about next from Code Club. Once we've got this submitted, it's going to take a month or so to get any feedback from the journal, so what would you like to do in the meantime and beyond? I've got some ideas, but you might have ideas that are even better than mine; I suspect they're far better. Anyway, keep practicing, and we'll see you next time for another episode of Code Club.