Have you ever written a scientific paper that you've submitted for peer review? The hardest part is getting started. Well, that and actually doing all the science that you then have to write up. Well, stick with me because today we're going to get started. Hey folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, I try to apply principles of reproducible research to an interesting biological question. Well, at least I think it's an interesting question. We've been looking at the sensitivity and specificity of amplicon sequence variants to differentiate between different bacterial species and, in more recent episodes, what distance threshold between those amplicon sequences we should be using to define operational taxonomic units that don't run the risk of splitting genomes into multiple bins. I think we have something unique to contribute to the literature that hasn't been discussed before. If you're like me, getting going on writing a new paper can be a little bit daunting. So we're going to do this together and we'll get started today. Before we dig into thinking about the paper, why don't we go ahead and summarize what we've been discussing over the past several dozen episodes. If you've been watching, you know that I struggle with what jargon to use, and I really hate jargon and debates about jargon. I got an email from somebody about another paper that I had written and posted as a preprint, and they were questioning my use of amplicon sequence variants in that paper. That forced me to really dig in and think more deeply about what we mean by an amplicon sequence variant. The people that proposed amplicon sequence variants were constructing this tool called DADA2, but there are other tools that I think generate amplicon sequence variants. mothur, my software package, certainly does; it has the pre.cluster function.
Robert Edgar has a program called UNOISE2 or UNOISE3 or whatever it is, but UNOISE. And then the Knight lab has something called Deblur. And before all that, Chris Quince had SeqNoise, Sue Huse had SLP, single linkage preclustering, and Murat Eren had oligotyping, which was based on looking at Shannon entropy. So there's this long history. What I think we need to stick with for the jargon is that an amplicon sequence variant is a cleaned-up sequence with the sequencing error removed. Now to do that, there is some lumping and splitting that happens on the sequences to get them into these idealized sequences that don't have any sequencing error. Okay, but then more broadly, we might take those individual sequences and cluster them together. And so I guess the general question is, is that individual amplicon sequence variant a meaningful unit of inference? And if not, what should the meaningful unit of inference be? We also talked about exact sequence variants, which would be, you know, exact sequences that don't have any sequencing error. I think that term kind of gave out because people recognized that we can't really have exact, right? I feel like ASVs are a type of OTU, an operational taxonomic unit, but at the same time, the people that push for ASV over OTU have really strong feelings. And so I think it's probably best when we talk about ASVs to think of them as the individual sequences that are really trying to figure out what the 16S operon sequence in a genome is, right? Minus any sequencing error and minus any further type of clustering. Anyway, that's what we're going to try to work with as our definition going forward. So where are we? Well, we know, going back before this discussion even started, that 16S rRNA gene sequencing is a really powerful technique for describing and comparing microbial communities.
In the past, prior to this discussion of ASVs, we've been analyzing them using classification-based methods. We might take a 16S sequence and compare it to a database like, say, the Ribosomal Database Project or perhaps SILVA or, before that, greengenes. And so we'd figure out, you know, what is the sequence similar to? Alternatively, we might take a group of sequences and then cluster them to each other within that dataset and figure out where the clusters of sequences that are most similar to each other are. And we would then define each cluster as an operational taxonomic unit. Traditionally, that's been at about 3% difference or 97% similarity. In recent years, there's been a push towards amplicon sequence variants. That push, again, is motivated by not having to cluster your sequences, thinking that we have really high quality sequence data so that we can get really close to the exact sequences that were in that bacterial genome. It's also motivated by recognizing that our databases are limited for the reference-based approach and that, well, 3% is kind of a made-up number. And also that with our de novo clustering algorithms, if you repeat the clustering many times, you can get different clusterings. And so it's not an absolute definition, but it's operational. It really depends on the data that you have. And that variability gives people kind of the willies, I guess. So that's kind of where we're at as we think about this analysis. The effort to link 16S to taxonomy goes way back. There's a paper by Stackebrandt and Goebel. I'm sure I'm mispronouncing both of those names, even though my own last name is German and they're German. Anyway, they wrote a paper in the International Journal of Systematic and Evolutionary Microbiology, IJSEM, where they looked at DNA-DNA hybridization between genomes. And so that's kind of the gold standard for classification, where you basically take the genomic DNA from two strains.
You shear it to small bits. You pool it together between these two organisms. You denature it so it's single-stranded. And then you look at the rate at which that DNA then forms double-stranded molecules, right? And so if it comes together more quickly, then the strains or the genomes are more similar to each other. And if it's slower, then they're more distant from each other. And for bacterial taxonomy, 70% is the number that is kind of the gold standard for defining a species. Don't ask me where 70% came from. I feel like that's also kind of made up. Anyway, what Stackebrandt and Goebel did was they looked at that hybridization data and said, well, how far apart are the 16S sequences of genomes that are 70% similar? And what do you think they came up with? Well, about 97% similarity, 3% dissimilarity. And so that's where we have 3% today. Now that we have genome sequence data, people have gone back and done basically the same thing, but instead of using DNA-DNA hybridization data, they use genome sequence data, and they come up with all sorts of numbers that are very small. So like 1%, 0.1%, 0.01%, just keep putting zeros in front of that one. These are very fine-level differences to link genome-level variation to 16S-level variation. And again, I think that's part of the impetus for thinking about amplicon sequence variants and using those as our unit of inference. But the problem is that each bacterial genome can have many copies of the 16S rRNA gene. Sure, some genomes only have one copy, but many genomes have many copies. We saw that some had as many as 20 and perhaps even more than that. And those sequences are not identical to each other. When I'm teaching people about how to analyze microbial data, I often tell them that this is one of the deep dark secrets of microbial ecology: we're analyzing 16S sequences knowing that the copy number is not consistent across all genomes.
Also, again, the copies are not identical to each other. And that's one of the things I feel like these studies are missing. Sure, you could say, well, it should be 0.1% distance or less to be in the same species. But you would then often split individual genomes into multiple species or individual taxa. And I think people kind of have an intuition of this when they say things like, you know, we really can't define a species based on a 16S sequence, but that doesn't stop people from trying, right? And so that's really, I think, one of the struggles and one of the things that hasn't been addressed when thinking about amplicon sequence variants: what is the risk of splitting a genome into multiple bins if we truly have exact sequence variants, or amplicon sequence variants, whatever we want to call them? So, again, because most bacterial genomes have more than one copy of the 16S rRNA gene and they're not identical, if we use too fine of a threshold, there's a risk that we take those copies, those operons, and split them into multiple bins. And these are not wacky weird bugs that have this problem. E. coli K-12 has seven copies of the 16S gene and five different versions of that 16S gene. And so if we truly used exact sequence variants or amplicon sequence variants as our unit of inference, we could split dear old E. coli into five different bins. I don't think that makes any biological sense, and it isn't something that we should really be pushing. Alternatively, if we use too broad of a definition, then we run the risk of lumping things together. So for example, things like Bacillus cereus, Bacillus thuringiensis, and Bacillus anthracis are very closely related. Some would say they're actually the same species. They really only have subtle differences, but their 16S sequences are basically identical to each other.
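To make the splitting risk concrete, here is a minimal sketch in Python. Everything in it is a toy assumption for illustration: the sequences are made up (a 100-nucleotide stand-in, not real E. coli data), the distance is a simple mismatch fraction between already-aligned sequences, and the grouping is plain single linkage chosen only because it is short. It is not the clustering method used in this analysis. It only shows how one genome with seven copies and five versions of a gene lands in five bins at a threshold of zero and in one bin at 3%.

```python
from itertools import combinations


def mismatch_fraction(a: str, b: str) -> float:
    """Fraction of positions that differ between two aligned, equal-length sequences."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(x != y for x, y in zip(a, b)) / len(a)


def count_bins(seqs: list[str], threshold: float) -> int:
    """Count single-linkage bins: any two sequences within `threshold` share a bin."""
    parent = list(range(len(seqs)))

    def find(i: int) -> int:
        # Follow parent pointers to the representative of i's bin.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(seqs)), 2):
        if mismatch_fraction(seqs[i], seqs[j]) <= threshold:
            parent[find(i)] = find(j)
    return len({find(i) for i in range(len(seqs))})


def mutate(seq: str, pos: int) -> str:
    """Return a copy of `seq` with the base at `pos` swapped for a different base."""
    new_base = "T" if seq[pos] != "T" else "A"
    return seq[:pos] + new_base + seq[pos + 1:]


# Hypothetical genome: seven gene copies, five distinct versions (toy data).
base = "ACGT" * 25
versions = [base] + [mutate(base, pos) for pos in (0, 10, 20, 30)]
copies = [versions[0], versions[0], versions[0]] + versions[1:]

bins_exact = count_bins(copies, threshold=0.0)   # every distinct version is its own bin
bins_3pct = count_bins(copies, threshold=0.03)   # versions within 3% are merged
print(f"threshold 0%: {bins_exact} bins; threshold 3%: {bins_3pct} bin(s)")
```

The same function could be pointed at copies from two different species to see the opposite problem: if their sequences are nearly identical, even a very small threshold puts them in one bin.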
And so we run the risk then of lumping those three species together into the same bin, you know, even at just a very small threshold for defining a bacterial species. So we have this tension where we need to trade things off, and that was what we were interested in with this study. In this study, we were interested in using genome data collected by the rrnDB project, looking at sequence-level variation in those genomes and how it related to taxonomy, as well as to different thresholds that we might use to think about defining amplicon sequence variants as well as OTUs. One of the things that I perhaps forgot to mention is that when we think about defining even ASVs, and not even thinking about OTUs, there is some amount of clustering that goes on. There is some amount of noise that we're lumping together to approximate, again, that exact sequence variant. And so in DADA2, there's this parameter called omega. And mothur has an argument called diffs, which is again the level of difference that we're willing to accept between a more abundant sequence and a rarer sequence. So even in defining ASVs, there is still some lumping and splitting that goes on, and we don't really have a good theoretical basis for what that number should be. Anyway, that's the goal of this study. As we dug into this, and of course the questions and the study evolved as we went along, we had a couple of different lines of inquiry using that data from the rrnDB. The first was showing that copy number varied by taxonomy. They've shown this in the papers for that database: within a taxonomic lineage, the number of copies is fairly consistent. What we've seen is that when a genome has more copies, it also tends to have more variants, right? And what we've also seen is that longer full-length sequences have more variants than the shorter regions.
So if we look at data for full length, the V1 to V9, there are going to be more variants per genome than, say, at the V4 region or the V3-V4 or V4-V5 region. We also saw that some species have a lot more genomes in the database than others, right? Like E. coli had close to a thousand, if not more, genomes in the database, whereas others maybe only had one. And what we could see from that imbalance was that as we sample more and more genomes from a lineage, we still see new variants of the 16S gene popping up and we don't exhaust all possible versions of the 16S gene. So as much as we'd like to think that there are only so many variants per species, the number does seem to keep going up with additional genome sampling. That got us thinking about ESVs or ASVs, whatever we want to call those. Again, if we could define a perfect representation of what was in the genome for the 16S sequence, what would things look like? And again, the problem then is that we split things apart. So if we then scale back and we look at a broader threshold for defining a taxonomic unit or the unit of inference, the OTU, how does that splitting trade off against the lumping that we might see between different species of bacteria? To do this, we looked at pulling representative sequences out of each species and comparing the operons in that species against other species. And what we found was that, again, there is this trade-off between lumping and splitting and, somewhat surprisingly, that if you went out to about three to five or six percent, depending on the region you're looking at, you then kind of balance the sensitivity and specificity or, as we've been saying, the lumping and splitting of different bacterial taxa. And so that's pretty surprising: if we want to mitigate the risk of splitting things apart, then we need to pull back to a broader definition. And I'm personally cool with that because I don't think there's any biological sense in taking E.
coli and splitting it into five different units of inference. Again, if we use a broader definition of a taxonomic unit based on distance, then we could lump all those together into one unit of inference, and I think that's cool. And so if we're lumping Bacillus cereus, anthracis, and thuringiensis together in one OTU, that's fine by me as long as we're not splitting them into five or six different groups. So that's basically the data that we have been generating over the past 48, 50 episodes. And so what do we conclude? Frankly, I don't think that an ASV, defined as a perfect sequence coming off the sequencer that's been cleaned up and denoised, is an appropriate unit of inference because, again, the problem is that we're going to split things like E. coli into many different bins. And there's just no biological argument for splitting E. coli, or other things that have multiple copies of the 16S gene with multiple versions, into different bins. So it makes a lot more sense to lump those together into some type of operational taxonomic unit that we can use for further analysis. And again, if we want to limit splitting, we need to look at broader definitions. And if you're interested in a species-level definition, you're not going to get that from an ASV, right? You're not going to get that from a 16S gene, even if you have a full-length 16S gene. You really need genome sequence data to be able to look at the species level. With 16S, we're really asking just way, way, way too much of the gene to delineate bacteria based on species. The other thing I'll say is that when we define species of bacteria, it's humans, largely, that are doing it. And sure, there's this DNA-DNA hybridization approach, but even that 70% is somewhat just kind of ad hoc. And even there, there are clear examples of things that we know by genome sequencing should be separate species or should still be merged together.
So even DNA-DNA hybridization isn't quite the gold standard, nor is genome sequencing the gold standard, that we really hold it up to be. Because these species definitions are not universally applied across bacteria, you know, it's not possible to take something objective like a distance between sequences and expect that to fit our kind of biased representation of what we think is important. Overall, I'm pretty surprised that the 3% definition that's been widely used is pretty useful across the field. If anything, maybe it could even be broader, but because it's so widely used, I'm cool with using 3%, and I was kind of gratified that it actually turned out to work pretty well. Does that mean 3% represents a species? Absolutely not, because we're still lumping things together, and as I've said, we cannot represent species with a 16S sequence. So that's the story I want to tell. Again, I think it's something novel. It doesn't make a huge contribution to the field, but I think it's an important contribution to maybe get people to take a step back and think, really, what should the unit of inference be? So I'm going to write a paper, and we're going to do that over the next episodes during the month of January after taking a break here for the holidays. The first step to help us get over the problem of writer's block is to create a file and to start mapping things out. We're going to write the paper using R Markdown, which is probably very different from how you've written papers in the past, but isn't going to be that far off from how we've been doing our exploratory data analysis. So I'm in my project root directory. I have a directory called submission, and that is where I put things for writing a paper. I'm going to create a new file in submission that is going to be called manuscript. I'm going to give it the .Rmd extension because, even though we're not going to make it an R Markdown document just yet, we'll get there soon enough.
So I'll go ahead and open up submission/manuscript.Rmd. The way I like to start my manuscript is with an outline. I hate the idea of starting your paper and just writing, right? I don't think anybody succeeds that way. The other thing I don't like is outlining by figure legends. I think that's kind of a horrible way to start as well. I know that I have many colleagues who would disagree with me on that. They think that the figure legends are the way to go. I think about the figures last. Sure, I've got this battery of figures that we've generated in our exploratory data analysis that I have in my head and that I've got examples of in my exploratory directory, but I want to tell the story. And I'm going to use the figures to help tell that story, right? You know, I'm not writing a paper around a set of figures. I'm writing the paper and then using the figures to help buttress my claims, okay? I know it's a slightly different way of thinking about things, but that really helps me to write, because I find that I and others spend a lot of time working on figures, and in the end, we might throw the figure away. So why not write the paper first, figure out what figures we need, and then add the figures in? So that's what I'm going to do. So how do we get started? Well, an easy way to get working is to lay out the sections that we want: an abstract, an introduction, a results section, a conclusion or discussion section, and then a materials and methods section and an acknowledgments section, right? That is our basic outline. And we'll talk about the structure of this and what goes into each section as we go along. Something to know about is a technique called ABT: and, but, therefore. This general layout follows the and, but, therefore approach. And we'll be talking about that in an episode right after the break.
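If you'd rather script that first step than type it by hand, here is a minimal sketch in Python. The `submission` directory and the `manuscript.Rmd` file name follow what I described above; the `##` heading style, the placeholder bullet under each heading, and the `write_outline` function name are my own assumptions for illustration, and the demo writes into a temporary directory rather than a real project.

```python
import tempfile
from pathlib import Path

# Section names taken from the basic outline described above.
SECTIONS = [
    "Abstract",
    "Introduction",
    "Results",
    "Discussion",
    "Materials and Methods",
    "Acknowledgments",
]


def write_outline(project_root: Path, sections: list[str] = SECTIONS) -> Path:
    """Create submission/manuscript.Rmd holding one heading per section."""
    submission = project_root / "submission"
    submission.mkdir(parents=True, exist_ok=True)
    manuscript = submission / "manuscript.Rmd"
    # Each section gets a heading and a placeholder bullet for outline notes.
    body = "\n\n".join(f"## {name}\n\n* (outline notes go here)" for name in sections)
    manuscript.write_text(body + "\n", encoding="utf-8")
    return manuscript


# Demo in a throwaway directory so nothing in a real project is touched.
with tempfile.TemporaryDirectory() as tmp:
    path = write_outline(Path(tmp))
    relative_path = path.relative_to(tmp).as_posix()
    text = path.read_text(encoding="utf-8")

print(relative_path)
print(text)
```

The point is only to get an empty skeleton on disk quickly; the bullets under each heading are where the rough outline goes.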
So what I just did to start out this episode was to orally describe the place of the research, the motivation behind the research, what we did, what we found, and my takeaways, right? That maybe took 15, 20 minutes to do. If you can do that out loud, great. I know sometimes there are competitions for a three-minute thesis where you try to describe your thesis or your work in three minutes or maybe one minute or 30 seconds or 10 seconds, right? So you take those ideas, and if you can transcribe that, awesome. Otherwise, think back on what you said, and hopefully that gives you some type of rough outline. So I did that, right? And so I'm going to replace what I had there with my outline, and you'll see how I have an introduction, I have a results section, I have some conclusions, I have the beginnings of materials and methods, some figures, and I guess I could also add the acknowledgments, right? Again, I'm not writing great sentences in here, things might change, and I don't have a lot of details, but this is my outline. And as I write, I can pick the sections that I want to work on. Generally, I would start with the results section, and then I would perhaps go to conclusions and then back to the introduction and then work on the materials and methods. The intro is really going to help stage the results, so I don't want to write the intro until I've written the results and know the context of what's going on. The conclusions are also going to help synthesize what we saw in the results, so I generally want to write those after I write my results section too. All right, I've got my outline. I've started working on the paper. That's wonderful. One other thing that I think is very important to do at this stage in working on a paper is to think about where you are going to submit it, because every journal has different specifications for what they're looking for.
Maybe that's different formats, maybe that's different sections, maybe that's different standards for what figures look like, things like that, okay? Again, you don't have to narrow things down to one journal that you want to submit to, but it's helpful to have a guide for where you might want to submit the paper. And in full disclosure, I am the chair of the Journals Committee for the American Society for Microbiology, ASM. They publish a lot of great journals that I've published in frequently. I think it's important to support our professional societies, and I think the ASM product is amazing. I've submitted to lots of other journals and publishers with horrible experiences that I could just complain about forever, but I've had really positive experiences with ASM and their journals. And so a couple of journals I might think about submitting to would be perhaps Applied and Environmental Microbiology. That's where I published the mothur paper and the daughter and the sons papers and lots of other papers. Publishing in that journal has really helped my career. More recently, we've published a lot of stuff in mSphere at ASM. It's an open access journal with a broad general audience. One of the things I know about AEM is that they're not always so receptive to taking methods papers, kind of like this. I'm not really saying much about microbiology here. I'm saying something about how to interpret microbiology. And so AEM isn't always so receptive to that, in my experience as an author and also while I was an editor at AEM, whereas mSphere is much more open to that kind of thing. Another journal I might think about would be mSystems, because these types of techniques are used a lot in mSystems, but I'm not convinced this is really systems biology and should be published there. Again, I don't really care where it gets published. I just want it published.
One of the things I also like about mSphere is that I get really quick turnaround times on the papers we've submitted there. Generally in about a month, four to five weeks, we get a decision back. I've had really good results with the reviews I get back; they're always pretty helpful. Again, everybody always has different experiences. So I'm kind of leaning towards mSphere at this point, but possibly AEM, I don't know. What I want to do next is open up my browser and go to msphere.asm.org, their website. Let me make this a little bit bigger. What we're looking for is a document called the Instructions to Authors. You can go to For Authors, and there's a whole list here of all sorts of things you should be aware of: getting started, submit a manuscript, scope. Scope is always a really important thing to look at when you're looking at a journal. It tells us that mSphere is a multidisciplinary open access journal that focuses on rapid publication of fundamental contributions to our understanding of microbiology. Its scope reflects the immense range of fields in the microbial sciences. That's where I want to go. If I was perhaps working on cancer microbiome research, then maybe I might want to go to a journal published by AACR, the American Association for Cancer Research, because they would be a little bit more focused on cancer research, right? If I was developing a bioinformatics tool, maybe I would think about more of a bioinformatics-type journal like PLOS Computational Biology. You know, you really want to be aware of the scope. I'm not going to submit this type of work to the Journal of Clinical Microbiology because this isn't clinical microbiology and it is not at all within their scope. So you can save yourself a lot of heartache and pain by making sure your work fits within the scope. If you're not sure whether it fits in the scope, email the editors. Ask them, would this fit with your journal? Okay.
So, Instructions to Authors. This then gives you, again, a statement of the scope, and comments about ethics, copyright, warranties, all these great things. And so I see that mSphere welcomes papers in any format for the initial submission. So I don't really have to care how things are formatted. I don't really have to care about how my references are formatted. They will take it. Now on the next round, they will care. But for now, we'll be in good shape submitting it. It's all electronic at this point. The review process is very slick and easy. And there are things in here, like if you've submitted a paper to a non-ASM journal, you can submit to mSphere effectively as a resubmission, where you'd package up your revisions, your reviews, and your rebuttal and submit that to mSphere. And they might give you an expedited review, which is pretty sweet also. There's all this stuff. The other thing I'm looking at are things like editorial style. While I can write it in any format I want, it's easier on the reviewers if I follow the kind of format that they expect to get. They have information here about article word counts. For research articles, it's about 5,000 words, and I've never had a problem with using too many words. I'm thinking about an observation, and that would be a shorter-form paper. This is also a reason to perhaps prefer mSphere over AEM: AEM doesn't have the short-form papers, but mSphere does, at about 1,200 words. Opinions and hypotheses could perhaps be 2,500 words. I'm not sure this is really an opinion or a hypothesis. I kind of bristle at that because I've got data, right? I'm backing up our statements with data. And so I don't know that it's so much an opinion or a hypothesis. Anyway, if we think about an observation, then 1,200 words should perhaps be where we're at. We can see what they want in the different sections.
Most of this is for a typical research paper, which is most of what mSphere is going to publish. We see things about reference formatting. And let's see. I think down below here, there's a comment on observations: 1,200 words, no more than two figures and 25 references. And so that might sound... So the body of an observation may have paragraph lead-ins, but perhaps not section titles. And we would need an abstract and an Importance section. So, again, this gives me a good sense of the structure of the paper and what it should look like if I want it to be an observation-formatted paper versus a full research paper. I'm not so sold that this really needs to be a full research paper. And I know sometimes people are annoyed at word limits and figure limits. I find that they actually might help me to clarify my writing because if I know that I only have 1,200 words, well, I'm going to be much more direct and to the point and not just filling space like when I don't have a word limit, right? And if I only have two figures, well, maybe I'd have one figure related specifically to ASVs or ESVs and one figure about this definition of our unit of inference and what threshold that should be. So I'm kind of liking this idea of an observation and a shorter-form paper. So I will come back here and add in my abstract section, which will be 250 words, and an Importance section, which would be 150 words. Okay? This is how I get started in thinking about a paper: I lay it out. Perhaps you've given a talk, and you can then use that talk as an outline for thinking about the structure of your paper. I think it's really important to get the structure of the paper right first before you start wordsmithing and doing copy-editing type things.
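Once there is text under those headings, a quick script can keep an eye on the word limits. This is a minimal sketch in Python using the numbers read off the page above (1,200 words for the body of an observation, 250 for the abstract, 150 for the Importance section). Counting whitespace-separated tokens is my own assumption; the journal may count differently, for example in how it treats references or figure legends. The `check_budget` name and the draft text are placeholders.

```python
import re

# Word limits for an observation-format paper, as read from the author instructions above.
LIMITS = {"Abstract": 250, "Importance": 150, "Body": 1200}


def count_words(text: str) -> int:
    """Count whitespace-separated tokens (an assumed, simplistic definition of a word)."""
    return len(re.findall(r"\S+", text))


def check_budget(
    sections: dict[str, str], limits: dict[str, int] = LIMITS
) -> dict[str, tuple[int, int, bool]]:
    """Map each section name to (word count, limit, within limit?)."""
    report = {}
    for name, limit in limits.items():
        n_words = count_words(sections.get(name, ""))
        report[name] = (n_words, limit, n_words <= limit)
    return report


# Placeholder draft: repeated filler words stand in for real text.
draft = {
    "Abstract": "word " * 240,
    "Importance": "word " * 160,
    "Body": "word " * 900,
}

report = check_budget(draft)
for name, (n_words, limit, ok) in report.items():
    print(f"{name}: {n_words}/{limit} words {'OK' if ok else 'OVER'}")
```

With the placeholder draft, the Importance section is flagged as over its limit while the abstract and body are within theirs.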
Again, for the same reason that I don't like to build a paper around figures, you don't want to build a paper around, like, key sentences and things like that because you might throw away that sentence in the end just like you might throw away that figure in the end. Get the structure right. How are you going to tell a story? And one of the things I'm going to emphasize over the coming episodes is how we need to tell a story. It's not just a list of facts. We need to tell the story. We need to lead the reader through the paper, help them understand why this is important and what context we're presenting this material in. All right. So again, that's what the outline gives us and we might come back later and change the structure of the paper, but we've at least started with structure and we're thinking at a very broad level and as we write and do our revisions, then we'll do more of the fine-tuning of our words and the language that we use, okay? So I hope this helps you think about your own writing and how you can get going. I know people are always thinking about resolutions for the upcoming year, and so perhaps this will help you get writing on your papers a little bit quicker. Think about an outline. I've told people in my research group or collaborators or people I'm mentoring, where's your outline? Give me an outline. And to me, if you know me well enough, you know that a list of figure legends is not an outline. Give me an outline, something that you can use to then flesh out as you start writing paragraphs and sentences and sections of your paper. All right. Well, like I said, this will be the last episode until January. We'll be back around January 4th. So I hope you have a great holiday. Be thinking about your paper writing and we'll see you in the new year.