Hey, I've got a question for you. Do you know how to eat an elephant? One bite at a time. It's the same thing with writing. In today's episode of Code Club, I'm gonna show you my process for getting out a first rough draft.

Hey folks, I'm Pat Schloss and welcome to Code Club. In each episode, I try to show you how I apply principles of reproducible research to the analysis of an interesting biological question. Today's episode is gonna be a little bit different because it's not really going to be showing things that are explicitly about reproducible research. Instead, it's gonna be about communicating research. And if I think about it a little bit harder, I would say that does have a lot to do with reproducibility, because if I can't communicate what I have done in basic prose, then it's not gonna be reproducible no matter what, right? And so today, I'm gonna show you how, again, I eat that elephant one bite at a time.

Now, in the last episode, I talked about forming an outline and how powerful an outline can be to quickly get ideas down on paper to help me conceptualize the story. In today's episode, I'm gonna show you how I start to flesh out that outline to create a really solid first draft.

Before we get going, I want to introduce you to an idea that really blew my mind. I heard about this for the first time a couple of years ago, and it wasn't until the last year, when I actually listened to the audiobook version of the book Houston, We Have a Narrative, that things really started to crystallize for me. It's a technique proposed by Randy Olson, and he is very quick to acknowledge that it didn't originate with him, but he popularized it. The idea is ABT: and, but, therefore. These three things are key to telling a good story. And yes, when we're talking about science, we have a story to tell. If we don't tell a story, it's gonna be boring. No one's gonna read it. People love to say, "Oh, but my data will speak for themselves."
Like, no, they won't. Data, they're digits, they're numbers, they're pictures. We need to put them into context for the reader so that the reader knows what we are trying to say. What do the data say? And so again, this and, but, therefore structure really helps us to tell a story.

So let me try to tell you a story. Okay. So once upon a time, there was this girl and the girl lived in Kansas and the girl lived with her aunt and uncle and they lived on a farm and everything was going really well. And one day there was a mean lady and she wasn't really nice and the girl hit her head and the girl had a dream and the girl met these various characters while she was having a dream and a bunch of other stuff happened and the girl woke up and her life was changed.

So you may have heard me say the word "and" a bunch in there, right? And perhaps you recognize that story a little bit. If you grew up in the United States, I guess that story is fairly familiar. But let me tell it to you a little bit differently, right? So there was this girl and she grew up in Kansas and it was a really Dust Bowl type of environment and she lived with her aunt and uncle and, we don't really know why, but she doesn't quite feel like she belongs there. She had a dog and she loved that dog because the dog gave her a sense of belonging. But there was this mean, cranky woman that didn't like the dog, and the cranky woman threatened to kill the dog and to ruin this poor girl's life. Therefore the girl tried to run away, and the girl found this carnival barker kind of guy who was selling snake oil, and he didn't really have a lot of answers for her. So she decided to go back home to her aunt and uncle, and there was a storm and she got hit on the head, and instead of waking up in her own bed, she woke up in this magical land. So you can hear me telling a much different type of story here, and I'm not going to tell you the rest of the story.
Go see the movie. It's The Wizard of Oz, right? So the first example, where it's a bunch of and, and, and, and, is very boring, and that's oftentimes how we write a lot of our papers. Grab a lot of abstracts off of PubMed and you'll see them written in this and, and, and, and style, right? There's no tension. There's no conflict. There's nothing driving the story. There's no motivation. If you can throw in those buts (and you don't have to say "but" every time; there are other words that you can use in place of "but"), you build that conflict. You drive the story, and then how does that conflict get resolved? With a therefore, right?

And so if we think about this in terms of science, we could lay out our literature in the introduction and we could say, we know all these things, and, but, there's this gap in knowledge. There's this thing we just don't know yet. Therefore, I'm gonna propose to do this study, right? So in those first couple of paragraphs of your introduction, you see an and, but, therefore, right? And then we move on to our results section, and the results section can also tend to be very "and"-y, where each section is a different set of facts and they're all clumped together, but that's kind of boring, right? Instead you might say, the first thing we were interested in was this question. Therefore, we did this experiment. And then in the next paragraph: but you could think about it this other way. Therefore, we did this other experiment and we saw this other result, right? And so you have this and, but, therefore that very easily flows in our scientific writing. Again, if you read most of our papers, you don't see that.

Now the flip side of that, which I don't think we're at risk of but is on my mind, is the and, but, but, but, but, but, but, where there's just way too much conflict going on and nothing ever gets resolved, right?
So I don't know if you've seen this latest Pixar movie, Soul, which was very sweet, but as I was watching it (I now can't help but watch movies and think about this and, but, therefore structure) they just kept going but, but, but, but. There just kept being new conflict that took you on to different tangents of the story. And eventually everything gets resolved, but it's way too complex and way too hard for at least me to keep track of, much less the eight-year-old kids in my house. So there is a balance that we need between and, and, and, and; but, but, but, but; or and, therefore, therefore, therefore, right? And I think having this and, but, therefore structure helps, and it's something I'll try to show you as I think about the outlining of my own manuscript.

Okay. So what we're going to do today is I'm going to show you how we take the outline from the last episode, and I'm going to flesh it out a little bit more to bring in this and, but, therefore structure. And of course I would strongly recommend that you go check out Randy Olson's book, Houston, We Have a Narrative. It's really good. He's got all sorts of worksheets in there to help you think about how you tell your story.

Okay. So what I'd like to do is take the outline that we generated in the last episode, and I'm going to take this first section of the results section and show you how I take this rudimentary outline and begin to pull it apart, make it bigger, and make it more substantial. And hopefully you'll see, as I develop this and some of the other paragraphs, that it's a fairly easy way. I don't want to say it's easy, but it's a more direct way, perhaps, of generating a solid first draft that you can then go on and revise and refine and get into better shape. In this first section of the results section, we're looking at amplicon sequence variants, looking at 16S sequences within each genome that are unique.
And the story I want to tell is basically that copy number varies by taxonomy, that we have wide variation in the number of copies in our genomes. The more copies that a genome has, the more variants it tends to have per genome. Also, we see that full-length sequences have more variants than subregions. And as more sequences are added to the species, we actually see more variants appear. Okay, so those are the four things. And as I said that, thinking about it, it's very easy for me to get into "and" mode, right? I'm not a very good storyteller by nature, I think. I guess if I worked on it, I could get better. And so what I like to do as I'm fleshing out this outline is put in, in parentheses, whether this is an "and" point, a "but" point, or a "therefore" point.

So what I'm going to start out with is to say, "To investigate the variation in the number of copies of the 16S gene per genome, as well as the intragenomic variation among copies of the 16S rRNA gene..." So I'm setting up the story, right? What did I do? So this is an "and", right? "...I obtained 16S sequences from the rrnDB."

Okay, and so again, I'm trying not to worry about typos. I'm also trying not to worry about formatting, like italicization, or getting my jargon right. Normally I wouldn't say "16S sequences". I'd say "16S rRNA gene sequences" and I'd spell that out. But right now I'm trying to get the information out. And as you know, I'm great at typos, so I'm trying not to worry about those typos just yet. I'm also not going to worry about specifics like specific numbers or values, and I'm not going to go and grab a link, like the URL for the rrnDB, because those things tend to take me away and I become like a dog and it's like, "Squirrel!", right? I'm really easily distracted. And so one of the things I tend to do is I'll turn off the internet.
Literally, I'll pull the ethernet cord out or I'll turn off my Wi-Fi and type. And it's amazing how much more productive I can be if I don't have those other distractions. So I'm trying to remove distractions, right?

So I'm setting the stage, right? I'm interested in the variation in the number of copies of the 16S gene per genome, as well as the intragenomic variation among copies of the 16S gene. So I went out and I got 16S sequences from the rrnDB, right? That's my first bullet point, fleshing out my outline.

And another "and". So, "Each genome had between...", I'm going to say one, I don't know what it is. Probably something like Mycobacterium tuberculosis would be a good example, so maybe I'll say "e.g." that, "...and..." some other. I think it was like 20-something. I don't know what it was, but I'm not going to worry about it. I know I'm going to show the range, right? And I can then put in the taxonomic name for that organism, right? "...copies of the 16S rRNA gene." Okay? Great.

And so this is another "and": "I found that as the number of copies of the 16S rRNA gene increased within a genome, the number of variants of the gene also increased." Right? So this is my other "and". As I'm looking at my checklist of things I wanted to be sure to cover, I'm seeing that I've taken on this first issue: more copies, more variants per genome is there. And maybe this one will refer to a figure. Again, I'm not wrapping this around figures. I don't know what figures I want yet. I know the story I want, but just like I can read a book and not have any pictures, I personally feel like I should be able to read a paper without having any figures. Yeah, the figures add something to the story, but they're not the story, right?

And so maybe now I'll go into a specific and I'll do another "and". And I'll say, "The average number of variants per genome was 0.xx per genome per copy number."
And that's probably some really clumsy wording. That was for full-length sequences, and it was 0.xx for V4, this for V3-V4, and then that for V4-V5, right? So this gives a good overview of that first statement, where we got the data, giving a summary of the variation in the copies. You know what? Perhaps we might think of this third bullet as a kind of "but": but I found that the number of variants increased with the number of copies. Hmm, right? What am I going to do with that? For now, I'll leave this as an "and", and maybe we'll come back and wordsmith things later.

So what I'll do now is put in a "but" statement and a "therefore" statement to try to pull this paragraph together and move on to the next step of the analysis. So I'll do "but": "Although...", so "although" is a "but" word, right? "Although bacterial species tended to have a consistent number of rrn copies per genome, I found that as more genomes were sampled, the number of variants grew." Okay. So not only do we have multiple variants per genome of a species, but when we look across many genomes of the same species, we have even more variants. So it's not that all E. coli have the same five variants. As we saw, they have like a thousand variants, right?

All right. So now we're ready for our "therefore" statement, right? What are we going to do with that? And you know what? Before I do that, I think I'm going to throw in a couple of examples to help bring this home for people. So, "For example, Mycobacterium has one copy of the rrn operon in its genome, but across all...", however many genomes there are, so I'll say "...all Mycobacterium genomes in the rrnDB, there were XXX variants of the 16S gene. Similarly, E. coli has seven copies per genome and five distinct versions per genome." And "about", I'll say "about": again, I don't want to be too specific with the numbers. We can go back into the databases and shore up these numbers.
But we know the story well enough that we can, again, talk about it, and that's what I'm trying to do in this first draft. "...per genome. And when we consider all (I think there are 900 or so) genomes in the rrnDB, there were (I think) over a thousand variants of the 16S gene." Okay. And yeah, I've got all sorts of typos in here, but I'm really just worried about getting the story out. You might think of those typos like little goofs, or like smoothing the rough edges of a story when you're talking to somebody. You're not lying. You're just not getting a fully polished story, and that's fine.

So, therefore: "These observations highlight the risk of selecting a threshold that is too fine when defining..." and I'll say "OTU" or "unit of inference", I'm not sure what I want to call it at this point, "...because we may split a single genome into multiple OTUs. This does not make biological sense." Okay. And so what do you think the next paragraph is going to be about? It's going to be about using a distance threshold to avoid splitting a genome into multiple bins. But the risk is, if we use too wide or too coarse a definition, that we're going to be lumping together different bacterial species.

And so this is what I do when I'm writing paragraphs for a grant proposal, for a paper, for abstracts. Something I'm really trying to work on myself, if I could think better on my toes, is presentations. When I give a presentation I tend to get into "and" mode, where it's I did this, and then I did this, and then I did this, and then I did this, instead of saying I did this, and I wondered, what about this? And therefore I did this, right? That would probably tell a far more compelling story, but I'm just not quick enough on my toes, and maybe I'm too lazy to rework my talks to think about this ABT structure.
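The tagging described above, where each outline bullet is marked as an "and", "but", or "therefore" point, lends itself to a quick mechanical check. What follows is only a minimal sketch in Python, not something shown in the episode: the `outline` list paraphrases the bullets from this paragraph, and the function names and the `max_run` limit of four are assumptions chosen for illustration.

```python
from itertools import groupby

# Hypothetical outline: each bullet is tagged by hand as "and", "but", or "therefore".
# The bullet text paraphrases the paragraph drafted above; XX is still a placeholder.
outline = [
    ("and", "Obtained 16S rRNA gene sequences from the rrnDB"),
    ("and", "Each genome had between 1 and XX copies of the gene"),
    ("and", "More copies per genome, more variants per genome"),
    ("and", "Average variants per copy: full length vs. V4, V3-V4, V4-V5"),
    ("but", "More genomes sampled per species, more variants"),
    ("therefore", "Too fine a threshold risks splitting a genome into multiple OTUs"),
]


def tag_runs(items):
    """Return (tag, run_length) for each stretch of consecutive identical tags."""
    return [(tag, len(list(group))) for tag, group in groupby(tag for tag, _ in items)]


def check_abt(items, max_run=4):
    """Return a list of warnings; an empty list means the paragraph passes.

    max_run is an arbitrary limit picked for this sketch, not a rule from the book.
    """
    warnings = []
    for tag, length in tag_runs(items):
        if length > max_run:
            warnings.append(f"{length} '{tag}' points in a row")
    tags = [tag for tag, _ in items]
    if "but" not in tags:
        warnings.append("no 'but' point: no tension")
    if not tags or tags[-1] != "therefore":
        warnings.append("does not end on a 'therefore'")
    return warnings


print(tag_runs(outline))
print(check_abt(outline))
```

A paragraph that is all "and" points would come back with warnings about the long run, the missing "but", and the missing closing "therefore", which is the same judgment made by eye in the episode.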
So again, I like putting in my ands and buts and therefores so I know that I don't have too many ands in a row, I don't have too many buts in a row, and that I tie everything together with a therefore. So what I've done is I've gone through and done this for all of the paragraphs in this paper that I'm working on. And again, this is an observation-format paper; it's not a full-length, really long paper. Let me show you what I wound up with, which is very close to what I have for this first results section, and I'll show you what we do next.

Okay, so here is the manuscript with my full outline, and so this is my introduction and my results section. You'll see I have three paragraphs, or three sections, here in my results section, and then I have a conclusion section where there's one or two paragraphs. The other thing I've done is insert my budget for the number of words. Observations in mSphere, which is my target, I think is 1200 words and two figures. And so I figure I want the meat to be in the results, so that's going to be about 700 words. And then my intro and my discussion, or conclusions, I'll make about 250 words each. So a nice paragraph, and I think that will work well.

And you'll see again that this was that first paragraph that I worked on with you all, and I've cleaned it up a little bit, maybe reworded things a little bit differently. But you can see I've got and, and, and, but, therefore, right? And you can see here that I've got perhaps another "and" statement: a method to avoid splitting a single genome into multiple units of inference is to cluster 16S sequences together that are similar. Right, that's a statement of fact. But this risks lumping things together that perhaps we biologically don't think should be lumped together. Therefore I assessed the impact of the threshold for lumping and splitting 16S sequences. Right. And you can see as I go through this what things look like.
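The word budget mentioned above (roughly 1200 words in total, with about 700 for the results and about 250 each for the introduction and conclusions) can be tracked with a few lines of code. This is a rough sketch only: the `draft` dictionary holds stand-in text rather than the real manuscript, and the function names are made up for this example.

```python
# Word budget as stated in the episode; the 1200-word total was given as "I think".
budget = {"introduction": 250, "results": 700, "conclusions": 250}


def word_count(text):
    """Count whitespace-separated words."""
    return len(text.split())


def budget_report(sections, budget):
    """Map each budgeted section to (words_written, words_budgeted, difference)."""
    report = {}
    for name, limit in budget.items():
        written = word_count(sections.get(name, ""))
        report[name] = (written, limit, written - limit)
    return report


# Hypothetical draft text standing in for the real manuscript sections.
draft = {
    "introduction": "Sequencing is a powerful technique for describing microbial communities",
    "results": "To investigate the variation in the number of copies of the gene per genome",
    "conclusions": "",
}

for name, (written, limit, diff) in budget_report(draft, budget).items():
    print(f"{name}: {written}/{limit} words ({diff:+d})")
```

A negative difference means there is still room in that section; a positive one means it is over budget and is a candidate for trimming during revision.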
I've also got my conclusions with my and, but, therefore as well. And so that all, I think, looks pretty good. Okay. Hopefully I'm not giving you a headache scrolling all over the place.

So there's so much I could say about how I like to write introductions, how I like to write results, and how I like to write discussions. Maybe I'll say one thing about each. For an introduction, I think of the intro as a funnel, where I like to start broad and then bring the reader in to my specific problem, my specific question. Right. So, sequencing is a very powerful technique for describing and comparing microbial communities. That's pertinent to the question that I'm interested in. It's broad. And then I'm going to bring this in really tight: the goal of this study was to quantify the risk of splitting a single genome into multiple bins and the risk of lumping together different bacterial species in the same bin, and then perhaps I could say, depending on the threshold that I use to define an operational taxonomic unit. Right. So starting broad, identifying problems, and driving a narrative forward to a very specific problem that I'm going to address in my study.

In the conclusion (and again, this conclusion section is much shorter than in a normal paper, but I think the idea is still there) we start with an overall synthesis of the story so far. You don't want to restate all the results, but you state the synthesis; this is the first time that you synthesize the results. The results of this analysis demonstrate that there's a significant risk of splitting single genomes into multiple bins if too fine a threshold is applied to defining an OTU, and we have this ongoing problem of what is a meaningful taxonomic unit of inference. Right. And so you synthesize what you've said so far and then try to put it in the broader context of what's going on in the field.
How does this result advance the science further? Maybe in a normal paper you'd have three paragraphs of that, and maybe you'd also have a paragraph with some caveats, like, well, these are the pitfalls of our analysis and something that we should look at later. That's, again, like another "but", right? And then the "therefore" paragraph: the last statement in the paper, the one you're going to leave the reader with, of why this was so important, right? And so here I have: surprisingly, three percent performs pretty well as the operational definition that limits the splitting of bacterial genomes and avoids the lumping of bacterial species.

So all of this is in flux, and if you look at this and you look at the initial rough outline I had, you probably saw things changed a little bit here and there, and that's fine, right? We're not writing in stone tablets. We're writing in electrons here, and I can very easily copy and paste and move things around. And so I don't want to get too wedded to whatever I'm writing as I go about doing this. I've also done a rough outline for my materials and methods, and I've started thinking about my acknowledgements section.

Okay, so some of these sections you could see very easily lending themselves to becoming a paragraph, right? And so, depending on how much thought I've put into the actual wording of a sentence versus just spitting out text to put ideas down on the screen, I can begin to concatenate sentences together to become paragraphs. For example, let me take this first paragraph of the results section again. I can remove this hyphen: to investigate the variation in the number of copies, blah blah blah, we obtained reference 16S sequences from the rrnDB. And then I can remove that, and: each genome had between one and blah copies of the operon. I found that as the number of copies of the 16S
gene increased, the number of variants of the gene found within each genome also increased. Hmm, intrigue. And: within an individual genome there are an average of however many variants per copy of the full-length 16S rRNA gene, and an average of whatever, whatever, and whatever variants when considering the V4, V3-V4, and V4-V5 regions of the gene. Period. Although a species tended to have a consistent number of 16S rRNA gene copies per genome, the number of total variants increased with the number of genomes that were sampled. Okay. And I might think, I get two figures, so maybe I have figure one here in this first paragraph, and this would be figure 1A and this would be figure 1B. And I can then come to my examples: for example, tuberculosis generally has one copy of the gene per genome, but across these genomes that were in the database there are so many versions of the gene. Similarly, an E. coli genome typically has seven copies, with between so many and so many distinct full-length sequences per genome. Across all the E. coli genomes that have been sequenced, I'll put in E.
coli genomes that have been sequenced, there are blah blah blah different variants of the gene. Therefore, these observations highlight the risk of selecting a threshold for defining units of inference that is too narrow, because it is possible to split a single genome into multiple bins, which does not make biological sense. Okay.

And as I'm writing this, you can still see that I struggle with language. What do I want to call these things? Do I want to call them ASVs, OTUs, units of inference? I don't care right now. I will have to care before I submit the PDF, before I submit the paper, but for right now I'm just trying to get a rough draft, and bam, I've got that right here. And so you can see that I went from a very simple outline to fleshing it out, thinking about this and, but, therefore structure, and then piecing it together into a single paragraph.

All right, so you'll see that I now have an introduction. This is that opening paragraph of my introduction. It's telling me that I've got 308 words, so it's a little bit long. I'm okay with that. Again, we'll do some more editing later. Actually, when I look at other observation papers from mSphere that are online, they're never under 1200 words, and I think I found one that was about 2400 words. Anyway, this is that paragraph that we were working with earlier, and so you can see that this is looking like a paper, right? It's a short paper, and I've got lots of practice doing this. I've done this before. But you'll see it comes together, and it's right at about 1200 words. If I highlight everything, Atom, down here at the bottom, will tell me it's at 1264 words, so I'm happy with that. I've got a materials and methods section that I've put in. My general approach to writing this is I'll write the results, and then I'll write the discussion, the intro, and then I'll write the methods and materials section, and then later I'll do the abstract and importance section, and I think I'll save that for a future episode. Okay, so this is my first draft of the
paper. Is it perfect? No. Would I submit this? No. Would I want you to read this in detail? No, because there are going to be errors. There are going to be things that, as I write, I think about and want to come back and reassess. There's also a lot of generic language in here where I need to come back and find the actual values. So in the next episode I'll show you how we can use our code to fill in what I have as X's, my placeholders, with real numbers.

Okay, so I love writing and sharing what I'm doing. I've been really excited to see the feedback and response I've gotten to the previous two episodes about how I write and thinking about reproducibility. That clearly resonated with a lot of people. So if you have a New Year's resolution to be a better writer, hopefully these helped.

One other tip that I would give you, on top of disconnecting your internet and removing all your distractions, would be just to set a timer. There's a name for this, the Pomodoro Technique, after the Italian word for tomato. Set your timer to 15 minutes, try to remove all distractions, sit down, and write. You'll find that even if you're the busiest person, you can write for 15 minutes, and you'll get to the end of 15 minutes and you'll be like, wow, I was really going, I can do another 15 minutes. To tell you the truth, this was actually really hard for me, because I started working on this, fleshing out the outline and writing it, on the Monday after New Year's, when it's just like, ugh, I just went through like a thousand emails, I really don't want to write. But I had my outline and I took it piece by piece, 15 minutes by 15 minutes, and here you go: I've got a first rough draft.

So I'm really excited to be able to share that with you and share this process with you. Thanks again for watching, and I really appreciate that. I know you're here because you're a smart person and you want to learn how to improve your confidence in applying
these skills of writing reproducible research and becoming a better scientist. So stay tuned for the next episode, where we'll talk about how to use our code to insert specific values from our data into a manuscript. This is one of the things that I think is just so cool about R Markdown, and it really has changed how my lab writes its manuscripts. So tell your friends about Code Club, be sure that you've liked this video and subscribed, and we'll see you next time.
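Until those placeholders are replaced with real values, it can help to list where they still sit in the draft. The sketch below is an assumption-laden illustration in Python, not the R Markdown approach promised for the next episode: the regular expression only covers placeholders written as runs of X's (such as "XXX" or "0.xx"), and the `draft` text is invented for the example.

```python
import re

# Assumed placeholder styles, based on how they are spoken in the episode:
# runs of X's such as "XXX" or "0.xx". Words like "blah" would need their own pattern.
PLACEHOLDER = re.compile(r"\b(?:\d+\.)?[xX]{2,}\b")


def find_placeholders(text):
    """Return (line_number, placeholder) pairs for every placeholder in the text."""
    hits = []
    for number, line in enumerate(text.splitlines(), start=1):
        for match in PLACEHOLDER.finditer(line):
            hits.append((number, match.group()))
    return hits


# Invented draft text for illustration only.
draft = """The average number of variants per copy was 0.xx for full-length sequences.
Across all genomes in the database there were XXX variants of the gene."""

print(find_placeholders(draft))
```

An empty result would suggest that every placeholder of this style has been filled in; it says nothing about whether the numbers that replaced them are correct.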