 One of the things I find most tedious about scientific writing is entering numbers into text. I don't trust myself. I know that I'm likely to transpose numbers, goof things up, use calculations that I didn't really mean to use. Well, in today's episode of Code Club, I'm going to show you a way to avoid that, using inline code in an R Markdown document. Hey friends, I'm Pat Schloss, and this is Code Club. In each episode of Code Club, I try to apply principles of reproducible research to an interesting biological question. Over the past 50 or so episodes, I've been trying to investigate the sensitivity and specificity of Amplicon sequence variants, also called ASVs, and operational taxonomic units, also called OTUs. Well, we're at the point we're writing our manuscript, right? And if you've watched over the past few episodes, we've seen that we've expanded from an outline into a rough draft. And as I write, I don't insert numbers and I don't insert specifics, because I find that going and finding those values really slows me down. Well, today we're ready to start inserting those specifics. And we're going to show you how you can use our code in an R Markdown document to insert literal calculations into the text, so that when you see a number, you know that you haven't transposed a number or screwed something up. Also, it provides a level of transparency so somebody can come back later and say, Hey, Pat had this number of 23,000.56. Where did that number come from? And you can take that number all the way back to a raw data file. Sound exciting? Well, I know it's not that exciting to everyone. But for me, it's been huge for revolutionizing how my lab does its research and reports our values, because I feel like it makes it so much more reproducible. So we're going to dig into this draft we've been working on. And I'll show you how we can insert inline R code into the manuscript to populate these values. So we're going to go ahead into our project root directory. As we've done so many times in the past, I'll go ahead and open up my R project file in our studio. This will pop us into our current working directory. If I go to my files, I want to go to my submission, and then manuscript.rmd file. So we are kind of where we left off. After the past two episodes, again, I've got my code, my text here. You'll see that in here, I've got some x's for what I want to plug in specific values. And so this all looks good, like we remember it. We might not get through all of the text today. Also, sometimes what happens is I start looking at the text and looking at my analysis and realizing I don't actually have the data that I thought I had for the manuscript. So we might run into that either in this episode or the next episode, but we'll get it squared away. And so what I'm going to do is at the top of this document, I'm going to create an R chunk. And we've seen this before when we talked about making those exploratory data files. And that's with the three backticks and set of curly braces. And we close out a R chunk with three backticks. And inside the curly braces, we put R, right? So in here, I'm going to put library tidyverse, library knitter, and library here. I remember that library here, the here library helps R to know where we are, our studio or sorry, our markdown to know where we are. And we saw that previously in our, in our exploratory data analysis. So I like to put my code chunk that I'm generating the values out of close to the paragraph where I'm going to use those values. So I'll go ahead and create another code chunk down here. And I'm going to need to read in my metadata and my ASV data and then join those together. I've already done that a few times in the exploratory data analysis. So I'm going to actually go to my exploratory. And I think I had that a good one down here. Kind of when we were doing the lumping and splitting. Yeah, so this looks good. I'm going to go ahead and copy this back to my manuscript RMD and give it the once over to make sure everything looks good. For now, I'm not going to worry about only selecting one genome per species. And actually, I'm going to get rid of that. As well as that final pipe, we might come back and revisit that later. So ESV and then metadata ESV, we're going to join those all together. And if I run these, let's see, could not find function read CSV. And that's because although I put it up here, I didn't actually run these library commands. Okay, so those all look good. And then if we come back down now, I can run these. And that's going well. And let's see, I want it in the console, not inline. I'll remove that output. Okay, so again, the first question is, how many genomes are represented in the R&DB? Right. And so what I can do is that we have metadata. And as you recall, metadata is has a row for each of the genomes in the data set. And there is like 15,614 genomes, right? So we need our to tell our text that are to insert that into our text. So what I can do is that I can replace that series of XXXs. And I can put in two back ticks. And inside the back ticks, I put an R space. And then I can do n row metadata. Okay. And so what that will pop in is 15,614 into that slot. Okay, so to show you what that looks like, I could knit it here. And let's go ahead and knit it. That's fine. This will generate a PDF version of the document. It takes a moment, but it goes pretty quick. And this is outputting it as a word document for some reason. That's fine. For now, we'll get that figured out. You'll see my code chunk here, which I don't really want. But lucky here, we've got 15,614. Looks good, right? So that's effectively what we want. So to get rid of outputting my code, I can tell our markdown to not echo my code. So I'm going to go ahead and close that document. I will also look at my output here. And I'm going to, yeah, that's right. So I did both word document default and PDF keep text true. Okay, that's fine. So what I can do is at the top of my code chunk, I can say echo equals false. And that will keep our markdown from outputting the code. Okay, so that's good. And again, we've got our first bit of information inserted into our text, which is awesome. So the next thing we want to know is each genome had between some number, probably one, and some other number copies of the RN operon. I think what I want to, as we do this also, we realize that maybe we want to put in, we want to word things differently, or we want to calculate different values. So one thing that I want to put in here is perhaps the number of species. And so perhaps I could say, among the, you know, XXX species represented in the R&DB, there were however many genomes. And then I can say period, each genome had blah, blah, blah, okay. Species. And so I need to, I need to plug that number in here as well. And so I can get that by doing, again, back ticks, R space, and then I can say, and distinct, and that will then count the distinct number of species in metadata. So I can do metadata, dollar sign species, that dollar sign is a way to get a column out of metadata. And I'm going to go ahead and change things a little bit and do make submission manuscript.pdf. This will run it through the terminal using our make and doing it this way, it'll render as you recall both the word version and the PDF version. And so I can do open submission manuscript PDF. And we see there's a little bit of output up here that I probably want to turn off. But again, if I scroll down, I see that I have 4,774 species represented, there were 15,614 genomes. Awesome, right? So if we get a new version of the R&DB, we run it rerun everything, it will update those numbers for us, I don't have to worry about, did I update the numbers or not, it's going to be automatically generated from our upstream data. The other thing to notice again, is that that code chunk is missing up here. So I also want to turn off the output, the echo in this top of this top code chunk, right? So I can say echo equals false. I'll run that. And again, if I not open it, but remake it, then it will hopefully get rid of those library function calls in the top code chunk. And again, that goes pretty quickly. And I see that that's gone now, right? And you would never know that there was R code baked into this document. It's so awesome. Okay. Now, let's scroll back down and find the next thing that we need to take care of in our text. And so that is each genome had between x and xx copies of the R&D operon. And so we need to figure out what is that range, and then what are the names of those organisms. So what we can do is I'm going to take metadata EASV and we will pipe it. And so I'm going to work this out kind of like I would if I was working in an R script and trying to do some type of analysis. So then I will do filter region equals equals v19 threshold equals equals zero. This will give me the full length sequences and that threshold. And I want to know the number of copies in each genome, and then kind of the median for each species. We could use the range. But I don't know, I think I'd rather use the median, because sometimes there might be outliers that aren't indicative of that species. I don't know. Let's see what happens if we do the median instead. Or maybe we could do the mode. Let's start with the median. R actually doesn't have a mode function. So you might think about why doesn't R have a mode function. But we'll save that for another time. All right. So again, what this is going to look like is that we've got our genome ID, our species, and then we've got region threshold. And we've got our count up here in our ESV up there. And we want to know the number of copies per genome by species. So we will then do group by species genome ID. And we will then we want to count the number of copies to the number of RN copies, right? So we'll do summarize n RN equals yeah, some sum on count. And again, if we run that, we see that we've got the species, the genome ID, and the count. We can also then do that dot groups equals drop. And then I can do group by species. And I'm going to output ranges. So I will do summarize min copy number equals min, and RN max copy number. And then let's do median, copy, no equals median. Man, those typos are coming fast and furious. And I'm going to put in an n column to know the total number of genomes for that. For that species. And then we can do summarize groups, sorry, groups equals drop. And it looks like I added an extra parentheses up here. All right, so if we run that, for each species, then we get the minimum copy number, the max copy number, the median copy number, and the n. And that is good, maybe I'll get rid of these copy, no things because it's going to be obvious that this is a data frame of copy numbers. So again, that makes it a little bit easier to look at. And we would like to look at the min and the max. So I'm going to call this copy number, RN copy number, we'll run this to generate the data frame. And then this is RN copy number. And I want the minimum, right? And so I'm going to do top n, and maybe I'll do minus one for my n. So that'll give me the smallest by the median column. And that gives me a bunch of things, of course. So maybe what I will do will be to do filter, median equals one, and then pipe that out to give me the one with the most genomes in it. I've got an extra sign there. And so we see that mycobacterium tuberculosis has one copy, 180 genomes, every genome has one copy. And that's a good example to use, because it's the most sequenced species that only has one copy of the 16s genome. And so we'll call this single copy. Okay. And then let's find the max copy, which I think I'll do kind of the same idea, but maybe maybe this will work. But we did try to do before, where I do median. So let's see, what's this equal? So this is metabyssalis literalis has 19 copies in it. And so that, that looks good. And then let's use another example, or that's that's the range, right? So single copy max copy. So what I'd like to do then is to put single copy down in here. So this would be like our single copy. And then what that we want the the species name. And maybe I'll put dollar sign species. Or no, that's the range, right? So this is what goes in here. And this then would be median. I think yep. And then down here, we'll put the same thing. Except instead of single copy, we're going to want max copy, right? Max copy dollar sign species. Let's give that a shot and see what we get. So while this is running a good reminder to be sure that you've liked the video and subscribe to it so you can see the rest of how this manuscript plays out. Let's look at our document here. And if I zoom in, so we see that each genome had between one, e.g. mycobacterium tuberculosis and 19 metabyssalis astralis copies of the RN operon. And so maybe I should say that each, each species had between, um, let's see, let's say each species had between, or the middle, the median number of RN operons per species ranged between. So that kind of reflects a little bit better what we're actually calculating, right? All right. So next, on average, there were so many variants per copy of the full length, um, six nest gene and an average of blah, blah, blah, blah, blah, when considering all these other regions, right? So what we want to do is come back up here. And now what we're going to look at is our, um, let's see, our metadata EASV. And we want to look at not by genome or not by species, we want to look across all genomes at the number of, by genome, we're going to do the number of ASVs divided by, or and the number of copies. And then we're going to add that across all of the genomes within each region. So, um, again, we're going to do filter threshold equals equals zero. And again, we're doing that because we're still focusing on, um, like kind of the true pure Amplicon sequence variant without any clustering, kind of an exact sequence variance, if you will. And we will then do group by region and genome ID. And we can do summarize. And we will then do, um, let's see, N, RNs, N ASVs. So our number of RNs is going to be some on count. N ASVs will be N distinct ESV. And we can then do dot groups equals drop. Give that a run. And so we should then get for each region, each genome, the number of copies and the number of ASVs. So this first one had nine copies of the six chance gene and nine different variants. All right. So now we're going to go ahead and group this not that direction. Yep. By region. And we want to do a summarize. I'm going to call it rate. And we'll do N ASVs. I need some N ASVs divided by some and RNs. And we'll again do dot groups equals drop. Give that a run. And this should give us our four different regions. Excellent. So, um, I'm going to call this rates. And I will then say, um, let's let's make a v one nine rate v three, four rate, v four, eight and v four, five rate. And I'd rather do that up here in the code trunk, rather than down in the text, because all that, you know, long, long lines of code really gets hard to edit when you're trying to check your code and makes it hard to read the art markdown harder. It's already going to be hard with this code, but it's going to make it a lot harder with that information baked in there. So if we do rates, filter, region equals equals view one nine. And then you could do a poll rate. And what this then, uh, because I forgot to load that, get that, then we get 0.597. Okay. And again, the poll gives you a vector output rather than if we use select, you get a tibble back, right? And so I'm going to do a poll to get the actual value. And I'm going to then put in v four, three, four, four, five, and I'm going to call this rate, view one nine, view four, my hands are cold and more of far more typos than normal. So now I've got these values, and I can come back down here. And I can then say, um, back tick and put an R rate view one nine, right? And, and then I can pop that in here. And I can say v four, three, four, and four, five. And then I probably should put respectively at the end of that sentence. Okay, so let's run that and see what things look like. So the more code you put into your markdown document, the longer it's going to take to run. But, but that's still pretty quick. So one thing we notice is that we got the numbers in here, right? The downside is that we've got like tons of significant digits. So we'll come back later and talk about how we can clean up those significant digits to maybe just output like 0.59 or 0.597, 0.25, you know, and so forth. Because so many significant digits really isn't necessary. So we'll hold on to that, and we'll come back to that later down here. Now, we want to know. So we're talking about how right here, as you increase the number of 16 s genes in the data set in a genome, the number of variants also increases. So for example, m tuberculosis generally only had one copy of the gene per genome, but across the XXX genomes that have been sequenced, there were however many. And similarly E. coli had those as well. And I might also want to put that put it in that literalis of parabactyl basillus or whatever it was in there as well, to illustrate for a small number of copies, a medium number of copies, and a large number of RN operon copy numbers, how many variants were there? Okay. And so we're going to need to know the number of genomes, as well as the number of variants of ASV. And so we kind of already have the variants of the number of copies from up here, right? And so maybe I'll save that here. And if we look at that value, then we've got our species name in the left axis, or left column. So I could do filter species, what equals equals. And I'm going to, you know what, I'm going to be good. I wonder why I didn't like that. So my go back to him tuberculosis. So I'll do a single copy dollar sign species. Why aren't you happy? Oh, because I can't spell. All right, so my go back to him tuberculosis. And then what I want to pull is the N, right? So that's the number. So single copy n is that I'm going to leave that all as one line. We'll also do that max copy. And I don't think I need that really in here, because it's only one, I'm pretty sure that's one. And then I'd like to also get out E coli. So we'll do that. So E coli, copy n is that and then we'll do species equals and then equal I copy n 958. Okay, so I'm going to come back down and we'll do single copy n. Let's do let's do single copy, single copy dollar sign species. And again, I need that back tick R, right? Only one copy of the genome per genome. Let's see, and we can pop that in here, and we can do median, right? Single copy median is one. But across the and then we had very forgotten the name, single copy N. Let's see our single copy N. There were however many variants of the gene. And so that's how many genomes there were. But we haven't figured out the number of variants there are. Okay, so we still need to generate that. Where were you? So we need to figure that out still, right? So an E coli had let's do R E coli, copy N. And I suspect I didn't define that up about 958 between one and one. So I will then say, yeah, we have this up R E coli, copy median. I don't know that I defined that. So we'll be sure to do that. And R E coli, copy dollar sign max, I think I meant min here. And then across the R E coli, copy N E coli genomes that have been sequenced there are however many variants of the gene. Okay. And I think I again, I need to define E coli copy up here. We're going to do our N copy number. I already did that. So what's E coli copy N. So what I'm going to do is I'm going to do E coli copy. And let's leave out the poll N for now. And so if I look at E coli copy, that's going to give me my min meet max median and N. And did I screw something up there? No, okay. I just forgot to delete that extra thing. All right. So E coli copy, I'm going to come down and do E coli copy min, E coli copy max, and then E coli copy dollar sign N. Let's save that. And we can then build that up. I must have left E coli copy N in here somewhere. So similarly, E coli typically had E coli copy dollar sign median on that. Hopefully we don't get any errors this time. And again, we look at we see that we've got the numbers populated in here across the 950 E coli genomes, however many variants. And maybe I'll add a sentence that the max copy dollar sign species genome had, let's see, had however many. So this is going to be max copy right of the gene per genome, but across but had, or I'll say and had versions of the gene in its genome sequence. Great. So now all we need to get are the number of versions. So I know this is going a little bit long, but hang with me, we've got this one last bit to fill in. Clearly, we're not going to get to the second paragraph today. That's fine. So what we need to get is the number of ASVs by species. And again, we'll take our metadata ESV. And we want the total number of ASVs for that species. So we'll do that. And then we'll do group by species, and then summarize, and ASVs equals some count, right? And so we'll say ASVs per species, we can then do our dot groups equals drop. And if we do ASVs per species, you know what, I want v one nine. So I'm also going to do add region equals equals v one nine. But I do ASVs per species. And then I do filter. Let's do species equals equals Escherichia Coli. I've got 6713 ASVs. I didn't want some count. I wanted an distinct ESV. And that gives me 1013. Yeah, ASVs. And so this is going to be equal I is that. And I'll do a poll and ASVs. Great. And then we also want the single as well as the max copy. So I'm going to borrow that code and come down here. So it's equal I and ASVs max and ASVs. And again, I'm kind of reusing the code that I've already got. And I'll pull and ASVs on all these. And so if I look at the single, there's 11 copies among those 180 mycobacterium genomes. And then max and ASVs has 17 copies in those 19 versions. And you know, the more I think about it, I don't know that I really want that max sentence that I added. Because the thing about the mycobacterium and the coli is there's a lot of cop, there's a lot of genomes that we have in the database. So I think I'm going to leave that out. Yeah, when I think about it, I think I'm going to remove that. And what I can do then is I can put backslash r. And this was then single copy. I'm sorry, single and ASVs. And then this will also be let's see, this was E coli and ASVs. And I think I forgot to once I've defined it, I forgot to run the line up here. And let's just double check that those look good 2013. And then the one for my mycobacterium was 11. Okay, so I'll go ahead, build this should look good. Kind of rushing through doing this first paragraph. Again, what I really like about this is this gives me the confidence that I'm not introducing typographical errors when I introduce the specific numbers into my manuscript that I have our do it for me. And I see that I've got you know, 2013 here, 11 there, very good. And that was one paragraph that we went through today to populate the values. Sometimes I get manuscripts to review from people that are using an R markdown document. And they've hard coded these numbers into the text that totally defeats the purpose of having an R markdown document. Remember, the value of an R markdown document is that you can have the markdown, but you can insert your R code in there to update it for you. So if between now and when the manuscript is finally accepted, we hope on the R&D be updates the database, I don't really have to touch the manuscript except to make sure that my conclusions haven't changed because any of the numbers have changed, right? But I don't have to insert or change any of the numbers, because those will get updated upstream, because again, this manuscript is going to be dependent on those upstream files. And so that reminds me that I need to enforce that. And I can enforce that, of course, by taking these two file names and making them dependencies of my manuscript. And so if I open up Adam for my project, and I come to my make file, where I am making my manuscript down here, I can add those dependencies, right? So if either of these files gets updated, then it will rebuild the manuscript, which is pretty slick. Okay. So again, I'm just going to do the PDF for now. Of course, when we build the PDF, it also builds the .x file. All right. So I hope this was exciting to you or interesting to you about how we can use our markdown to insert code into our text to then bring in those specific values. Any value you can generate in R, we've already seen names, characters, and numbers, you could insert p values, anything you can insert into the text using R. And so that's why this is so valuable for reproducible research. It's again, you can find that number in the manuscript, you can go to the R markdown document, you can go back to those TSP files, and you can keep going all the way back to the original source of the raw data. It's very transparent and very empowering. All right. So the next episode, we'll take on that second paragraph and maybe learn some new things. Till then, keep practicing, please tell your friends about Code Club. If this stuff interests you and how to use R, but perhaps some of the R syntax is a bit foreign, please be sure to check out riffamonis.org. I've got two full length, 10, 12 part tutorials on learning R using microbiome data, as well as another tutorial using other types of data. Please check it out. Let me know what you think. And again, we'll see you next time for another episode of Code Club.