If you spend any amount of time reading scientific papers, you'll quickly notice there are two different strategies for writing the captions that accompany a scientific figure. In the first approach, the scientist merely states what the figure is. It's very passive and it doesn't tell you much. In the second approach, the scientist is much more active in telling you what you should see in that figure. Which do you prefer? Well, today we're going to use that second approach, where we tell our reader what they should see in the figure as well as giving a description of what they need to know about the data.
Hey folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, we apply principles of reproducible research to a biological question. Again, the biological question really isn't that relevant to the overall issue of learning these strategies to improve the reproducibility of our research. Today, we're at the point of our analysis where we're working on the manuscript and we're ready to put in some captions, some figure legends, to describe the figures that we're presenting. We also need to do some other modifications along the way so that we have our figures in the right place and in the right format.
I'm going to go over to my project root directory here and fire up my RStudio session. I'm now going to go to my manuscript.Rmd, and down here at the bottom I've got my figure captions. What I originally put in here was really just a rough start. Because no one likes watching another person type, I've already gone in and pounded out some text to make figure legends for these figures. Let me go ahead and grab each of these figures. The first figure is esv-rate.pdf, and you can see the figure here. What we see is the number of genomes on the x-axis. Each dot represents a different species.
The y-axis represents the number of ASVs that were observed per copy of the rrn operon. And then we've got the four different regions depicted with the four different panels. So what did I say about this? "The ratio of the number of distinct ASVs per copy of the rrn operon increases for a species as the number of genomes sampled increases." Okay? So this is what I want you to see in the figure: as we sample more genomes, we see more ASVs per copy of the rrn operon. It's not like we're plateauing as we sample more. It continues to increase. So that's the declarative statement. That's what I want you to see.
Now, the other information that the reader might need is also here in the figure legend. Sometimes people write entire paragraphs as figure legends. I'm not such a fan of that because a lot of that information is also over in the methods section, and I don't know that it really needs to be here and written out so verbosely. I guess there's a bit of a trade-off in how independent each figure and its legend should be. If I pulled out that figure and looked at it on its own, should I be able to interpret it without having to read the rest of the paper? I think there is a trade-off there, but I don't think you need to embed your methods section in the figure legend.
So what did I say? "Each point represents a different species and is shaded to be 80% transparent," so that was the alpha equals 0.2, "so that when points overlap, they become darker. The blue line represents a smooth fit through the data." Okay? Very good. And again, the declarative sentence, the initial part of the figure caption, will be bolded. If you look at papers published in mSphere, which is where I plan on submitting this, you'll see this is the general format that people are using.
For Figure 2, then: "As the distance threshold used to define an OTU increases, the fraction of genomes split into separate OTUs decreases while the fraction of species that are merged into the same OTU increases." Let's look at that figure so we know what we're talking about. That was the lump-split PDF. So again, as we increase the threshold going across the x-axis, the fraction of genomes that are split apart decreases, and the fraction of species that are merged together increases, right? So what else do we say here? "These data represent the median fractions for both measurements," the fraction split and the fraction lumped, "across 100 randomizations." I think the spell checker likes "randomization" but maybe not "randomizations." I don't know, I have that red thing underneath it. "In each randomization, one genome was sampled from each species." Okay? So again, looking at this figure, I think that's a pretty good caption. It's declarative about what I want you to see. And I don't know that the variation between the four regions is really all that big of a deal. But anyway, I think that works pretty well.
For Figure S1, the supplementary figure, this is the ROC curve. If I go ahead and open up the ROC curve PDF, I'll make this bigger for you all to see. "Distance thresholds larger than 3% provide better sensitivity and specificity when assigning ASVs to OTUs when trying to represent species-level classifications." Okay? So again, this goes back to something that's stated in the methods and in the results section about how we define sensitivity and specificity, and we have the 3% threshold here on our ROC curve. If we look at close-to-optimal classification, that's close to this point: perfect specificity, perfect sensitivity. And then this is equal sensitivity and specificity, and what threshold gives us that.
So again, this is a supplemental figure, because I don't know that the sensitivity and specificity values themselves are so critical as much as the threshold that gets you there. And those thresholds are in the results section paragraph. This is a supplemental figure showing people what the ROC curve looks like. So what else do I say? "The sensitivities and specificities for ASVs were blah, blah, blah, blah for the V1-V9, V3-V4, V4, and V4-V5 regions, respectively." I say that because those points are way down off the screen. If we zoomed out, we wouldn't see as much separation up here, and all we would see is that ASVs are kind of crappy at depicting species. So we need to plug in these values. I feel okay about zooming in as long as in my legend here, my caption, I indicate those sensitivity and specificity values. So we'll have to come back and get those.
"The gray diagonal line represents the position where the sensitivity and specificity were equal." Again, that's this gray line. "These data represent the median fractions for both measurements," for both sensitivity and specificity (I think I copied and pasted from the previous one), "across 100 randomizations. In each randomization, one genome was sampled from each species." Okay. Good.
So again, we can then make the submission manuscript PDF. I can open up my manuscript, and if I scroll to the bottom here, I now see these expanded figure legends, figure captions, whatever you want to call them, that go with each of my plots. And I think they look pretty good. This one goes onto two lines. It's not the end of the world. I suppose I could make this one a little bit smaller, but whatever, it doesn't matter for the figure caption. Although that kind of annoys me, so maybe I will go back and modify that here. Ultimately, you know what, no, I'm going to let it go, because what's important here is that this is the figure size that's going to be published.
There's no need to shrink it down any further. I think that'll be okay. All right, so I won't worry about that.
What I do need to worry about, though, is plugging in these values where I currently have X values. We've already read in the ROC curve data up above. If I come back way up here to the sensitivity and specificity, I'm going to go ahead and run all the code chunks to this point. Then I will run this code chunk so I can get the sensitivity/specificity data frame to work with further down in that supplemental figure caption. So I'll run this, and this now gets me my sensitivity/specificity data frame, where I have my regions, my thresholds, my sensitivity, and my specificity. I'm going to bring that back down here, and I will make a code chunk. Now I want to work up these values.
What I will do is say `filter(threshold == 0)`. And I misspelled threshold, of course, because that's how I roll. So these are the four values. I want to make sure that my sensitivity and specificity are rounded to three digits to the right of the decimal point. Sometimes a value gets output as 1 when it's really like 0.995 or 4 or something. I doubt it's actually one, but we want to see what it is to three digits to the right of the decimal. As we've seen before, we can do `format()` inside of `mutate()`. I will call this sensitivity, and we will then say `format(sensitivity, digits = 3, nsmall = 3)`. If we look at that, we see that this has now turned into a character column. We'll copy this down to make specificity. Right. So we run that, and now we see that we do get three digits to the right of the decimal. And it's 1.000. Sure, it's not perfect, but close enough. We're not going to represent all the decimals out there. In the caption, I've described each of these values as the sensitivity and specificity.
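The filtering and formatting steps just described can be sketched roughly as follows. This is a minimal sketch, not the code from the actual manuscript: the data frame name `sens_spec` and every number in it are made up for illustration, and only the column names (`region`, `threshold`, `sensitivity`, `specificity`) and the `format()` arguments follow the narration.

```r
library(dplyr)

# Hypothetical stand-in for the sensitivity/specificity data frame.
# All values are invented; ASVs correspond to a threshold of 0.
sens_spec <- tibble(
  region      = c("v19", "v34", "v4", "v45", "v19", "v4"),
  threshold   = c(0, 0, 0, 0, 0.03, 0.03),
  sensitivity = c(0.25, 0.5, 0.625, 0.75, 0.875, 0.9),
  specificity = c(0.99996, 1, 0.9992, 0.9987, 0.95, 0.9)
)

asv_formatted <- sens_spec %>%
  filter(threshold == 0) %>%  # keep only the ASV rows
  mutate(
    # format() returns character strings; nsmall = 3 guarantees at least
    # three digits to the right of the decimal point
    sensitivity = format(sensitivity, digits = 3, nsmall = 3),
    specificity = format(specificity, digits = 3, nsmall = 3)
  )

print(asv_formatted)
```

Because `format()` converts the columns to character, this step is best left until the values are only going to be displayed, not used in further arithmetic.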
I wrote the sensitivity and specificity with a diagonal, a forward slash, between them, so I will now make a `pretty` column where I do `paste()`. And again, I said sensitivity/specificity, so sensitivity and then specificity, and I'll do `sep` equals that forward slash. Now I see I have this `pretty` column, and that looks great. I will call this `asv_sens_spec`, and now I need to build out individual variables that I can put into the caption. I will say `v19` is `asv_sens_spec`, and then I'm going to do a filter for v19 and pull out the `pretty` column. So I'll do `filter(region == "v19")` and then `pull(pretty)`. Then if I look at `v19`, it should be the sensitivity and specificity for v19. Okay, so v19, 34, 4, 45. 34, 4, 45. That looks great.
If I build those, I can now come back down here. Because I've already done the formatting with the forward slash, all I really need to do is put in the inline code `r v19`, and it will plop that value in there. And here. And of course, I need to put in the actual variable, so this will be 34. I want to make sure I get the order right, corresponding with the listing of the regions at the end of the sentence. Okay. So I think that's good. Let's go ahead and build that, and we'll see if our figure legend is complete. Zooming in on this, we see that sure enough, we have those values in there. They're bolded for now, just so we know where the output of the inline code is being inserted. Okay, great. I feel pretty good about these figure captions.
The other thing we need to do, of course, is call out to these figures in the body of the manuscript. So if I come back up to this, remember, there are three paragraphs in the results section. This was the first here, and we already have a callout to a figure.
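As a recap of the inline-value step above, building the `pretty` column and the per-region variables might look something like this. It is a sketch under assumptions: the spellings `asv_sens_spec`, `v34`, `v4`, and `v45` are inferred from the narration, and the input values are invented placeholders, not results.

```r
library(dplyr)

# Hypothetical, already-formatted ASV values (invented for illustration)
asv_formatted <- tibble(
  region      = c("v19", "v34", "v4", "v45"),
  sensitivity = c("0.250", "0.500", "0.625", "0.750"),
  specificity = c("1.000", "1.000", "0.999", "0.999")
)

# Join sensitivity and specificity with a forward slash
asv_sens_spec <- asv_formatted %>%
  mutate(pretty = paste(sensitivity, specificity, sep = "/"))

# One variable per region, ready for inline code in the caption
v19 <- asv_sens_spec %>% filter(region == "v19") %>% pull(pretty)
v34 <- asv_sens_spec %>% filter(region == "v34") %>% pull(pretty)
v4  <- asv_sens_spec %>% filter(region == "v4")  %>% pull(pretty)
v45 <- asv_sens_spec %>% filter(region == "v45") %>% pull(pretty)

# In the R Markdown caption text, each value is then inserted with inline
# code such as **`r v19`**, listed in the same order as the regions.
print(v19)
```

Doing the slash formatting once in the data frame keeps the caption text short, since each inline expression is just a variable name.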
I'm going to call this Figure 1, right? So that describes, "as the number of copies of the operon per genome increased, the number of ASV variants per genome also increased," Figure 1, right? Actually, that's not there. That's describing this next sentence. Down here actually is the figure: "although a species tended to have a consistent number of 16S copies per genome, the total number of variants increased with the number of genomes that were sampled," Figure 1. Okay, good. So we've got that.
The other thing that's important to keep in mind is that the figures need to be presented in order. We don't say Figure 2 and then Figure 1; we do Figure 1 and then Figure 2. I know that's obvious, but you'd be surprised how many people slip up on that.
All right. So down here, then, we have "a method to avoid splitting is to cluster sequences together that are similar." All right. "However, this also increased the risk of lumping genes together from different species that were similar to each other." And so we can then say Figure 2, right? I'll put this in bold, like so. I think that's good; that sets up the rest of the paragraph. Hopefully you can see that the rest of the paragraph describes some of what's going on in the figure. Actually, I think I need this further down, because that sentence is kind of describing something that I think is a little obvious. So now we need to place Figure 2, and I think Figure 2 will come further down. Let's see. "At thresholds of blah, blah, blah, blah, blah, blah," that's where we then saw one OTU for genomes that had seven copies. "However, at these higher thresholds, multiple species could be represented by the same OTU. At the highest level of resolution," some of them "shared a sequence variant with another species," blah, blah, blah. "At the commonly used 3% threshold," some percentage "of the species shared an OTU when considering full-length sequences," and blah, blah, blah. Okay. All right.
So what I'm going to do here is put Figure 2 with my double stars to make it bold. And at the highest level of resolution, yeah, so that's good. Now we need to insert the callout to supplemental Figure 1. Let's see, we talk about the confusion matrix, and all of that is kind of in our methods. And we talked about the receiver operating characteristic curve, so here we can put in Figure S1. Okay, and we'll save that. Now we've got the callouts to all of our figures, and we are ready to build that and take a look at the manuscript. If we come back up to our results section here, we'll see that we have the callout to Figure 1 here, Figure 2, and then supplemental Figure S1.
So again, besides getting the numbers in the right order, the other thing is to make sure that all of your figures are actually included. There are some fancier things you can do within LaTeX, which is frankly kind of beyond me, so that you don't have to worry about the numbering of your figures and that's all done automatically, so to speak. I don't know how to do that. With three figures, it's not that big of a deal. This is a pretty simple paper, and I'm not so concerned about making those links. But know that if you do a little bit of digging, you can figure out how to make a link between the figure number in the text and the figure at the end of the paper.
The next thing I want to do is return to our Makefile. We had some problems with our TIFF files just being gigantic, like close to 10 megabytes. For whatever reason, `ggsave()` wasn't compressing the file. I also need to port over those files to the manuscript directory, or rather to the submission directory. So I'll write a rule for the Figure 1 TIFF in submission, and I'm going to say that it is dependent on the ESV rate TIFF in figures.
I'm going to use a command called `convert` from a package called ImageMagick. If you do `convert -help`, it gives you a very large readout of documentation. You can get ImageMagick through Homebrew on a Mac. I think if you do `brew install imagemagick`, you'll get it. The command we're going to use is called `convert`, and it allows us to compress our image file. So we can do `convert -compress LZW`, and then the input and the output. Right. I could use the automatic variables for make, but I won't bother. We will do this for our other two figures, Figure 2 and Figure S1. So this is the lump-split TIFF, and then this is the ROC curve TIFF. And I'm then going to make these the submission Figure 1, Figure 2, and Figure S1, and those are going to be TIFFs. I'd like to see if we can get rid of having those PDF versions of the files.
Let's go ahead and make one of these first just to make sure it works. We will do `make` on the submission Figure 1 TIFF. That ran, so if I do `ls -lth` on the submission Figure 1 TIFF and then the ESV rate TIFF in figures, I see that, yeah, the compressed version is like 500K versus close to 10 megs. If I open those, we'll see that the figures are the same, right? There's no distinction between them. So that worked nicely. And I can again do `make` on the Figure 2 TIFF (no, not one, two) and the submission Figure S1 TIFF. That'll build all of those. And if we look at the TIFFs in submission, we can see... I think I forgot to do something. It all worked. Ah, I should have used the automatic variables. Maybe I'll go ahead and do that, because basically I copied and pasted but didn't update it all the way down. So again, to get the target, it's `$@`. And to get the dependency, the prerequisite, we can do `$<`. And that should work.
If we make all those things, that builds those out. Now if I look at my TIFFs, I see they're all quite small. What I'd like to do, if I come back to my figures in the manuscript, is see if I can put in the TIFFs from submission. Actually, I don't need to write submission because I'm already in submission. So I should be able to just put the Figure S1 TIFF, and then this will be the Figure 2 TIFF, and then this will be the Figure 1 TIFF. Let's give this a go. Hopefully the TIFFs can be embedded into that file. We can make the submission manuscript PDF. So it balked at the TIFF. Anyway, it's not happy with that. I'll go ahead and put those PDFs back in there. That's fine. Again, it's only for the initial submission that we can include the figures inside the R Markdown document.
I'll also go ahead and add those targets to this rule for now. So we'll do the ESV rate PDF in figures, the lump-split PDF in figures, and then the ROC curve PDF. Okay, so now when we build, it should hopefully work. So that worked. If we look at our manuscript and zoom in a bit so it's easier for you all to see, you can see that the figures are embedded in here, and they look good. Okay, so this all should work. I don't know why `ggsave()` didn't work with the compression; that is still a little bit frustrating to me. But that's fine. It's awesome. I'm sure there's also a way to embed TIFFs inside of a PDF. I just don't know it. But again, this is good enough. This works. I don't want to sweat that.
If we look at our manuscript, we're getting there; we're getting pretty close. Everything is here except for the abstract, the importance section, and the references. So we need to insert the references, we need the abstract and the importance section, and we need to do a fair amount of editing before we're ready to submit the manuscript. But that's what we're going to do next.
I'm not sure which I'll do next, whether I'll insert the references or write the abstract and importance section. Needless to say, when we insert the references, it's probably going to be a bit different from how you're used to inserting references into a manuscript. But it will work well with R Markdown, and it'll be pretty slick. You'll see.
Anyway, I hope you found this useful in thinking about how we write figure legends and the different strategies you can take. I know people are of two minds about whether to be passive or more active in how they describe their figures. Again, I prefer active. People think data speak for themselves, but the data do not speak for themselves. We give voice to the data. That's why I think it's important to give a more active presentation of the figure in those figure captions. Anyway, let me know your thoughts down below. Are you passive or active? Do you write a whole paragraph of text for each of your figures? What do you think, and what's your philosophy? Do you just kind of go with whatever people around you do?
Hopefully you found this discussion and demonstration useful. Keep practicing. Please be sure that you've liked this video so that I know what you find interesting versus what you find boring. I don't really have a lot to go off of if you don't give me some interaction here. Anyway, please tell your friends about Code Club, and we'll see you next time for another episode.