One of the big challenges I face when I'm doing a data analysis is figuring out where to do the heavy lifting. Sure, there are a lot of steps in an R pipeline that run really quickly and don't take much effort at all. There are other parts that might take hours, maybe days, to run. Those are things we don't want to run many times, right? If something's really quick, I can run it a hundred times and not think much of it. But if it's a beast of an analysis step, I don't want to run it a lot of times. So in today's episode, we're going to talk about that: how do we divide our efforts? How do we divide our code to handle heavy-lifting and lighter-lifting tasks when it comes to preparing our R Markdown documents?

Hey folks, I'm Pat Schloss and this is Code Club. In each episode of Code Club, I try to apply principles of reproducible research to an interesting biological question. Over the course of the past 50 or so episodes, we've been looking at the sensitivity and specificity of amplicon sequence variants (ASVs) and operational taxonomic units (OTUs). A few weeks ago we were working through an exploratory data analysis with our R Markdown documents. Now we're writing up our manuscript and trying to take code that we had in those exploratory R Markdown documents and put it into the R Markdown document we're using to prepare the manuscript. But there were a few steps in that exploratory data analysis that took a long time to run. In fact, they took so long that we only did one replicate instead of doing, say, a hundred iterations. To proceed with preparing the manuscript,
we need to go back and do that heavy lifting. The question, though, is: do we put those steps that might take an hour to run inside the R Markdown document, or do we write them in a script outside of the R Markdown to create a file that we can then read into the R Markdown for our summary statistics, or into another script we can use to generate figures?

I like to do all my heavy lifting in a script outside of the R Markdown. I really like to minimize the amount of data processing that happens in the R Markdown document. To me, the R Markdown document is about presentation. If I need to, say, bold a word, italicize a word, create a section heading, or insert a line break or a page break, I want to be able to regenerate the document pretty quickly. I want to focus on the formatting, not on the compute; the compute I'm going to do outside of the R Markdown document. Then when I render the document, it renders quickly and I can really focus on the formatting, the text, what I'm writing, correcting my silly typos, and things like that.

So today we're going to go back to an earlier set of exploratory data analyses, create R scripts that do those analyses and generate output files, and then see how we can bring that into our R Markdown document to finish off the results section we've been working on in the past few episodes. We'll come to our project root directory, and you'll see we're green; we're all good to go.
And we'll open up our RStudio project file, which sets the project root directory as our working directory in RStudio, so we're in good shape. I'm going to go ahead and create a new R script file. To remind you where we are, I'll come back to my GitHub repository and look at the exploratory directory. The last couple of documents we did, one named something like "threshold to drop ASVs" (I'm not quite sure what that title meant) and one on lumping and splitting, have the code we need to populate the values in that middle paragraph.

So let me remind us what's in that middle paragraph. If I come to submission/manuscript.Rmd (hopefully the scrolling doesn't give you a headache), there's this middle paragraph here. One of the questions we want to answer is: what threshold would we need so that there is only one ASV or one OTU per species, or per genome within that species? Another was: if we use that level of resolution, that threshold, how many species would share an ASV or OTU across species? Let me look at the .md file so we can see the figure that's embedded in here, which covers a variety of copy numbers and the different thresholds used in each region. How many ASVs did we see at the 95% level? So if we look at the 95th percentile of the genomes that have seven copies of the 16S gene, how many ASVs or OTUs would we see? That would be right around here; I guess that'd be about seven and a half here, about seven there. And we can see that for the V1-V9 region, to get down to one, you would need a threshold on the order of about 0.01, 0.02, 0.03, right? And actually, in this analysis we looked at a particular threshold, we looked at four copies, and we then looked at the number of ASVs per species that we would see for different thresholds of defining an OTU.
OK, so this is the information we need to write into a script that will output a big data file we can then parse to populate that paragraph, as well as to make our own figures. We have a lot of work to do today. I'm going to start with this second question, come back up to the top, shamelessly steal my own code, and copy it into my R script. I'll get rid of the library(here) and library(knitr) calls, since I don't need those, and I'll remove the here() calls as well; I'm writing these paths relative to the project root directory, so I don't need here() for that within an R script. I accidentally hit save, so I'm going to go ahead and put this into code/ and call it get_threshold_for_single_otu.R.

All right, so I can then look in here. If I go to code/... hmm, for some reason I put it in the project root directory rather than in the code directory. I must have forgotten to change that, so I'm going to manually move it to my code directory. And now if I look... where'd you go?... under code/ we see get_threshold_for_single_otu.R. We're in good shape, so we'll save that.

You'll recall that previously we pulled one genome from each species. I want to do that again, but I want to do it a hundred times so I can look at the variation caused by picking genomes across many iterations, because it's a random process. We've got our set.seed() call there as well. I'm going to stop calling these ESVs and call them ASVs for now, and I'm also going to move this mutate() line up to mutate the threshold column.
So where the threshold says "esv", I want that to be zero. If I run these lines... I need to run everything; it's complaining, and yes, that should be "esv". OK. This takes a second to read in, and then we can look at the metadata, which has our genome ID and species. We can also look at the ASV tibble, which gives us the ASV name, the genome, the count, the region, and the threshold that was used to define that ASV or OTU.

The rest of this I'm ultimately going to bake into a function, so let's work up to building that function. I'll take the metadata, group it by species, and use slice_sample() to get one genome per species. If we look at this, we see we get 4,774 different species, along with their genome IDs. I can then pipe that into an inner_join(); oops, I've got an extra sign there. Then I can replace the metadata argument with a period, and this combines it all together: we get genome ID, species, and the ASVs. There are multiple ASVs per genome, and it's also partitioned out by region and by the threshold we used to define the ASV or OTU. Good. For example, here we've got two ASVs from this one species and its genome. Very good.

If we then return to what we had done over here: we again had that metadata/ASV table. We grouped by region, threshold, and genome ID to summarize the number of rrn operons and the number of ASVs; we dropped the grouping; and then we grouped by region, threshold, and number of copies and computed the quantile. I put this in the wrong spot, so for now I'm going to remove these comments and be sure to add my pipes. The quantile told us, at the 95th percentile, how many ASVs or OTUs there were.
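The one-genome-per-species sampling and join pattern can be sketched with toy data. The column names, the seed value, and the tibble contents here are my guesses from the narration, not the project's actual files:

```r
library(dplyr)

# Hypothetical stand-ins for the project's metadata and ASV tibbles
metadata <- tibble(genome_id = c("g1", "g2", "g3"),
                   species   = c("A", "A", "B"))

asv <- tibble(asv       = c("asv_01", "asv_02", "asv_03"),
              genome_id = c("g1", "g2", "g3"),
              count     = c(3, 1, 2),
              region    = "v19",
              threshold = 0)

set.seed(1976)  # assumed seed value; the project's actual seed may differ

# Draw one genome per species, then attach each sampled genome's ASVs
sampled_asvs <- metadata %>%
  group_by(species) %>%
  slice_sample(n = 1) %>%
  ungroup() %>%
  inner_join(asv, by = "genome_id")

sampled_asvs
```

Because the sampling is random, the set.seed() call (or, later, furrr's seed option) is what makes the replicates reproducible.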
Again, this takes a couple of seconds to run, and this is the heavy lifting: if I had to do this a hundred times inside my R Markdown document, that would be really painful. What we can see is that we've got a region, a threshold, the number of rrn operons, and the upper bound, where the upper bound is the number of ASVs we saw at the 95th percentile. So, for example, at a threshold of zero we've got that.

I can do a little hack here that I normally do when I'm building out code, which is to run something like x <- .Last.value; .Last.value is the last thing that was output. So if I look at x, I get that result, and I can pipe x into filter(). I could filter on a threshold equal to 0.03, or instead on the number of rrn operons equal to seven, and look at that. What I want to know is: what is the threshold where I get one back? Another way of thinking about that is to filter on an upper bound equal to one. Now, looking at this, I see for V1-V9... wait, what am I doing? Right: for x I've got this, and if I filter to an upper bound of one, this shows me how many I have. And I probably want to do this by the number of copies, so let's group by region and the number of rrn operons, and then summarize with the minimum threshold, because that gives us the minimum threshold needed to get one ASV or OTU for each copy number we have here. So if we're looking at full-length sequences and there are seven copies in the genome, then we would want a 5.5% threshold, right?
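Condensed, those two summarization steps look roughly like this. The column names and toy values are assumptions, and in the real pipeline the per-genome ASV counts come out of the join we built a moment ago:

```r
library(dplyr)

# Toy per-genome ASV counts: at threshold 0 the genomes have many ASVs,
# while at a coarser threshold of 0.03 every genome collapses to one OTU
asv_counts <- tibble(region    = "v19",
                     threshold = rep(c(0, 0.03), each = 4),
                     n_rrnas   = 7,
                     n_asvs    = c(5, 6, 7, 9,  1, 1, 1, 1))

# 95th percentile of ASVs per genome for each region/threshold/copy number
upper_bounds <- asv_counts %>%
  group_by(region, threshold, n_rrnas) %>%
  summarize(upper_bound = quantile(n_asvs, prob = 0.95),
            .groups = "drop")

# Smallest threshold where the 95th percentile drops to a single ASV/OTU
min_thresholds <- upper_bounds %>%
  filter(upper_bound == 1) %>%
  group_by(region, n_rrnas) %>%
  summarize(threshold = min(threshold), .groups = "drop")

min_thresholds
```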
Good. We can then, of course, add .groups = "drop" to get rid of that grouping, and I can continue the pipeline I had up above without using x. This will then output all of this information for all four of my regions and all the thresholds I'm interested in.

Let's make this a function, and we'll call it... what... get_threshold_for_single_otu. Yep. I'm going to create an argument called prob with a default of 0.95, and I can then change the hard-coded 0.95 to prob, so if I decide later that I'd rather use, say, the 90th or 80th percentile, I can change it there. I think in the text of the manuscript I said the 90th percentile, but I think the 95th percentile is probably a bit more robust. So we'll save that.

What we can do, of course, is call map_dfr(); let's test it out with three reps with ~ get_threshold_for_single_otu(), and we'll assign the output to thresholds_for_single_otu and pipe it out. Let's give this a run. I think each call to get_threshold_for_single_otu() takes maybe half a minute or so, so I'm going to time it by wrapping the call in system.time() and see how long it takes. Ah, it's complaining: could not find function get_threshold... Oh, did I forget to run the definition of get_threshold_for_single_otu()? Yeah, let's give this another shot.

All right, that took about 30 seconds or so, which was actually much faster than I thought. If I look at thresholds_for_single_otu and filter to rows where the number of rrn operons equals seven, what we see is that I've got V1-V9 in here three times. So I could arrange by region, which sorts it by the region column, and I see that I've got pretty similar thresholds; the V4 bops around a little bit.
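The replication pattern is just a function piped through purrr::map_dfr(). Here's a minimal sketch, with a stub body standing in for the real sampling, joining, and quantile steps:

```r
library(dplyr)
library(purrr)

# Stub: the real function does the resampling and quantile work and
# returns one row per region/copy-number combination
get_threshold_for_single_otu <- function(prob = 0.95) {
  tibble(region    = c("v19", "v4"),
         n_rrnas   = 7,
         threshold = c(0.055, 0.025))
}

# Three quick test reps; map_dfr row-binds each iteration's tibble
thresholds_for_single_otu <- map_dfr(1:3, ~ get_threshold_for_single_otu())

nrow(thresholds_for_single_otu)  # 3 reps x 2 rows each
```

Exposing prob as an argument with a default is what lets the percentile be changed in one place later without touching the function body.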
And again, that's the value of eventually doing this a hundred times. So what we can do is... for now I'm going to take off that system.time() and pipe this into a group_by() on region and the number of rrn operons, and then a summarize() where I say threshold = mean(threshold)... actually, I'll do a median, and I'll do an IQR on the threshold as well; that'll be the interquartile range of the threshold. And again, if we run this it'll take about half a minute, and we'll see whether we get the right output; then we can scale up to doing it, say, a hundred times.

OK, that went quickly. Of course, I forgot the .groups = "drop" here to get rid of the grouping, but let's look at thresholds_for_single_otu. I see the IQR is very small; then again, it's only three iterations, so it's going to be pretty small.

Something else I would like to have in here is the number of genomes. So perhaps, alongside the minimum threshold, I'll also add n_genomes using the n() function. And here, where I summarize the output, maybe I could also take the median of n_genomes. OK, good.

So if this takes 30 seconds to run... if it's about 10 seconds per iteration and I'm going to do a hundred iterations, that's a thousand seconds, divided by 60 seconds per minute, so it's going to take about 16 minutes to run. Let's see if we can't borrow some of our skills with the future and furrr packages to speed up this map_dfr() step. But you might recall from the episode where I talked about building those ROC curves that RStudio does not like the future and furrr packages.
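The across-iteration summary described above can be sketched with toy replicate output; the column names are assumptions based on the narration:

```r
library(dplyr)

# Toy output from three iterations of get_threshold_for_single_otu()
thresholds_for_single_otu <- tibble(region    = rep("v19", 3),
                                    n_rrnas   = 7,
                                    threshold = c(0.050, 0.055, 0.055),
                                    n_genomes = c(4774, 4774, 4774))

# Median and IQR summarize the spread across random resamplings; the
# genome count should be identical every iteration, so its median is safe
summary_thresholds <- thresholds_for_single_otu %>%
  group_by(region, n_rrnas) %>%
  summarize(median_threshold = median(threshold),
            iqr_threshold    = IQR(threshold),
            n_genomes        = median(n_genomes),
            .groups = "drop")

summary_thresholds
```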
So we need to be able to run this from the command line, and I'm going to put in the shebang line here, which, you'll recall, is this line up at the top that tells bash to run this script using Rscript with the --vanilla option. I need to make some other modifications, but because I use this pattern so rarely, I'm going to borrow from my get_roc_data script. Of course, I had library(furrr) there, which I have here, and there was also some setup I needed further down, which was plan(multicore), so I'll put that here. The other thing I remembered I needed was to set the furrr options to pass in the seed, and I can do that in the map_dfr() call by adding .options = furrr_options(seed = 1976). That all looks good. Maybe I'll put this all on separate lines so it's easier to see on the screen. All right, that's good, and I need to remove the set.seed() from up here, so I'll save that.

I'll come back to my terminal and chmod the script to be executable; that was code/get_threshold_for_single_otu.R. Then I can run code/get_threshold_for_single_otu.R and see how long it takes. Hopefully it's quick.

All right, that ran, but I realized I didn't give it any output. So I'm actually going to get rid of this thresholds_for_single_otu variable and write the result to a file instead. Maybe I'll bump this over a little so it's easier to see: I can do write_tsv() into data/processed/, and I'll call the file thresholds_for_single_otu.tsv. Let's give it another go, and hopefully it'll go quickly here. That took about 30 seconds to run, and if I do ls on data/processed/, the thresholds file is there. Let me open it in nano, actually. We've got our regions, the number of copies, the threshold, the IQR, and the number of genomes... hmm. This does not look right. The number of genomes is clearly not right.
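Pulled together, the overall shape of the command-line script looks roughly like this. This is a sketch: the shebang form, the seed value, and the output path are assumptions from the narration, and the stub function stands in for the real analysis; here the file is written to a temporary directory rather than data/processed/:

```r
#!/usr/bin/env Rscript

library(dplyr)
library(furrr)

# Fork worker processes; plan(multicore) works from a terminal on
# Linux/macOS but not from inside RStudio, which is why this runs as a
# standalone script rather than interactively
plan(multicore)

# Hypothetical stub standing in for the real resampling/quantile analysis
get_threshold_for_single_otu <- function(prob = 0.95) {
  tibble(region = "v19", n_rrnas = 7, threshold = 0.055)
}

# furrr_options(seed = ...) gives every worker a reproducible RNG stream,
# replacing the set.seed() call that the serial version used
thresholds <- future_map_dfr(1:100, ~ get_threshold_for_single_otu(),
                             .options = furrr_options(seed = 1976))

# The real script would write to data/processed/thresholds_for_single_otu.tsv
readr::write_tsv(thresholds, file.path(tempdir(), "thresholds_for_single_otu.tsv"))
```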
Let's go back to RStudio and double-check what we're doing here. Where did I get the number of genomes? I calculate the number of genomes here, but this is probably actually the number of thresholds that had a value of one in the upper bound, and that wasn't what I wanted. I probably want it calculated up here instead, where I computed the upper bound. Oops, so let's put n_genomes = n() with that step, and again we can put these on separate lines so it doesn't scroll off the right side of the screen. And down here in the summarize, we can also take the median of n_genomes; they should all be the same value, but we'll run with that. Then let's run get_threshold_for_single_otu() and see what the output looks like. As always, there are a few fits and starts in getting this all going. OK, we see the number of genomes in here as well, so we're in good shape, and I think this will be good.

Oh, I forgot to change my map_dfr(). Again, if we look at the get_roc_data script, this should have been future_map_dfr(). OK, let's give that a shot. I'm going to set this to 100 iterations, rerun it, and fire it up, and we should be in good shape.

While this is running, I'm going to create another tab and open up my project here in Atom. I'll go ahead and add the rule to my Makefile to build that file. Let me look at data/processed/ to remind myself what I called it. Coming down here, we can say the target is data/processed/get_threshold_for_single_otu.R, and that is going to be dependent on code/get_threshold_for_single_otu.R. It also depends on a couple of data files: the genome ID taxonomy file, as well as the count tibble that we've used for so long now. We can then fire that up in the recipe using our special variable, $^. Oh, I don't want .R for the target;
I want .tsv. And it's not get_threshold; the target is thresholds_for_single_otu.tsv, and I spelled "single" wrong. Ah, so: thresholds_for_single_otu.tsv. We'll save that, and we'll have to come back and fix our typo here and rerun this to get it to work. That file can also then be a dependency for the rule that builds the PDF and Word versions of our manuscript. So we'll save that.

That run took, I think, just a few minutes, so I'm going to test this out with make... what was it?... make data/processed/thresholds_for_single_otu.tsv, and that needs an "e" in there. We'll run that; it might take another couple of minutes. In the meantime, I'll go ahead and remove the stale data/processed/threshold... file, the one without the "e", and we'll be in good shape.

And again, if you look at top, one of the nice things about the future approach is that you can see all 16 processes running on my computer. They're not at 100% because the software I use to capture what's going on on my screen, ScreenFlow, uses some amount of CPU as well, so it's not running as fast as it would if I had nothing else running on my computer. But it's still faster than having only one CPU going at a time. This will take a moment or two, and then we'll be in good shape.

So let's come back to RStudio and return to our manuscript. Right in here, we need to go ahead and put in a code chunk so we can fill in these "X" values that I put in as placeholders. We type the three backticks and {r}, and then close it with three backticks. Remember, we don't need to say echo = FALSE, because in the last episode (be sure you check out the last episode) I showed you how we could set echo to FALSE for all of our code chunks.
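Stepping back for a second, the Makefile rule we just wrote in Atom can be sketched like this. The dependency file names are guesses from the narration (I don't know the project's actual taxonomy and count-table paths), and `$^` expands to all prerequisites, so the recipe runs the script (the first prerequisite) with the data files passed along as extra, ignored arguments:

```make
# Hypothetical rule: rebuild the thresholds table whenever the script or
# its input data change
data/processed/thresholds_for_single_otu.tsv : code/get_threshold_for_single_otu.R\
		data/references/genome_id_taxonomy.tsv\
		data/processed/count_tibble.tsv
	$^
```

Because the script carries a shebang line and is executable, the recipe doesn't need to invoke Rscript explicitly.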
A variable called thresholds: it'll be read_tsv(), and it'll read... I really do mean the relative path here... data/processed/ and then the file name. Let's see if the run is done yet. Not quite, but I'll go ahead and copy and paste that in, and we'll wait and see what the data look like once this finishes; it should just be another moment or two.

Great, that finished running, so we can come back to our RStudio session. Remember that we can run all of the previous code chunks by hitting this icon here, the one with the down arrow pointing to the horizontal line. Running all of those might take a moment. Good, and we can then look at the read_tsv() output here, and that all looks good.

Something we now want to do is insert values from that data frame into the text. What we've got here is a passage about the propensity to lump species together as well as to split things apart. So maybe I'll say "to split species apart"... "to split a genome apart"... "to split a genome into separate"... yeah, "apart", or "to lump species together". OK. "I identified the threshold where 95% of bacterial species would be represented by a single OTU. For the full-length 16S, I found that at a threshold of X, 95% of the species would be represented by a single OTU." I feel like I'm repeating myself here, right? So let's get rid of that, because it was kind of redundant: "For full-length sequences, I found that at a threshold of X, 95% of the species would be represented by a single OTU. Similarly, blah blah blah." Good. OK.

What I want to do, though, is this: because we really have the data for many numbers of copies, I'd rather correct that typo but also indicate a number of operons per genome or per species. Maybe I'll say, "For full-length 16S rRNA sequences, I found that at a threshold of X, the species with, let's say, seven copies of the rrn operon would be represented by a single OTU." So we can constrain it, and you know what?
Maybe we'll use those data in a figure, and then we could point readers to the figure, but I'm not worrying about figures right now; I'm worried about getting these values in, right? So if we look at thresholds, let's filter it to the rows where the number of rrn operons equals seven. Now if I look at thresholds, I get four values, one for each region, and I'm interested in the threshold value. You can see the IQRs were very tight, so not much difference there. So I can say threshold_v19 is thresholds, filtered to region == "v19"; I'll then pull the threshold and multiply it by 100 to get it as a percentage. We talked about that in the last episode, and that comes out to 5.5% as the threshold. We can copy and paste to do V4, V3-V4, and V4-V5, and fill all those in.

And again, I created a special function in the last episode to format those, which is down below here (or at least I call it down below), called format_pct. OK, so let's come up here and insert those values. We can put that here, and again, that's threshold_v19... let me grab the whole inline code chunk... and this one needs to be V4, V3-V4, and V4-V5, and I'll add a "respectively" here. Oh, I've got to spell it right.

Good, so we've got that information inserted. There is still this percentage that I want to come back to, probably in the next episode, to finish off our results section. So let's go ahead and make our submission/manuscript.pdf and make sure those values get inserted into the document. But what you'll notice, again going back to that issue of heavy lifting, is that the heavy lifting that took a couple of minutes to run to generate my thresholds table would be slowing this render down even more. Instead, what we can focus on in the manuscript is writing the manuscript and inserting the values into it, rather than worrying about the heavy lifting of the compute. So we'll open this up.
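As an aside, the pull-and-format pattern from this section can be sketched like this. The format_pct helper is reconstructed from the narration, so the real function from the last episode may differ in its details:

```r
library(dplyr)

# Hypothetical reimplementation of the formatting helper from the last
# episode: round to one decimal place and keep a trailing digit
format_pct <- function(x, digits = 1) {
  format(round(x, digits), nsmall = digits)
}

# Toy thresholds table; the real one is read from the TSV the script wrote
thresholds <- tibble(region    = c("v19", "v4"),
                     n_rrnas   = 7,
                     threshold = c(0.055, 0.025))

# Pull one region's value and convert the proportion to a percentage
threshold_v19 <- 100 * (thresholds %>%
                          filter(region == "v19") %>%
                          pull(threshold))

format_pct(threshold_v19)  # used inline as `r format_pct(threshold_v19)`
```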
Let me make this a bit bigger so you can see it more easily, and we'll come down to that paragraph. There's some output here that I need to get rid of, but you can see that I've got my two and a half, four, and three and a half percent. Very good.

Now I need to silence these column specification messages, and we can do that pretty easily. If we come back up here, I can get a sense of what the specifications were, because read_tsv() outputs them to the screen. I can add... that doesn't need to be a period, it needs to be a comma... col_types = cols(). Really, it's only region that needs to be col_character(); everything else can be col_double(). So I'll say region = col_character(), and then we need .default = col_double(). Do I have all my parentheses? No, one more there. That should get rid of the message for me. And again, because I have a Makefile, I can just run make and it's very easy to re-render, and then I can look at the output here in the PDF once it's completed.

If we scroll back down (I think we were there already before I moved it), we now see that the chunk output where it told us the column specifications is gone, and we have the nice formatting of the thresholds needed to get one OTU per species. The flip side, which we'll talk about in the next episode, is the percentage of species that shared an ASV or shared a 16S sequence between them. So we'll talk about that in the next episode.
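The col_types fix takes this shape. The column names are assumptions, and the literal string wrapped in I() stands in for the real data/processed/thresholds_for_single_otu.tsv path:

```r
library(readr)

# Inline data standing in for the TSV file the script wrote
tsv_text <- "region\tn_rrnas\tthreshold\tiqr\tn_genomes\nv19\t7\t0.055\t0.0025\t4774\nv4\t7\t0.025\t0.0025\t4774\n"

# Declaring the column types up front suppresses the parsing message:
# region is character, everything else defaults to double
thresholds <- read_tsv(I(tsv_text),
                       col_types = cols(region   = col_character(),
                                        .default = col_double()))

thresholds
```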
So, in conclusion, one thing I would highlight for you is this: let me, for a moment, remove submission/manuscript.pdf and build it again. As this flows through, you'll notice that it slows down in different spots, and depending on how much you're in a rush and how much you're rendering and editing your document, those pauses in the earlier code chunks may annoy you; you may want to speed those up. So it's a trade-off. There are obvious things like the compute we did today that took a couple of minutes to run; you don't want to sit and wait for that while you're fiddling with something like silencing code chunk output, as we just did. Whether you're willing to wait five seconds for a step to get through is ultimately up to you, and so there is a trade-off between how much compute you do within the R Markdown document versus putting it into a script outside the document.

One of the other nice things about putting it in a script outside the document is that now I've got that output file, and I can use it in another R script to build a figure to show the data. So again, that's also the value of doing those big computes outside of the R Markdown: the data become available for other applications.

All right, so a lot of review today, but I think it's useful to see that review in a different context, with a different application, as we saw today. In the next episode we'll come back and talk a little bit more about redoing those computes for lumping and splitting, and because that's going to be heavy lifting, we'll also put it into a separate R script. Well, thanks again for spending some time with me to watch this project unfold. We're getting to the end; trust me on that.
I really appreciate the time that you spend. I know you all are busy and have a lot of other stuff going on, and I hope you get something out of these. If you do, please be sure to leave a comment down below; I'd love to hear what you're thinking about it and what questions you have. I'd love to answer those in another episode. Of course, please tell your friends about these Code Club episodes, and we'll see you next time for another episode of Code Club.