It's great to be back with you all after a couple of weeks away. In the last episode of Code Club, we learned how to write an executable R script. That script converted a very wide data frame with more than 15,000 columns into a very long data frame with only three columns. R struggles with wide data frames, and we saw that when we ran our script: it took a minute or so to read in the wide data frame, but only a second or so to read in the same data from a tidy format. Most people would be satisfied that they had the data in a better format and would move on with the rest of the project. Not me. I'd like to use this example to show you how to learn which steps in your code are bottlenecks. I'll also get you to think about the trade-off between your time, the programmer time, and the time it takes your code to run, the execution time.

One of my pet peeves is blog posts of people ripping on R or ripping on Python. They'll pull out some test, like how long it takes to calculate the mean, and run it using implementations in various languages. Because they don't know how to write code in the language they're trying to bash, they often present an unfair comparison. Also lost in the conversation is that the differences are relatively minor compared to the cost of learning the favored language. If you can spot these blog posters' straw man arguments, then you likely already understand the strengths and weaknesses of the languages you're using. That's great, because then you can avoid those weaknesses when you write your own code.

R is designed to be efficient to write. For example, if I want to read in a file, I can use the read_tsv function, whereas in another language, like C++, I would have to write a whole bunch of code to do the same thing. The trade-off is often execution time: the R version would likely be considerably slower to run than the C++ version. Thinking back on that example of how long it takes to calculate the mean of a bunch of numbers, you can write your own function in R to calculate the mean, but it will likely be slow. For examples like this, the differences we're discussing are on the order of micro- or nanoseconds. Alternatively, you could use R's built-in mean function, which is actually written in C. That will likely be about as fast as doing it directly in C or C++ rather than in R. It's way beyond the scope of what we're trying to do here, but you can also use a package called Rcpp to write C++ code in R to optimize the steps that are too slow for your needs.

This all brings up an important point that I mentioned earlier: a lot of your R code may only be run a few times. For example, the code we wrote in the last episode read in the wide data frame and output it as a tidy data frame. It took maybe two minutes to run. Say I have to run it five times over the course of debugging and executing the project. That's 10 minutes total. Maybe I run it 10 times; that's still less than half an hour. How long is it going to take you to figure out how to speed it up? Maybe an hour or so? Is it worth going from two minutes down to 11 seconds if it takes an hour or two to refactor the code? Sometimes. Today I'm going to spend a bit of time showing you alternatives to functions like read_tsv and pivot_longer that are far faster but have a less intuitive syntax. These functions come from the data.table package.
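To make that mean example concrete, here's a minimal sketch of the kind of timing comparison I'm describing. Using the microbenchmark package is my assumption here; bench or even system.time would work just as well. The hand-rolled loop runs in interpreted R, while the built-in mean() drops down to compiled C code.

```r
library(microbenchmark)

# A naive mean written in pure, interpreted R
my_mean <- function(x) {
  total <- 0
  for (value in x) {
    total <- total + value
  }
  total / length(x)
}

x <- runif(1e5) # 100,000 random numbers to average

# Time both implementations over repeated runs; expect the built-in,
# C-backed mean() to be far faster than the interpreted R loop
microbenchmark(
  pure_r  = my_mean(x),
  builtin = mean(x),
  times   = 100
)
```

The exact numbers will vary by machine, but the gap is the point: the language you write in and the language the work actually happens in aren't always the same.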
I'd say it's worth learning these data.table alternatives if you're going to be working with really large or really wide data files. I say this because the next time you run into this scenario, maybe you'll remember these alternative functions and be able to pull out these tools to make your code more efficient. But if the improvement you're trying to make is some slick programming trick that you're unlikely to need again, then I'd say: maybe it's slick, but has it gained you any time? Beyond seeing how to use some new functions from data.table, you'll also learn how to profile your code so that you can find the bottlenecks you might want to refactor to make your code more efficient. Even if you're only watching this video to learn more about R and don't know what a 16S rRNA gene is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along with previous episodes but would like to, welcome! Please check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental materials. The link to the blog post for today's video is below in the notes.

One of the reasons it's been a couple of weeks since my last episode is that I had some technical difficulties upgrading my operating system from, I think it was, High Sierra to macOS Catalina. I also hurt my hand, and that kept me on the sidelines for a few days. So anyway, it's great to be back with you. We might run into a few snafus as we go through here, all part of customizing our work environment to this new operating system. Catalina, I find, asks a lot more questions related to security and things. It also comes with zsh as the default shell rather than bash, so I had to go through a few steps to make bash the default again. If you have questions about how to do that, go ahead and leave me a comment below, and I can show you the blog post I consulted on how to use bash rather than zsh. Maybe someday down the road I'll switch to zsh, and we can do an episode on the benefits of zsh over bash.

Anyway, I'll use my rrn alias to start our project. We see we're on the master branch. I'm going to go ahead and use Atom to open up my directory. You'll see that when I installed new things, I got a nice shiny new version of Atom, so things look a little bit different. The code we had been working on is here in code/convert_count_table_to_tibble.R. Again, this was my R code. The challenge is that this chunk of code took, I think, two or three minutes to run for the V19 region. What I'd like to do today is show some of the tools we can use to profile this chunk of code to see where the speed bottlenecks are, and then give you a couple of other tools you can use to actually speed it up in place of things like read_tsv and pivot_longer. The spoiler is that read_tsv and pivot_longer are actually really slow. They're easy to use, but they're really slow. To get to here, you might recall that it was, I believe, in count_unique_seqs that we call code/convert_count_table_to_tibble.R. So I need to generate the count table, and the count table is what I need to generate. Is that the only input I need? Well, I also need the output name. So for the input file, I'm going to use data/v19.
So instead of trying to remember what the actual commands and names are, I'm going to go ahead and rerun all of this. Instead of giving this $1 in my bash terminal, I'm going to say target= and then that data/v19 file name, and paste that in there. Then, when I come back to here where I've got stub and all this stuff, I can rerun all these lines by copying and pasting them into my terminal. This takes a couple of minutes itself to run. While it's running, this would be a great time to be sure to click on the subscribe button and click on the bell so that you're notified when the next episode is released. This will take a moment or two to run; I'll fast-forward through it in my editing, and we'll pick up and see what the temp files are that we can feed into R.

Before I get going too far, I want to create a new issue for optimizing and speeding up that code. So we want: "Profile and optimize conversion of count table to count tibble file." Experiment with using profvis (I think it's one word) to identify bottlenecks, and then implement a solution to speed up the code. I'll submit that as our new issue, and this will be issue number 20. I'll go ahead and check out a branch for issue 20: git checkout -b issue-20. Great, and we're on that branch. We see from the output of mothur that the temp file coming into our R script for input_file is data/v19/rrnDB.temp.count_table.

So what I'm going to do is fire up RStudio. Because, if you recall, we created this .Rproj file, I can open it, and this will start RStudio with my current working directory being the project root directory. We can double-check this by running getwd(), which tells us our current working directory. I can then also open code/convert_count_table_to_tibble.R, and we have our file here. What I'm going to do is run different lines of this. Again, this was an R script. Remember, we have this shebang line at the very top, but we also have the args and then these lines 15 and 16 that convert those args to variables for input_file and output_file. I'm not going to run those lines in this interactive session within RStudio, but I certainly want to run library() to get that going. For now, I'm going to comment out those two lines. My input file will be that temp count table, and my output file will be the same thing, but I believe it's going to be the tibble. Let me double-check in my makefile what I was setting: yeah, so it's rrnDB.count_tibble. We don't want the temp, we want the count tibble. And I forgot where I am. Okay, back in RStudio. So the output is going to be rrnDB.count_tibble. I'll run these two lines now, and those are loaded as my input and output file names.

We know that this chunk of code takes a couple of minutes to run. I want to know how long it takes overall and how long each of the individual steps takes. There's a package we can use for this called profvis. To install it, if you come over to the Packages pane and look for profvis, you'll see over here (my font size is a little big so you can see it more easily) that I have something here; the pop-up under the cursor says profvis. If you don't have profvis here already, you can click Install, type profvis, and then click Install. Again, I already have it installed, so I'm good.
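As a quick aside, before we wrap anything in profvis, here's a hedged sketch of the script header we just talked about commenting out. The variable names and example paths are my assumptions reconstructed from the narration, not a copy of the actual script.

```r
#!/usr/bin/env Rscript

# Pull the positional arguments off the command line and give them
# names; these are the lines we comment out when working interactively
args <- commandArgs(trailingOnly = TRUE)

input_file  <- args[1] # e.g., data/v19/rrnDB.temp.count_table (assumed path)
output_file <- args[2] # e.g., data/v19/rrnDB.count_tibble (assumed path)
```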
For this benchmarking, I'm going to do library(profvis). I'm not going to put it into my script, because the script itself isn't going to be doing the profiling; the profiling is a tool to help us better understand the performance of our code. So I'll run library(profvis). To use profvis, I can type profvis, open parentheses, and wrap that around the code I actually want to profile. I'll go ahead and highlight that, and if I hit the tab key, everything moves over a couple of spaces. So I'm going to run this, and we'll see what the output looks like once it's done.

Great. RStudio works with profvis to give us some really nice output. If you're not running it in RStudio, so you're running it from the terminal, then this output is exported to your web browser, like Safari or Chrome or whatever you're using. What we see here is a kind of timeline of the commands that were executed. It took 108,650 milliseconds, so about 108 seconds, or just short of two minutes. The x-axis is in milliseconds, running up to about 110,000, which is 110 seconds. And you can see that read_tsv took, as it says, about 69,000 milliseconds. So it took over a minute just to read in the file. You'll recall that the first step is read_tsv; we then rename the representative sequence column to be asv, remove the total column, do the pivot_longer where we take the wide data frame and make it narrow, filter to remove the rows where the count is zero, and then output it.

So, looking at the performance, most of the time, more than half of it, is spent reading in the data. The next biggest chunk is pivot_longer, which took about 34 or 35 seconds. If we could shorten those two steps, the reading in and the pivot_longer, we'd go a long way toward optimizing our code. You'll also see in here that rename, can we zoom in close enough? Yeah, if I zoom in (I'm doing this by scrolling with my mouse), rename, you see here, takes 40 milliseconds. A tiny fraction of a second. Select is also 40 milliseconds. Really, really short. So we wouldn't worry about rename and select, because they're so quick compared to read_tsv and pivot_longer. Let's see, what else is taking a while here? If I zoom out, filter is taking about four and a half to five seconds. And up here, the write is also pretty quick. So if we could speed up read_tsv, pivot_longer, and maybe filter, that would get us much better performance.

If we come back here, there's a package that's actually much better than the readr package from the tidyverse for reading in these wide data frames. Again, the trade-off is usability. But for our purposes here, we're not really using any special read_tsv options to read in our file, so that's not such a big deal. The package we're going to use is called data.table. Again, if you don't have it installed, click Install, type data.table, and then Install. I've got it installed. I'm actually going to add this library call to my script, and I can then run that line. You'll see that it got loaded. I should also point out that there were 50 warnings.
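For reference, here's roughly what the profiled chunk looks like once it's wrapped in profvis(). The column names (the representative-sequence column, total, genome, count) are assumptions reconstructed from the narration, so treat this as a sketch rather than the exact script.

```r
library(tidyverse)
library(profvis)

# Wrap the code to be profiled in profvis(); RStudio opens the flame
# graph in a new tab, and outside RStudio it opens in your web browser
profvis({
  read_tsv(input_file) %>%
    rename(asv = representative_sequence) %>% # rename the sequence column (assumed name)
    select(-total) %>%                        # drop the total column
    pivot_longer(cols = -asv,
                 names_to = "genome",
                 values_to = "count") %>%     # wide to long
    filter(count != 0) %>%                    # drop the zero-count rows
    write_tsv(output_file)
})
```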
About those 50 warnings from running profvis: I've looked at them, and I'm really not sure what's going on; I can't track it down. But it's not such a biggie. So we'll go ahead and load data.table. And instead of read_tsv, we can use fread, which you can think of as "fast read." If we go to the help and do ?fread, we see "fast and friendly file finagler." Lots of f's, lots of alliteration, right? It's used for regular delimited files, and the default delimiter, I believe, is automatic, so it figures it out on its own. But if you wanted to specify sep to be a tab, you could do that. There are all sorts of other goodies in here that you can use, but again, we're going to be in pretty good shape using the defaults. So let's run this again and see what happens when we profile using fread rather than read_tsv.

So that ran, and it was a bit quicker. We see that it took 35.3 seconds to run the whole pipeline, and that fread, the reading in of the file, went from, I think it was almost 65 seconds, down to 4.4 seconds. So really fast, right? It's just so much faster to read in. But again, the trade-off is that, as you get into the more complicated options, fread might not be as convenient to use. Once you learn how to use it, though, it's not that big of a deal.

Our next longest step is pivot_longer, which it tells us takes 25 seconds to run. So we're going to clean that up to speed it up a bit. Instead of pivot_longer, we're going to use a function called melt. I believe it was reshape2, one of the predecessors of the tidyverse packages, that had a function called melt. Let's look at the arguments for melt. If we do ?melt, the page says "fast melt for data.table," and that melt is data.table's wide-to-long reshaping tool. Perfect, that's exactly what we want. We give it the data, the ID variables, the measurement variables, the variable name, and the value name. Our id.vars is going to be asv; that's the identifying variable. And what were the others? Variable name and value name. So instead of names_to, we use variable.name for genome, and instead of values_to, we use value.name for count. You'll recall that our count table has a column for the ASV, that's the representative sequence; a total column with the number of times that ASV shows up across all the other columns; and those other columns are the genomes, so we have the count of the number of times each ASV shows up in each genome. What we're going to do is take all those genome columns and collapse them down; we're collapsing everything but asv. And on line 23, we've already gotten rid of that total column. So this should work, and we'll go ahead and profile this again.

And I think I saw a small error: object 'asv' not found. It's complaining because asv needs to be in quotes. Again, subtle differences in implementation. I'm going to go ahead and close this Profile 3 tab, and we'll re-profile this. It should only take a few seconds. All right. You'll see we've got it down to about 13 seconds. Pretty quick. Our fread took about 3.9 seconds, and our melt got down to 3.4 seconds. So really fast. Now the slowpoke is filter, at about 5.5 seconds. And so you can see, if I kind of put these in order: we started at 110 seconds, give or take, got down to 35 seconds, and then down to 13 seconds.
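Putting that together, here's a minimal data.table version of the same pipeline. In the video I keep the dplyr verbs for the middle steps and only swap in fread() and melt(); this sketch goes all-in on data.table syntax for clarity, and the column names are again assumptions.

```r
library(data.table)

count_table <- fread(input_file) # delimiter is auto-detected; sep = "\t" also works

# Rename the representative sequence column and drop the total column
setnames(count_table, "representative_sequence", "asv")
count_table[, total := NULL]

# melt() is data.table's wide-to-long reshaper; note that id.vars
# must be quoted, unlike pivot_longer's tidy evaluation
tidy_counts <- melt(count_table,
                    id.vars       = "asv",
                    variable.name = "genome",
                    value.name    = "count")

tidy_counts <- tidy_counts[count != 0] # keep only the non-zero counts

fwrite(tidy_counts, output_file, sep = "\t")
```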
If we ran this a few times, we'd probably see some variation in how long the code takes to run. We could optimize the filter line a little more, but I think this is a good stopping point. It lets us see two functions from data.table that are considerably faster than their tidyverse counterparts in readr and tidyr. Again, all we've done here is go from about two minutes to 13 seconds, and the output should not be any different. In fact, if I come back to data/v19 and then, oh, where'd you go? My count_tibble file right here; let's see what happens opening a 3 megabyte file. We see now that we've got this three-column file with the ASV name, the genome it came from, and how many times it showed up. That's really outstanding. It's the same output for the same input, but it's considerably faster.

I'll go ahead and close these profile tabs, and I'm also going to save this. I'll get rid of my profvis code because, again, that was a tool to help me figure out where the bottlenecks were while optimizing. I'm also going to remove lines 17 and 18 to clean things up a bit, and then remove the comments before lines 14 and 15. So what I've done here is added line 10 to load data.table, used fread instead of read_tsv, and used melt instead of pivot_longer. I'll save that, go back to my shell script, and run this next line from my terminal. This should work, and it should take about 11 or 12 seconds. Great, it was really fast, right? I can then run the rest of these lines to finish out the script. And if I do ls -lth on data/v19, I'll see that we have the count tibble, along with the rrnDB unique and align files, as output from running all of this.

What would be really ideal then is to go ahead and close this. I'm not going to save the copy of my R script that I opened up here in Atom, because it doesn't have my changes; those changes were made in RStudio and haven't been picked up by this file yet. So I'll close it and say "Don't save." And again, if I double-click on that file and reopen it, I now see that my changes are here. Let me go ahead and make data/v19/rrnDB.count_tibble, and I'll also do data/v4/rrnDB.count_tibble. Both of those should get created. This might take a little while to run, because that initial upstream step now takes much longer than the R code. Again, we see that this runs quite quickly through the R step, but the upstream part was quite slow. That's not such a big deal in the context of the whole project.

Now I'm going to return to Atom, because I need to update my README to list the dependencies. I'm going to go ahead and add my dependencies here. So I've got R, and within R, I'll say tidyverse, and I'll say data.table. Within R, there are several ways to get the versions of things. The first thing to note is that I'm using R version 4.0.2, so I'll write version 4.0.2. For the packages, I could do library(tidyverse) and library(data.table), and then sessionInfo(), with a capital I and open-close parentheses. That shows me all of the packages I have loaded, and you'll see that I've got data.table 1.13 and tidyverse 1.3.0. Just to make things look fancy, I'll put these names in backticks since they're command names.
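If it helps, these are the sorts of commands I'm running to pull those version numbers; the versions shown in the comments are just the ones from my session.

```r
# The running R version as a string
R.version.string             # "R version 4.0.2 ..." in my session

# Versions of individual installed packages
packageVersion("tidyverse")  # 1.3.0 here
packageVersion("data.table") # 1.13.x here

# Or print everything at once: R version, OS, and all attached packages
sessionInfo()
```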
That then updates the README with my dependencies. Very good. We're then ready to close out the issue. I'll quit out of R and run git status. I see two files have been modified. I'll do git add on the README and on code/convert_count_table_to_tibble.R; again, we didn't change the bash script, all we changed was the R file. Then git commit -m "Accelerate reading and gathering of count table file, closes #20". Then git checkout master and git merge issue-20. That's been merged in. I'll do git push, and if I look at my issue, it's closed.

Now, because my timestamps get messed up when I merge to master, as I mentioned before, I need to go ahead and rebuild those count table files. I probably shouldn't have waited until this point to make the count table files, but I'm pretty confident they'll work. I'll leave you here while I rebuild those; there's no need for you to stick around for that.

I'd love to see how you're adapting what I've covered in this and other Code Club episodes to your own work. Also, feel free to ask any questions you have in the comments below, and I'll do my best to answer them in a future episode. Please tell your friends about Code Club, like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club video drops. Keep practicing, and we'll see you next time for another episode of Code Club.