 Are you ready to learn some R? I thought so. To this point, I've been showing you how you can process data files in Bash using reproducible practices. Now we're going to switch tools and start working with data in R. This will move us closer to quantifying how sensitive and specific Amplicon sequence variants are for differentiating bacterial populations in microbiome analyses. Perhaps you already know a little R. If not, I have several tutorials at the Riff Monos.org website. You can use these for free and you'll be able to learn a lot of fundamentals for analyzing microbiome and other types of data. Typically, when people analyze data in R, they enter commands at the prompt in R Studio. This is a bit like how we can run those Bash commands from the command line interface. It works and is very powerful, but it isn't automated and it is an approach that will play well with the make file. In today's episode of Code Club, I'll show you how you can create an executable R script that can be run without starting R. It'll take input and uses good reproducible research practices. We'll see how we can integrate this script with our earlier analysis. If we haven't met, I'm Pat Schloss. I'm a professor in the Department of Microbiology and Immunology at the University of Michigan. For the past 20 years, I've been studying microbial communities using sequence data. Over that time, I've been developing an approach to make my data analysis more reproducible. Unfortunately, many people often feel that analyzing data with code is more they can bear, but I believe you can answer your own data analysis questions. This and other episodes of Code Club are meant to help you grow in confidence to ask and answer questions about the world around us using data. Now, if you're getting something out of this and the other episodes, please be sure to like this video and subscribe to the channel so you know when other episodes are released, and also this helps others to find the material more easily. More importantly, your subscription will give me such an amazing dopamine rush that I'll be able to make it through the day. Even if you're only watching this video to learn more about R and don't have a clue what a 16s rRNA gene is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please check out the blog post that accompanies this video where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. In one of our previous episodes, the last one that we're actually doing work on the project in bash, we created a script that unique our sequences for each region and then counted the number of times each unique sequence showed up across all of the genomes in our dataset. Now, this was done in mother and the output was a count table file where the rows indicated the different unique sequences and the columns represented the different genomes that we have in our dataset. We've got tens of thousands of genomes. The problem then is that that data frame is really wide. R in general gags when it encounters a data frame that's super wide. As an alternative, what we'd like to do is to make it tidy. What I'd like to have is a data frame that instead of having tens of thousands of columns might have three columns but tens of thousands of rows or hundreds of thousands of rows. And so the row the columns we might have would be say like the name of the Amplicon sequence variant, the name of the genome, and the number of times each Amplicon sequence variant shows up in each genome. The other thing about the output that we currently have in that count table is that there are zeros in there to indicate that the ASV wasn't seen in the genome. Those are kind of just there for structural purposes to make sure that the data frame is rectangular. We don't need those. So I'm going to go ahead and remove those. So let me show you what I'm talking about. Let's go into our project. And remember, we've got that nice rrn alias that we can now use to jump into this directory. And we have our customized prompt that tells us we're on the master branch. It's green. So we're good to go. If you don't believe it, we can do get status and we can see everything is good. All right. So I'm going to go ahead and look at nano for my v19 rndb.counttable. Hopefully this doesn't take too long to open up. So that did take a bit of time to load. I'm sorry about that. But again, you can see that we have our sequences that we've formatted in that last episode in the first column. The second column is the total number of times this ASV is seen across the genomes. And then each of the following columns is for a different genome. And then the numbers in here in these cells indicates the number of times this ASV is seen in each of the genomes. And again, what I'd like to have is a column for the representative sequence, but this is really the ASV, a column for the genome name, and a column then for the count. And we also want to get rid of all those zeros. Now, we can do this with R. And I don't really want to emphasize the R so much. I've talked about the R commands and previous commands, things that we'll talk about like read TSV, rename, gather or pivot longer and filter. And we'll see these again as we go on. But what I want to use today's episode to emphasize is how we can make an R script to do this formatting for us. And the reason it's important to do this format change is because we're going to read in this data multiple times in multiple scripts. And trust me, it takes several minutes to load this in when it's in this very wide format. But when it's in a narrow format, it's nearly instantaneous to read it in. And so I'd rather just speed things up as much as possible. And since this is a read that we're going to do a lot of times, I'd like to create a new file, perhaps something that ends in count dot count underscore tibble, because a tibble is the tidy data frame structure that is part of the tidy verse in R. Alright, let's fire up our studio. When you start our studio, we get the console. We also see that we're in our home directory. What I'm going to do is create a new project. This will make it easier to open our studio to the correct place. I'm going to do it with an existing project. I'll then browse to my documents. Let's see, where'd you go? Schloss R analysis. And this is the project root directory. And so once I'm here, I'm not going to go into any of the directories. I want it to be the Schloss R analysis directory. I'll click open. And this is going to be my project working directory. I'll say click project create project. And wall now my directory that's over here on the right is my project working directory. If I come back over to bash and do LS, I now see that I have a dot our project file. I'm going to go ahead and commit this, get add, get commit, dash M create our project. And for some reason, my get ignore file was modified, it must have been modified when I created the our project. And something that I can do to understand what change would be to do get diff dot get ignore. This will show me the differences. And it shows me that it created a added a our project user line to that. So I'll go ahead and add, get ignore. And I want this to be part of the previous commit. So we've seen this before also get commit hyphen hyphen amend. And this says create our project, which is what I had before. So I'll save that and quit. And we're green. We're good to go. Again, this could have been avoided if I had ran get status before the initial get add. But we worked through it. All right. Again, we see that we've got that our project directory there. And the benefit of this also is that if I'm in my browser, in my finder, I can double click on that and open up our studio to be in this project home directory. Now what I'd like to do is create a new R script. I'm going to go ahead and save this into code. And I will call it convert table to table. Maybe I'll do count table to table. It's a little long, but it's okay. So as an example of what I'm going to do, I want to work through the R code quickly to convert the count table for like, say the v one nine into a table for the v one nine. So let me double check that I know the names of these files data v one nine count table. And this was the path. And again, this turned red, because we've created that R script. Okay, so I'll say input file is this. And I want my output file to be the same thing, except I'm going to have it be tibble. So as I'm kind of building up this R script, I run each of the lines to make sure they work. And I can then say, read, I need to add my library, right? Because I was about to do read TSV. But R doesn't know what our underscore TSV is, unless I do library to tidyverse. So we'll do read TSV input file. And that will read in the TSV I'm going to rename, I'm going to rename that column representative sequences to be ASV. So I'll say ASV equals represent representative sequence. And then I will do select minus total to get rid of that total column. And then I will do pivot longer. And I'm going to say, so that the syntax, if we forget is to do pivot longer to do calls. And I'm going to do minus ASV, because I'm going to gather or pivot longer all the other columns except for ASV names to, and I will then say genome, and values to I'll say count. And I need to add a pipe there and I'll then do filter, count, not equal to zero. And I think that that should do it. So I'm going to go ahead and run all of this. This takes a few seconds. Maybe actually a couple minutes. Because again, this read TSV is really slow. Hopefully I didn't screw anything up. Hopefully this all works. And we'll see the syntax here. So while this is running would be a good time to maybe go like the video and subscribe. If you're not already, I know a good number of people watch these videos, but very few people will leave comments or like the video. So go ahead and do that. And again, it helps other people to find the video. Ah, it found could not find the function select. I forgot a T. The other thing that occurs to me is that I forgot to write this out. So we'll say write TSV output file. And we'll try again. Excellent. So that ran through. And we can then go look and do head data v19, rn db, count, tibble. And we see that sure enough, we now have three columns, the ASV name, the genome and the number of times it shows up. And we've gotten rid of all those extra zeros that were in there for structural purposes. Now, before I get going too far, I want to go ahead and create an issue. Because this is something that we want to be sure to incorporate into our pipeline. So we'll say let's convert the wide data frame, a wide count table data frame to a tidy count to the little data frame. So the count table files have over 15,000 columns, which is too wide to be read efficiently by our convert to a tidy data frame with three columns. So we'll say ASV genome and count. That looks good. We'll go ahead and submit this issue. And so this then is issue 19. And I will do get branch issue 19. Great. And so we've got that there. To kind of show how this is more efficient, if we do read TSV output file, let's see how long it takes to run. Bam, that's fast, right? Versus the two minutes that it took to run and load, we've now put it into a structure that's far more efficient to be read in by R. So this is good. Okay, well, this is an R script. And we could interact with this as we just have in, I'll make this a little bigger, in our studio by, you know, copying and pasting it into the console down here. But what I'd really like to have is for it to be executable from the command line. And ultimately, I'd like it to be able to take in any inputs or any output files we might give it. So let's start with making it executable. So when we made bash scripts before to make those executable, you might recall that we added at what's called a shebang line. So the shebang line for R is user bin and R script. And we'll also put in vanilla. So you might remember from two episodes ago when I was talking about an alias for R that we didn't want to load previous sessions and we didn't want to save this current session. Well, vanilla does the same thing. R script is a special version of R that comes with your R installation that allows you to run commands from the command line. And vanilla is the way to make sure that we're adhering to those good reproducible practices. So if I save this, I can then come back to my terminal, I can do that ch mod to make it executable. So code, convert count table to table, right? And then I can do code, count, sorry, convert, not count. All right. And this now runs. Right. And so I don't have to go into our studio to run it, which is pretty convenient. Because again, I could theoretically put this into my make file, and have it run through make for me, just to make things automated. So again, this takes a couple minutes. I'll speed up the video and we'll come right back. That ran through very well without any errors. And again, shows you how we can create an executable R script that we can run from directly from the command line without having to open our R studio. So the next issue are this input file and output file names are hard coded. I'd like to be able to change this, right? So I might want to use this code for v four or three, four, any other regions of the 16s are RNA gene. To do this, what I'd like to do is bring in arguments from the command line, kind of like we did with bash, you might remember with bash that we were able to use the dollar sign one to get those arguments from the command line. In our we can use command args, and then say trailing only. And we also really want to say true, we don't want the default of false. And so this then will be I'll call args. And for now, to demonstrate how this works, what I'm going to do is I'm going to go ahead and comment out comment out the rest of the code. And I will show you what actually happened. So what what would do is if I do print args one, I'll do print args two. And I'll save that. So you'll also want from red to black. And I can then rerun this command, but I'm going to give it two arguments. So the first I'll give it is data v one nine count that table. And then the other argument I'll give is data v one nine r and db count dot tibble. And what I'm expecting to see is output to my screen of these two file names. And so it takes a second because it has to load the tidy verse. But then sure enough, it outputs my two arguments. So that's how we can get arguments into an R script. Now what we're going to do is I'm going to again, remove those comments. And instead of hard coding the file name, I'll here I'll put args one. And here I'll put args two. And I can get rid of these print statements. And probably what would be a good practice, like we saw with the bash scripts, is to put in some information about the script. So I'll say name, convert count table to tibble r input a mother formatted count file output a tidy data frame with a SV genome and count as columns. And note, we expect input to be in order of input output. All right, so I think that's enough information to go with. And I will now go ahead back to my command line and run this. And we'll run it and we'll be back in just a second. That went through. And I can do lsl th data v one nine. And see that the count tibble was just created 1004, Thursday morning, August 20. So that works great. Again, I could also put in v four v four, five, and we'll get the same output. So I'm pretty happy. I'm pretty happy with the way this works. Something that maybe we'll come back to is that this read TSV is really slow, really painful. And there's actually some special r functions that we can use from another package outside of the tidy verse that will make this run in about a tenth of the time. So be much faster. But that's for another day. We'll probably do that in the next episode. I'm going to go ahead and quit out of our studio. And I'm going to come back to my count unique seeks script here. And what I'd like to do is to run this on the as part of my count unique seeks script. So we'll do code, convert count table to tibble.r. And the inputs, again, are going to be it's going to be this this stub temp count table. And then the output, we will call dollar sign stub dot count, tibble. And we're not going to need to do this move anymore, we're actually going to want to read remove. So clean up the garbage collection a little bit. And this will output it in the format that we want. And we'll also get our unique align file as well. This is good. And I'm noticing that I didn't put in any comment up here. So a name, count unique seeks.sh. And we'll say that this takes in faster file, and uniques and counts the sequences, it then converts count table to count converts, converts the count table. Sometimes Adam can be too helpful converts count table to a, yeah, sometimes it can really to try to be doing too much into a table with columns, ASV, genome and count to indicate the number of times each ASV appears in each genome. We'll say the input is a is the target, actually. So it's the target, the unique align file. And that reminds me to double check that this will work if our target is actually count dot tibble. So we'll need to remember to also update our make file. And I think this will actually work because what it's matching on to get the target name is right up to that R and db and it ignores that extension that so it removes the target. So it could be either the count or a line file. So the count tibble or a line file. And that is also the output. So we'll save that. That should work. We'll test it, of course, and then make file. I'm going to come back down to where we had count table. It will make tibble. I will also add that script that we just created. So code convert count table to tibble dot r forward slash that should be good. And I'll go ahead and test this say with like the v4 region. I think the v4 is going to work a little bit faster because there's fewer ASVs. So if we do make data v4 R and db count tibble, see if this works. So it's run through the mother steps. And now it's into our R script, as we can see from the output here. That worked great. Go ahead and do it LSLTH on data v4. And we see that we've got count tibble. We also have unique align. And we have in here the old R and db count table files, which I don't want to keep around. So I'm going to go ahead and do RM data star R and db count table to get rid of those old count table files. I find that yeah, we could save them, but it gets confusing, especially when it's not a target. And we're cleaning doing the garbage collection, that if we don't adequately do all the garbage collection, then keeping these extra things around can be confusing because we might think, oh, this is a good intermediate file, when perhaps we actually change something upstream. I'll go ahead and change that. And at this point, I think we're in good shape. Again, in the next episode, we'll see how we can speed this R script up a bit. But what I'd like to emphasize to you, and we can look at it here in Atom, is the two things to make something, or three things, I guess, to make something executable as an R script. We need the shebang line at the very first line. We need to use the command args function to bring arguments in from the command line. And as we saw, we could use chmod plus x to make that script executable. So we're in good shape. I'm going to go ahead and close this issue merge it into the master branch. And then I'll go ahead and build the other targets. So to remind you how we can do this, if I do get status, these all look good, I'll do get add make file, code count, and then code convert. Okay. And then get commit. And I will say, convert count table to table closes number 19, get checkout master, get merge, issue 19. And we're good to go. Before I push it, I'm going to go ahead and build my other targets. So I will do make data v19, rndb, count table, just to rebuild that. Soon this will work. And then I'll go ahead and push it up. This ran through took a while to run. I'm going to go ahead and push it up. And we're in good shape. Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going back through the code on your own to help reinforce your new skills. It would be great if you could take the ideas that we've worked through today and think about how they relate to your current projects. A great place for an executable R script is generating plots. I generally have a script for each plot that I generate in a paper. I can execute these scripts from the command line anytime I need to change the plot's appearance or update the underlying data. I'd love to see how you're adapting when I've covered in this and other episodes of Code Club. Also, feel free to ask any questions you have below in the comments and I'll do my best to answer them in a future episode. Be sure to tell your friends about Code Club and to like this video, subscribe to the Riffimona's channel and click on the bell so you know when the next episode drops. Keep practicing and we'll see you next time for another episode of Code Club.