Hey, welcome back for another episode of Code Club. I'm your host, Pat Schloss. I had a great response to the last episode, where I introduced the creation of our R markdown documents. I've found these to be incredibly powerful tools for reproducibility; they're perhaps the most important tool in my reproducibility toolbox. That's largely because you can trace every value in a manuscript back to the raw data file, especially if you're using a Makefile to track all of your upstream data processing steps. Also, if something changes in one of those upstream steps, then through using make and these R markdown documents it's possible to regenerate the values without having to search for every change. It's pretty awesome. We've seen this before, where perhaps we change some setting in an upstream processing step and then regenerate the whole document and look for where things might or might not have changed. But something didn't quite sit right with me about the analysis we did in the last episode. We analyzed the full-length sequence data, and then we looked at the V4 data as well. But how would we process these two sets of data together? Say I wanted to make a plot showing the relationship between the number of ASVs and the number of copies of the 16S gene in the genome across all four regions, instead of just one region at a time. In today's episode, we'll see how we can use map functions from the purrr package to combine the output from functions together. Specifically, we'll use the map_dfr function to combine the count table files, using the output from the read_tsv function for the full-length sequence data as well as the files from the V4, V34, and V45 regions.
We'll then update our last R markdown file to compare the sensitivity and specificity of amplicon sequence variants across these four regions. Along the way, we'll also see how we can group by two variables instead of just one using the group_by function. Now, all of this might seem like an incremental advance in the project, but it's really setting us up for big things in the coming episodes, where we start to think about how we can group our data by taxonomic level, or by different taxonomic groups, rather than by genome. Even if you're only watching this video to learn more about R and don't have a clue what a 16S rRNA gene is, what an amplicon sequence variant is, or what V4 is, I'm sure you'll still get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along, welcome. Please be sure to check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's episode is below in the notes. As I mentioned, what we'd like to do is combine the data from the full-length sequences, the V4 region, the V34 region, and the V45 region all into one mega-tibble, if you will, one mega data frame, that we can then use with our group_by and summarize functions, as well as with our other tools from the tidyverse like ggplot, so that we can directly compare across our different regions of the 16S gene. Let's go ahead and open up an issue for this. I'll title it "Compare genome-level ASVs across variable regions." In here, what we're going to need to do is build count tibble files for V34, V45, V4, and full-length. I'm pretty sure we already have the V4 and full-length files. In fact, I know we do, because we used them in the last episode. We have the machinery for V34 and V45, but I don't know that we've actually built those.
We will then want to create a script to combine the count tibble files, and then refactor our exploratory data analysis from 2020-09-09. As you've seen in the past, I create these checkboxes but never really check them off as we go along, so maybe I'll do a better job of that today. So this is now our new issue; this will be issue 23. Coming to my project root directory, I'm going to go ahead and create the branch: git branch issue_23, git checkout issue_23. And we're on our branch. Of course, if we do git status, we see everything is up to date. Now, if I look at the contents of data/v19, I see that we have rrnDB.count_tibble. What I could do is ls data/*/rrnDB.count_tibble, and I see I only have the files for v4 and v19. So I need to generate the others, but first let me make sure that these two are up to date. I keep hitting the wrong thing on my keyboard. If I run make on these two files, it should tell me everything's up to date, which it is. What I can do then is go back and change these targets to v34 and v45, and this will run. This is going to take a bit of time, so I'll probably do some nice editing in the video so it goes a little faster. But while this is running for you, it'd be a great opportunity to like this video, subscribe to the channel, and click on the bell so that you're notified when the next episode is released. Great, this all ran through. Let me rerun make; it should tell me everything is up to date. And so we're good to go with these four variable regions. I'm going to remember to check my little box here. The next thing I want to do is create a script that's going to combine these four tibbles into one monster tibble, if you will. So let's go ahead and open RStudio. I'm going to do that by opening my Rproj file. You might be realizing that I like to do things from the command line.
It's much easier than searching for RStudio over here in my dock; I like to type, who knows. All right, so this is firing up RStudio. I see that it's putting me in some other directory, but let me double-check my working directory. getwd() does in fact have me in my project root directory. Maybe I'll click on that to get me back here. I'm going to go ahead and open a new R script, and the first thing I'm going to put in it is library(tidyverse). As I mentioned in the introduction, we're going to use functions from the purrr package, and what you'll see is that when you load the tidyverse, you also load purrr. If you haven't noticed already, people who make R packages love to put extra Rs in things, and people seem to like cats, so who knows; you'll see purrr in there. All right. So the first thing that we want to do is think about how we're going to read in our four tibble files. I can come back to my terminal here and do ls data/*/rrnDB.count_tibble, and these are my four files. To make it easier on my typing, I'm going to copy these in here, wrap each of them in quotes, and separate them with commas. And I'm going to wrap this in parentheses. One of the nice things about RStudio is that, as you saw, if I highlight these rows and type Shift-9 for the open parenthesis, it also includes the closing parenthesis. And if I make that a c(), that is now a vector of my data files. I'll call these my tibble_files. Maybe I'll tab that in just a smidge. So now we have this tibble_files vector. Now, what I could do, and perhaps what you'd be tempted to do, would be to create a variable where we read in each of these tibble files. We would have four lines; we'd call read_tsv four times to create four data frames. And then we might use something like bind_rows to combine those together. And that would certainly work.
But the challenge with that would be: say I come up with another region; say I want to look at the V6 region. I don't know why I would, but maybe I do, right? Then I'd have to come in here and update my vector of file names, and then I'd have to add another row. All that copying and pasting of read_tsv is repetitious, and as I think we've talked about before, we want to keep our code DRY; we don't want to repeat ourselves. Don't repeat yourself: DRY. So what we're going to talk about today is using map functions from the purrr package to do this for us. If we pull up the help for map_dfr, what we'll see in the help page over here on the right is that there is a variety of map functions. These all come from purrr, and they allow you to apply a function to each element of a list or atomic vector. So what we can do is use a map function to apply a function to each value in our tibble_files vector. What function might we want to use? read_tsv, right? There's a variety of flavors of map here, and the one we're going to use is map_dfr, which returns a data frame (df for data frame) created by row binding or column binding: dfr is for row binding, and dfc is for column binding. So we'll read in, say, v19 and v4, and we'll bind those together row-wise. If we did it column-wise, which probably wouldn't make a lot of sense, we'd have v19, v4, v34, and v45 sitting side by side. So we want to do it row-wise. What we see from the usage for map_dfr is that we can give it .x for the vector of names and .f for the function. So we'll start there. What we can do is call map_dfr on tibble_files, and the function that we want to give it is read_tsv. To be explicit, we could say .x = tibble_files and .f = read_tsv, though typically when you read people's R code, they're not doing that. So we run this.
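The two approaches described above can be sketched like this; the file paths follow the project layout shown earlier, and this is a sketch rather than the episode's exact code:

```r
library(tidyverse)

# The vector of count tibble files (paths from the project's data/ layout)
tibble_files <- c("data/v19/rrnDB.count_tibble",
                  "data/v4/rrnDB.count_tibble",
                  "data/v34/rrnDB.count_tibble",
                  "data/v45/rrnDB.count_tibble")

# The repetitious approach: one read_tsv() call per region, then bind_rows()
# v19 <- read_tsv("data/v19/rrnDB.count_tibble")
# v4  <- read_tsv("data/v4/rrnDB.count_tibble")
# ... and so on, followed by bind_rows(v19, v4, v34, v45)

# The DRY approach: map_dfr() applies read_tsv() to each path and
# row-binds the resulting data frames into a single tibble
count_tibble <- map_dfr(.x = tibble_files, .f = read_tsv)
```

Adding a V6 region would then only mean adding one more path to the vector, rather than another read_tsv line.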
And what we see is that we get a tibble with about 112,000 rows and three columns, which is outstanding. In the output, we got this "parsed with column specification" message four times, once for each of the four regions that it read in. Now, what I don't see in here, unfortunately, is any indication of which region each row came from. Well, if you look over here on the right side of the help, you'll see id = NULL. Reading down a little bit: id lets us enter a string or NULL, and the default is NULL. If it's a string, the output will contain a variable with that name. All right, so we will add .id, and I'm going to say "file_name". That is going to create a column called file_name, and I think it should put the name of each of the files in that column. So let's see what it does. I can see already it didn't quite do what I wanted: it inserted a number for each of my files. So let me read this a little closer: the output will contain a variable with that name, storing either the name (if .x is named) or the index (if .x is unnamed) of the input. What it means by "named" is that this is an unnamed vector, so it's indexed instead of named: one is this, two is this, three is this, and four is this. So we could name it, and there are a couple of ways of naming it, so let me show you a couple of these. I'm going to copy this line to illustrate them. We could do v19 = the first path, v4 = the second, v34 = the third, and v45 = the fourth. And maybe, so it all fits on one screen, I'll go ahead and add a line break after each of those. Now when I run this and look at tibble_files, I'll see that I have a name for each of these, so I could do tibble_files["v4"], and it outputs the value associated with that name. So this is now a named vector. That's one way of doing it.
I'm going to comment that out and show you another way of doing it, which will get us a little closer to where we want to go to make this generalizable, in case we want to add another region. We can use the names function on tibble_files. What this means is: take the vector that's on the right and apply it as the names attribute of our vector. So what I'll do is "v19", in quotes, then "v4", "v34", and "v45", each in quotes. Again, if I look at this version of tibble_files, we see that there are no names. Now when I assign to names(tibble_files) and look at it again, I see that it's named. And the nice thing is that I can then run that with my map_dfr, and my file_name column now has the region, only it's not really a file name anymore. Let's rename that column to region. Running that again, I see that I have all of my regions. And this is good, right? So again, there are a couple of different ways of putting names on things. I'm going to use the names function for now, and we'll come back and revise this in a bit. Now, one thing that you might be asking is: well, read_tsv can take a number of arguments, and one of those is how we specify the column types, right? And there are lots of other arguments. So how might you get those arguments into read_tsv? What we see over in the help here is this ..., for additional arguments passed on to the mapped function. What that means is that I can add the argument col_types, and in quotes, using single-character notation, I can put the column types in here. The columns are asv, genome, and count, and when read_tsv guessed the column types for us, it got them right. So with single-letter codes I can put "ccd" (c for character, d for double, or n for number). Again, this tells read_tsv what the column types are, and I'm doing it from within map_dfr. Now if I run this, you'll see that we no longer get that output telling us the column specifications.
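The naming and argument pass-through just described might look like this; a sketch, with paths assumed from the project layout:

```r
library(tidyverse)

# Naming the vector makes .id store "v19", "v4", ... instead of 1, 2, ...
tibble_files <- c(v19 = "data/v19/rrnDB.count_tibble",
                  v4  = "data/v4/rrnDB.count_tibble",
                  v34 = "data/v34/rrnDB.count_tibble",
                  v45 = "data/v45/rrnDB.count_tibble")

# Arguments after .f fall into map_dfr's ... and are handed straight to
# read_tsv: col_types = "ccd" says asv and genome are characters (c) and
# count is a double (d), which also silences the column-spec message
count_tibble <- map_dfr(tibble_files, read_tsv,
                        col_types = "ccd", .id = "region")
```

Because the vector is named, the region column comes out as "v19", "v4", and so on, one value per input file.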
So again, you can pass information into your function in the same parentheses block, if you will, of your map_dfr. That's pretty slick. The next thing we want to do is output this. We'll output the combined data frame with write_tsv, and I'm going to put this into data/processed, so not one of the data/v* directories but my processed directory, and I'm going to call it rrnDB.count_tibble. That is great. And now if I run this and come back to my terminal and do ls data/processed, I see that I have my rrnDB.count_tibble. I can do things like head data/processed/rrnDB.count_tibble to see the top, and I can also do tail on it. Okay, so we see we've got everything there, and that looks great. Now, we could save this as an executable and be in good shape. But again, we might want to change what regions we're looking at or add different regions; perhaps we want to look at V5, V3, V35, what have you. So to keep things generalizable for the future, I'm going to make this an R script. Now, I don't always remember the shebang line, and I don't always remember how to get arguments in. So what I'm going to do is come back to my files, go into code, and open up one of my old R scripts. I'm going to shamelessly copy its header into my new R script, as well as the args line, which we used to get arguments into the R script. So again, the shebang line is #!/usr/bin/env Rscript --vanilla. The name of this R script is going to be combine count tibble files, so let me say combine_count_tibble_files, and let me save the file with that name. Let's see. We'll go into code, and then I'll plop that in. It's not letting me paste. combine_count_tibble_files.R. All right. The input is a tidy version of a count file for each region; the output is going to be a composite tidy count file with a column to designate the region. And I'm going to delete this.
And we'll come back to what we might put in there. Now, to get this to be executable, remember that we need to do chmod +x code/combine_count_tibble_files.R, and we can then run code/combine_count_tibble_files.R, and this should work. Yeah, that works pretty nicely. I was hesitating about whether it worked, because I put this args line in here but we didn't pass any arguments. What I'd like to do is comment out these two lines and make the arguments that I'm going to pass into this script my tibble files. Okay, so that's going to bring in these four file names. But we also need to get our names for that vector. So maybe I'll bring this down to keep everything together, and I'll do names(tibble_files). What we're going to use is a new function that I don't think I've shown you before. I'm not going to spend a lot of time talking about it now, but we'll come back to it in a future episode: str_replace. This comes from the stringr package, which is also part of the tidyverse. The string that we're going to modify is tibble_files, which is a vector of strings. For the pattern, I'm going to say, in quotes, "data/", and I'll briefly explain what this all means in a second. All right, so what we're going to do is use str_replace, which takes the strings in tibble_files. The pattern it's going to look for is data, a forward slash, something in between, another forward slash, followed by rrnDB.count_tibble. If you look at these tibble file names, they all match this pattern. What's being matched in the parentheses is the variable region, and what this "\1" does is take that captured value and replace the whole pattern with whatever is in those parentheses. And we can see that this works: if I run just the highlighted area here, you'll see that it outputs those variable regions.
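Putting the pieces together, the finished script would look something like this; a sketch assuming the file names and header described in this episode, not necessarily the exact code:

```r
#!/usr/bin/env Rscript --vanilla

# Sketch of code/combine_count_tibble_files.R
# Input:  a tidy count file for each region (data/<region>/rrnDB.count_tibble)
# Output: a composite tidy count file with a column naming the region

library(tidyverse)

# The input file names come in as command-line arguments rather than
# being hard coded, so adding a region doesn't require editing the script
tibble_files <- commandArgs(trailingOnly = TRUE)

# Capture the region (the part between data/ and the next /) and use it
# as the name; map_dfr's .id then turns the names into the region column
names(tibble_files) <- str_replace(tibble_files,
                                   "data/(.*)/rrnDB.count_tibble", "\\1")

map_dfr(tibble_files, read_tsv, col_types = "ccd", .id = "region") %>%
  write_tsv("data/processed/rrnDB.count_tibble")
```

The backreference "\\1" in the replacement stands for whatever the parentheses in the pattern matched, so each full path collapses down to just its region name.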
So one of the problems with doing it the way I had on line 14 is that if I got those names out of order, that would be a real mess. By doing it programmatically this way, I'm less likely to cause myself problems. All right. So what we've changed now is that instead of hard coding the tibble files, we're going to bring those in as arguments. What I can do, of course, is run code/combine_count_tibble_files.R, and I will then give it data/v*/rrnDB.count_tibble. That v* will match all four of my variable regions. I run that, and if I look at the timestamp on my data/processed file, I see that it's 11:02, which is right now. So that worked great. Again, I can do head data/processed/rrnDB.count_tibble, and I can do tail to make sure that also worked. And we're in good shape. Okay, so I think that's mostly what we want. The note that I'm going to put in here is that the input file names (I never realized how many typos I make when I type until I started recording these videos) need to be data, then the region, then rrnDB.count_tibble. Okay. And so again, if you're using this down the road and you change the file format, the naming system, or the directory structure, then you'll have to update that. All right, this is all good. I'm going to go ahead and remove these couple of lines because they're not really necessary. And so here we are: a really simple bit of code to combine those files. Now let me come back and open up my Makefile, and I'm going to come down here towards the end. I will add a target for data/processed/rrnDB.count_tibble, and then I'm going to do a bunch of copying and pasting here. Don't need that there. So the prerequisites are going to be v19, v4, v34, and v45. And I'm going to put my R code first in that list, so that's going to be code/combine_count_tibble_files.R.
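Here's a quick way to see the shell doing the glob expansion, using throwaway directories that mirror the assumed project layout:

```shell
# Build a throwaway copy of the directory layout (names assumed from
# the project: one data directory per region, each holding a count tibble)
mkdir -p demo/v19 demo/v4 demo/v34 demo/v45
touch demo/v19/rrnDB.count_tibble demo/v4/rrnDB.count_tibble \
      demo/v34/rrnDB.count_tibble demo/v45/rrnDB.count_tibble

# The shell, not the R script, expands v* into all matching paths,
# so the script receives four separate file-name arguments
ls demo/v*/rrnDB.count_tibble
```

This is why the script never needs to know how many regions there are: adding a demo/v6 directory would make the same glob pick up a fifth file.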
I think that's what I called it: combine_count_tibble_files.R. Yep, we're good. You may recall, when I talked about Makefiles, that there are special variable names we can use; I'm going to use one of those here, the dollar sign caret, $^. What that means is: in place of $^, put all of the prerequisites for the rule. And because I put code/combine_count_tibble_files.R first, that is going to call the executable, and the others will all become the arguments. Okay, so to demonstrate this, I'm going to go ahead and remove data/processed/rrnDB.count_tibble, and then I'm going to make data/processed/rrnDB.count_tibble. You can see the output here of what it's actually running: it runs all of that, and it uses those files as the arguments to our code. Again, I can do my head data/processed/rrnDB.count_tibble; that looks good. tail data/processed/rrnDB.count_tibble, and that all looks good. Excellent, we've gotten that all to work. I'm going to return to my issue: we've created the script to combine the count tibble files. Now, I don't want to spend a lot of time modifying our previous Rmd file, but I do want to briefly show you how we would do that. So I'm going to come back to my exploratory directory and open up my Rmd file. All of this is going to stay the same, except I'm going to change the variable from full length to, say, count_tibble. All right. And then here, instead of v19, I'm going to put processed. And I should go ahead and run these lines. What you might notice is that there's this play button to the right; if you click on it in RStudio, it'll run that chunk for you. Something else you'll notice is that, in this case, it put the output of the tibble right inline here. I believe you can change that. How can you change that? I think there's a way to change where it outputs. Yeah, so we can say Chunk Output in Console. I'm going to go ahead and do that, and I'll remove that inline output.
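The Makefile rule described above would look something like this; a sketch, with target and prerequisite names assumed from the episode:

```make
# $^ expands to the full prerequisite list. Because the script is listed
# first, the recipe runs the executable with the four count tibbles as
# its arguments.
data/processed/rrnDB.count_tibble : code/combine_count_tibble_files.R\
		data/v19/rrnDB.count_tibble\
		data/v4/rrnDB.count_tibble\
		data/v34/rrnDB.count_tibble\
		data/v45/rrnDB.count_tibble
	$^
```

Listing the script as a prerequisite also means the target rebuilds whenever the script itself changes, not just when the input data does.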
If you like it inline with your Rmd code, more power to you; I prefer it in the console, again, just the way that I work. Maybe here I'll go ahead and add col_types, and in quotes this is going to be "cccd": region, asv, and genome are all characters, and count is a double. Again, if we run that, it all looks great. And I'm going to change fl to be count_tibble. Now, before, when we did this, we grouped by genome because we only had one region. What we can also do is group by region and genome, which will group by both of them. So we will then have summary data for all of our regions and for all of our genomes. And this then should give us... well, this isn't going to change, because regardless of the region you're looking at, you're going to have the same number of copies of the 16S gene in your genome. So we'll do that, and this gives us output. So for some reason... yeah, I think it's... yeah. So there we go. This new message from summarize about dropping groups is probably going to drive me nuts, just because it's new and it's going to take me a while to get used to it. So what I'll do up here, for counting things, instead I'll say filter for region equal to v19 and output that. For this part, where we're looking at the total number of copies per genome, I'm not going to group by region. We'll run that, and that looks good. Here, we'll also filter for region v19; we don't need to group by region then. We run that, and we again see our table output here: we have this peak at about seven, and that's probably because there are close to 1,000 E. coli genomes in the database. All right. So for determining the number of ASVs per genome, we can put count_tibble in here. And here we do want to look at it region-wise, so we'll say region, genome: we're going to group by region and genome to get the number of ASVs and the number of operons.
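The two-variable grouping just described can be sketched like this; the toy values are made up just to show the shape of the operation:

```r
library(tidyverse)

# Toy data standing in for the real count tibble (values made up)
count_tibble <- tibble(
  region = c("v19", "v19", "v4", "v4"),
  genome = c("A", "B", "A", "B"),
  count  = c(7, 3, 7, 3)
)

# Grouping by two variables yields one summary row per region/genome
# pair; .groups = "drop" also silences the dropping-groups message
count_tibble %>%
  group_by(region, genome) %>%
  summarize(n_copies = sum(count), .groups = "drop")
```

With one region this collapses to the genome-level summary from the last episode; with four regions, each genome simply shows up four times, once per region.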
And then we're going to group by region and number of operons to get the median and the lower and upper quartiles for all of those. What we see in our output down below is our region, along with the number of copies, the median number of ASVs, and the quartiles. I'm going to go ahead and add the mean number of ASVs. Again, what we'll see is that the data aren't normally distributed, so the mean probably isn't the best statistic. But for some of those regions where we have fewer ASVs, you know, just getting 1 isn't as informative as, say, 1.4. Now, because we have an output table with 80 rows, it's really difficult to look at. I'm going to go ahead and filter to look at those genomes where the number of rrn operons is seven. Okay, and then we'll see how many ASVs are in those genomes for our different regions. And so you can see v19, v34, v4, and v45: for seven copies, we have five, two, and one ASVs per genome where there are seven copies. So again, this is putting together all the information that we generated separately for full-length and V4. Now, let's make a plot of this. We can take count_tibble, group by region and genome, and summarize all of this. I'm going to modify this, because if we run it as is, we're going to get a single line. What I'd really like to see instead is a line for each region. What I can add here is color equals region, and what this will give us is a different line for each region. You can see the steepest line is for v19, and the other regions have flatter curves. Okay. I'll add some text here to say that the subregions of the 16S gene have fewer ASVs per genome, or per rrn operon. Now we'd like to determine whether an ASV is unique to the genomes it's found in, again across all those regions. So we'll update this to, say, count_tibble, and we'll group by region and asv, and then summarize. Let's see what we get when we run this.
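The one-line-per-region plot might be sketched like this; the toy summary values and the column names n_rrn and mean_asv are stand-ins, not the episode's exact names:

```r
library(tidyverse)

# Toy summary data standing in for the real output (values made up)
count_summary <- tibble(
  region   = rep(c("v19", "v4"), each = 3),
  n_rrn    = rep(1:3, times = 2),
  mean_asv = c(1, 2.1, 3.4, 1, 1.2, 1.5)
)

# Mapping color to region draws a separate line for each region, so the
# steeper full-length curve can be compared directly to the subregions
ggplot(count_summary, aes(x = n_rrn, y = mean_asv, color = region)) +
  geom_line()
```

Without the color aesthetic, ggplot would pool all the regions into one line, which is exactly the single-line problem mentioned above.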
We'll see that it's still grouped by region, and we can then count n_genomes. We see that for each region we get the number of genomes, and how many ASVs are found in that many genomes. We run this, and we find that the number of ASVs that are in one genome is about 19,000 for full-length, which is about 82%. Again, to simplify the output, I'm going to filter to find the fraction of ASVs that are unique to one genome by doing n_genomes equal to one. We can see then that, as we saw before, for full-length it's about 82%; for V4 it's about 76%; and V34 and V45 aren't much different. So we see that about 76% of the ASVs from the subregions were unique to a genome. Okay, what we've done here is taken that secondary analysis and embedded it into our overall analysis, so I'm going to go ahead and remove this section because it's redundant with what we've already done. I can save this, and I'll go ahead and knit it to make sure there are no errors as we run it. This output that popped up is telling us what was going on when it read in the library. And here's what the markdown document that gets spat out looks like. Looks pretty nice. I'm going to go ahead and close that window, and this is saved. Coming back to our issue, we've done all three of the things we said we wanted to do. So what can I do? git status; I'm going to go ahead and commit my changes. We will do git add on the Makefile, the exploratory 2020-09 files, and the combine script. Okay, those are all staged. I'll do git commit -m, and the message will be along the lines of combining the count tibble files and updating our genome-level exploratory analysis, closes #23. That commit message was probably a bit long, but so it goes. We can then do git checkout master, and then git merge issue_23. That's all good. And we can then git push.
And returning to our issue tracker, we see that the issue has been closed. Awesome. So again, we're building up our analysis as we go. The next step is that we'll go up to the species level, but to do that, we're going to have to learn a new function called inner_join. So really, be sure that you're subscribed to the channel so you know when the next episode drops, where I'll be talking about how we can read in and join different data frames together. This analysis is moving forward; I'm really excited about it, and I'm glad you're joining the journey with me. If you've got any questions or comments, feel free to put them down below this video on YouTube. I'd love to hear from you, and I'd love to hear how you're incorporating the things I've been talking about into your own projects. I've gotten a lot of great feedback on Twitter from people about how much they've learned from these episodes. Share that with me: let me know what you're learning, what you really like, and what you take away from the video. If there's something you don't understand, or something you do a little differently, by all means let us know down in the notes as well. Until next time, keep practicing, and we'll see you next time for another episode of Code Club.