 Welcome back. I'm Pat Schloss. Many episodes ago, I introduced you to the PivotLonger function. PivotLonger is a key tool for getting our data to be in a tidy format, where each column represents a different variable. Now, in that episode, I had a table where each row represented a date, and each column represented the price for different sized lamps. If I wanted to plot the prices for each size of lamp, though, I needed to have a single column for the size of the lamp, and another column for the price of the lamp, along with another column for the date. So I could take a data frame that was many columns wide, and condense it to three columns wide, but would be very long. In the recent series of episodes where I've been discussing Amplicon Sequence Variants, or ASVs, we have also used PivotLonger. We regularly use it to convert a table that has separate columns for each taxonomic rank, things like Kingdoms, Genera, or Species, into a tidy format. In the tidy format with the taxonomy data, we have a column that indicates the rank, as well as a column that indicates the taxonomic name of that rank, things like Bicillus subtilis or Escherichia coli we would expect to see in the species rank. But sometimes we want the wide format, because that is tidy for the application we're working on. For instance, thinking back to those pricing data, perhaps I wanted to plot the price of light lamps against the price of heavier lamps. In that case, a tidy data frame would have separate columns for the prices of the two classes of lamps, rather than just one. Similarly, in the last episode, I wanted to get the number of ASVs per species. In that episode, we could group by using the species column more easily than if we had a long data table format where we had a column for rank and a column for taxon. The point that I'm trying to make here is that whether a data frame is tidy really depends on the context and what you're trying to do. But how do we get our data into a wider format? Well, in today's episode, I'll show you. The function we'll talk about today is PivotWider. We've actually seen it a few episodes ago. In the episode where we did all the self-joins to build out the NCBI taxonomy for each genome, I kind of snuck it into a pipeline. In today's episode, I'll review PivotLonger and then we'll really dig into using PivotWider to pull our data back apart so make it more human-readable so that we can then use the Cable function, which will be a new function, from the Knitter package, which is part of the R Markdown ecosystem, to make our output look nice in our R Markdown document. Now, even if you're only watching this video to learn more about R and don't know what a 16SR RNA gene is or what an AmpliCon sequence variant is or why you should even care, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome. Please be sure to check out the blog posts that accompanies this video, where you'll find instructions on catching up, reference notes, and the links to supplemental material. The link to the blog post for today's video is below in the notes. In addition, I know we've been having a lot more people watch episodes lately and I want to reassure you that if you have questions or comments, please be sure to leave them down below in the comments. I do my best to answer everything that comes through. So you'll recall in the last episode, we were working on issue 30. We accomplished what I had set out to do in this first comment that I seated the issue with. But I occurred to me as we were kind of going through that material and developing the R Markdown document, that it would really be nice to see a table that would indicate the number of copies of the ASVs for each of the species, the number of genome sequences for each species, as well as like the average copy number. So the plots are great, but sometimes it's nice to have a little bit more context of kind of who we're talking about in terms of species, who some of those outlier points are, an attempt to maybe better understand what's going on in the data. So last time we did not close issue 30, we were still working on issue 30. Something that we haven't talked about before, but I can see the history of my previous commits by typing get log. And this then gives me an output showing where I am and where I've been, right? So we can see all these past issues as well as who committed them, right? So it's me on everything. And we see that we're currently on issue 30. And the last commit was to build a faceted plot by ASV rate by number of genomes and that this addressed issue 30. So we're gonna add another commit to this before we close it. But again, you can see where you've been. I find this to be really nice because if I've left a project and come back to it, I can quickly run get log to see where I've been and so I can remind myself of what I've been working on. To get out of this screen, if you type hit the Q key, you'll come back to your prompt. We see we're in issue 30. I'm gonna go ahead and open RStudio. So I'll do open Schloss, our analysis, the Rproj file. As our studio opens up, we see that we're in our project root directory. I'm gonna go ahead back to my files tab and open up the last R Markdown document that we were working on. And you can see all the header material that we're used to seeing at this point. I think my date here is wrong. It wasn't the fifth, it was the 15th, judging from the title. So I'm gonna go ahead and update that. And this all looks good. So to remind you what we've been looking at in this document, this first header loads the tidyverse as well as the here package, which is really nice for working with paths in our Markdown documents. We've joined together our metadata along with our information about what ASVs belong to which genomes. And so that's the metadata ASV data frame that I'll go ahead and load here. We then had some of this text about the number of ASVs per species increasing with sampling effort. And we found that it does increase with sampling effort. The V19, the full length, increased much more quickly than the smaller regions. And we created this data frame species ASV where we got the number of genomes, the number of ASVs, and the ASV rate. So the number of ASVs per genome. And this is grouped by the species as well as the region we're looking at. And then this allowed us to generate a plot which I will go ahead and regenerate here. And this is where we talked about making faceted plots. And so we see what it looks like down here in the bottom right corner. And there's, of course, things we can make this due to make this look a little bit nicer. But that's cool. So I think what I'm going to do is, yeah, we're going to come down and I'm going to make a new R chunk. And remember, that's the three back ticks with the curly braces with an R in the middle. And then you end the chunk with three back ticks. Okay. So before I go to that, actually, what I want to do is step back, probably giving you a headache, scanning back and forth like that, is to let's take the ASV meta, forgetting what the metadata ASV, metadata ASV, and we can see that this is what we typically would think of being in a wide format, right? We've got, as I mentioned in the introduction, all of these columns for the different taxonomic ranks. We've been working with the species column, and so we've been, we like this format as it is, right? Because it's much easier to work with the species column than to do like, you know, to filter and then, you know, then do group by, here I can do, I can select on that species column and go directly to group by species, and I have the data in the format I want. What we've seen before is that we can narrow it, we can tidy the data using the pivot, why pivot longer data function, right? And so you'll recall that we have to give it the columns that we want to pivot longer, that we want to pull together, if you will. And so I'm gonna do kingdom, phylum, class, order, and you've seen me do this in previous episodes without much comment. I just wanna briefly review it, right? And those are the columns that we're gonna pivot longer. The names of those columns, so names two are gonna go into the rank column. So we'll get a new column called rank and the values of those columns will go to, I'll call it taxon, right? And so again, if we run this, you now see that we've gone from a very wide data frame to a much more compact, tidy data frame. So it's much longer, right? So this has almost 900,000 rows whereas this other version had 100,000 rows, right? But you'll see that we have a column for rank and taxon. And so this is pivot longer. Another way that we could have done this, if you didn't wanna type out all that stuff, would be to, would I say use a negative approach so we could do pivot longer. And actually, you know what? I'm gonna copy and paste this down because much of the syntax is the same. And so instead of telling it which columns to pivot together, to join together, I'm gonna tell them which ones not to bring together, right? So that would be like genome ID and region, ASV and count, okay? And so those all need negative signs. And the negative means don't include this, right? And so running this, we then get the exact same output that we had previously, right? And so which way you go is really up to your, your preferences, whether you're a positive person or a negative person. I kind of like listing out the things I want rather than things I don't want because if the data frame, the input changes somehow and there's a new column that's added, being explicit about the columns I want kind of allows me to ignore that extra information. Whereas if there's an extra column added, then I would probably need to add it to perhaps be ignored in this negative approach, right? So again, it's up to you. So this brings it together. What if we wanted to then pull it back apart, right? And so what we could think about would be to say, let's call this tidy, okay? So to take this narrow data frame and make it wide, we wanna use pivot wider. And what we can do is tidy and pipe that to pivot wider and we will then say take the names from and this is gonna be coming from the rank column because those values, in the rank column, like kingdom file and class order family are gonna be like new column names and the values of those columns, we're gonna set with values from and that is gonna come from the taxon column. And what we see now is that the output is basically the same as metadata ASV with one small difference and that's that the count column in metadata ASV is at the very end and in our wider version, it's before the file or between the kingdom column, right? So otherwise it's the same exact data frame, same number of rows, same number of columns, it's good to go. So this has been a brief introduction or review of pivot longer and pivot wider. Remember that long makes it long and narrow, pivot wide makes it wide but short. In general, R prefers narrow, so not a lot of columns and I mean, you can throw hundreds of thousands of rows at it, no problem, but you throw hundreds of thousands of columns at R and it struggles. So, but anyway, let's think back to our issue now and what we wanna look at is this data frame species ASVs and again, we want a table that has the species name, the number of genomes, the average number of RNs and then the number of ASVs across genomes for each region of the 16S RNA gene, okay? So sometimes I like having a little mission statement like this at the top of my code chunk so that I know what I'm doing here and again, if I come back to it, it's easy for me to see what I've done. So again, species ASVs, we see that we have the region, the species, the number of genomes, the number of ASVs and ASV rate. So basically what I want is I'm gonna take my names from the region column and then I'm going to separate out then the number of ASVs for each of those regions. I don't care about the ASV rate so maybe what I'll start is with a select and we'll start with species, region, N genomes and NASVs, right? So that gets rid of that rate column and now I'm gonna pivot wider, right? Pivot wider and I'll say names from equals region, values from is gonna be NASVs, good? And so now we see that we've pulled it apart, right? We've made it wider and so it's much easier for me to see that arcobacter poisonous has like one genome and one ASV across all the genomes. So I see I don't have the average number of RN copies in here so I'll need to go back and add that. Something I like to do sometimes with these big data sets where there's, you know, 5,000 rows is to throw in a test case. So I'll add filter species equals equals Escherichia coli and so I see that it's got 958 genomes and all these copies, okay? Another thing that I might do is look at the tail, the back end and here I see something weird that I've got an NA column in here for Vibrio para hemolyticus. So maybe what I'll add is a sort or range. So I'm going to range by species and we'll tail that and let's head it. Huh, I wonder what happened to that one. Maybe what I'll do is I'll make sure I'll see what's going on, not head, tail. So Vibrio para hemolyticus and so I will throw that into my filter here and see what this looks like. So what's going on is that we have two rows in here actually for Vibrio para hemolyticus and that's because if you recall way, way, way back when we tend to have more genomes represented for the subregions than for the full length for whatever reason. I think there were maybe like 20 extra six or 20 or so V19 full length genes that we're missing you know perhaps ends of the sequences got truncated somehow. And so in this case, it seems that there was one or there's one that is different. So there's really 33 genome. So what I'd like to do is I can modify this but what I want to show you is that because so that what's going on here, right, is that if you think about what we had here and let me throw the filter up here for this para hemolyticus is that what we see is that this is the Vibrio we've got the four regions we've got four rows and we've got different number of genomes for those four regions, right? And so what it's trying to do is take these regions to make four columns and it's pulling apart the four ASVs. But the problem is that this is V34 with 33 genomes and this is V19 with 32. And so because these aren't the same it makes separate rows, right? And so we see this then in this next step with the pivot wider, right? And so if N genomes were 33 for both of these then they'd be on a single row, but they're not, right? So let's see about fixing that. And what I'd like to do is I'm going to group by species and I'm going to then say mutate. So I don't know if we've seen this before but you don't have to run summarize after group by. So we can mutate within each species, right? And so a mutate is gonna change a column for each species separately, right? And so what I can then do is N genomes equals and I'm gonna say max N genomes, right? And now what I wanna then pipe that into is I'm gonna do an ungroup to ungroup the species level. And now what we see is that it comes through and it gives us a single row back, right? And again if we'd have put this filter up above here let's say here, we'd see that they all have the same number of genomes and so the number of genomes, the species names are all the same and those then can be collapsed into a single row within four columns for each of the regions. So good, that works great. So I'm gonna go ahead and remove that filter and remove that for now and we look at that, great. So something that I said was missing was the average number of copies from the average number of copies of the operon in each genome. So I'm gonna come back up here to species ASVs where I'm defining all these things and I'm gonna make a new column in my summary output called NRRNs and this is going to be, let's see where we are at this point. So I'm gonna highlight these three rows and look at the output and I think I dropped out the count and so what I wanna use is the count. So if I sum up the count, right, that's the number of times each ASV, so this ASV shows up three times in this gamma proteobacterium for the V19 region and so if I then add up the total number of operons which is the count for each species, so that would be say some count and then divide by N genomes, that will give me the average number of copies per genome and the average number of RNs, right? And so if I look at species ASV, this takes a couple seconds to go, we now see that we have N genomes and ASVs NRRNs and so that looks good. Now if we come back down to our chunk here and run it, let's see if we get that column and let's see, oh, and I missed N ASVs here, so I want to put NRNs, run that and we get the same problem again. And again, that's because there's probably one E. coli genome out there that or one or two that have, that maybe have eight copies rather than seven copies and so that's why we see a subtle difference there. I'm not looking for like a precise number, I just want it to be the same number. I mean, these genomes all have the same number of copies, they vary a little bit, again because of the operon or the gene detection algorithm that's being used and it seems to be very good. So something I want to show you briefly and this reminds me is the values, value fill, I believe and so let's give that a shot. So there's an argument and I'm spacing on it, so pivot, let's look at the help and let's see. Values fill, so I get a value fill. Okay, so if you do value fill, then when it makes an NA value like it did in this case, instead of an NA it plugs in a zero and so depending on whether or not you want it to be an NA or a zero or what the data represent, you decide what you want that values fill to be. So sometimes I work with, so like count data, right? So how many times do each of these taxes show up in a community and if I have it in a tidy framework, I might remove all the cases where it's a zero but then when I go back to wider but they need to be a zero because it's like a structural zero, right? It wasn't detected, it should be a zero count. So in this case, I don't think values fill, I mean, we're gonna get rid of this. So that should go away but I wanted to briefly show that you can tell pivot wider what that value should be. So how do we fix it? Well, I'm gonna do the similar thing that we did here where we're gonna say n are ns equals max n are ns. That looks good and now let's run it and now we see that again, if I put this filter back say here that the species values are all the same for Escherichia coli, the number of genomes is all the same, the number of our n copies is all the same. So those three columns will be collapsed into one row and then it's gonna be the regions, the four regions and the four NASVs that are gonna be pivoted wider. Great, so excellent and I am going to remove this output of E. coli and I will call this wide, I'll call it a count table, all right? What I'd like to do then is to output my count table and there's a handy package called Cable that we can use and what I wanna remind you is that we can do count table and we've already seen this here, the arrange function, right, and so say I wanna arrange by n genomes, this will be an ascending sort starting with those that have one copy, if I want the descending sort, I use DESC and this should put E. coli up at the very top and so we see that E. coli, Seminella, Bordetella, Pratesis, these occur at the top, right? And so this is, these are genomes that we have a lot of data for and so we can see that E. coli has more ASVs than genomes but it also has about seven copies per genome, which is great. Mycobacterium tuberculosis only has one operon but has 11 ASVs across those 180 genomes, right? So it's a bit of a different story than we see with E. coli. We can then also look at the top n and we can look at the top n where we say we want the top n of, say we want, let's do 10 and we wanna tell it what columns, so we'll say n genomes, top 10 and the output doesn't really change except we only have 10 rows in our data frame now. The package that we're gonna use is called, or the function is called Cable, K-A-B-L-E and this is part of the Nitter package, which isn't automatically loaded in my experience. So I'm gonna go ahead up here and put in library Nitter KNITR. I'll run that to make sure it's loaded. Come back to the bottom here and run this and what you'll see when we run this is that we get a table and this is actually the markdown format of a table and so that's pretty nice, right? There's all sorts of things that you can do with Cable. There's another package. I'll say C also Cable Extra. Maybe it's Cable Extra for extra bells and whistles that you can throw onto your table. And I'm gonna go ahead and make this into a pipeline and let's see, we'll pipe this to Cable and we can give it some arguments to make it look maybe a little bit nicer. So I'll say caption is the 10 most commonly sequenced species. And so if we run this, then we get a caption on our table. Another thing I see here in this NRNs column is that we have a lot of significant digits. So I could also say then digits equals two and this will then trim it to only two digits to the right of the decimal point, which looks pretty nice. And go ahead and save that. And if I knit it while this is running, it's a reminder to go ahead and subscribe to the channel. Be sure you click on the bell so you're notified when the next episode is released. So if you kind of scan down through here, and again, this is markdown format, you see the table is outputted here and it actually puts the caption at the bottom with that cable extra package. There's all sorts of things you can do to maybe make the output a little bit more attractive. There's other arguments that you can use in cable to adjust kind of the spacing of the column. I believe that's a line. And then you can kind of say left, center or right. So we could say as a string, right? So we could do R, C, L, and then let's do C, C, C, C. And so this then puts the first column on the right, the second column in the center, the third column on the left, and these others are centered as best it can. But I like the default. Thought that was pretty nice. So we'll come back to our terminal and I'm gonna go ahead and do make exploratory, 20, 20, 10, 15, markdown. This will again run like we saw before. I do a get status. I see that I've updated our markdown. I've created a new markdown file. That figure didn't change at all. So I'll go ahead and do a get add exploratory, 20, 20, 10, 15, all that. And then I'll get commit to add a table showing counts of most abundant species. Closes number 30. And then get checkout master and get merge. The issue branch, issue 30. Our studio is complaining in the background. I'm gonna ignore it for now, get push. And yeah, it thinks it deleted it. Do I wanna close this file now? Sure, we'll go ahead and close this. Go ahead and close our studio and return to my issue. I see it's built in both commit messages, right? And if I come back to my code, exploratory, and then my markdown file, I can look down. I see my nice figure. It looks better in this larger space. And this is what my table looks like. For me, again, these types of tables are game changers. Back in the battle days, before I knew about our markdown and cable, I would manually enter these numbers into Microsoft Word. If, say, a new version of the database came out and they updated what was in there, I would then have to come back in and manually change each number. It was just so tedious. But with this, I don't have to change anything, right? It's really nice and it's really convenient. Again, you see the caption is down here at the bottom. I'd probably prefer to have that up here at the top. You can play around with that cable extra package to learn more about how you can change that. And we'll probably do a deeper dive on that in a future episode. So, again, please be sure that you subscribe to the channel, like the video. If you have any questions or comments, please feel free to leave a comment below in the notes. Tell your friends about this. I'd love to see what you're doing with your our markdown documents. Have you ever made a table in R or wondered how you could do that? Give it a shot with cable and see what you can do. Also, remember that whether or not your data is in a tidy format really depends on what you're trying to do. We can make our data more narrow using pivot longer and we can make it wider using pivot wider. So, give those functions a shot, play around with them and see how you do. Keep practicing and we'll see you for another episode of Code Club.