 Hey there and welcome back for another episode of Code Club. Today we're going to take a page out of rock and roll history and look at different ways of representing distributions for different categories. We often run into situations where we would like to compare the distribution of data across different categories. We saw this in the last episode where we made a jitter plot, also called a strip plot, and a box plot to show the average numbers of copies of the 16s RNA gene in each taxonomic group in each taxonomic rank for the different species of bacteria in our database. Those plots are great, but at least for our case we have a strong skew to our data and I'm not convinced that either approach did a great job of handling the large amount of data in our data set. So today we're going to use R to make several other plots that should give us a better understanding of how the copy number varies by taxonomic rank. We'll revisit what we did in the last episode and discuss how to make one dimension histograms and density plots, effectively turning that strip chart on its side. We'll also make what I call two dimension histograms and density plots. These ladder plots are also called ridge line plots because they resemble the profile of a mountain range. They've also been called joy plots because they resemble the cover of the Unknown Pleasures album of the band Joy Division. Because unfortunately the band Joy Division took their name as a reference to Nazis, they missed out on the opportunity to be memorialized in data visualization fame. Don't be a Nazi. Even if you're only watching this video to learn more about R and don't have a clue what a 16SR RNA gene is, or what a gene is, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along what would like to, welcome. Please be sure to check out the blog posts that accompanies this video, where you'll find instructions on catching up, reference notes and links to supplemental material. The link to the blog post for today's episode is below in the notes. For today's episode, I'm going to do something a little bit different. You'll notice that our issue window, we see that we have one open issue where we're trying to collect resources. I haven't put much in there. But then we also have 28 closed issues. These are all the issues that we've worked through in previous episodes. So if I click on that closed link, I see that there's add an annotation layer to plot to indicate the average of averages, right? So this was issue 29 that we worked on in the last episode. I'm going to click on that to open it up. And I'm going to then say, let's reopen the issue. Because as I mentioned in my introduction, I want to look at different ways of representing the data. So I'm going to say, I'm a bit concerned that that the strip chart doesn't adequately show data well. Let's look at alternatives. And so the alternatives we look at, again, would be a box plot. I think we've already kind of ruled that out. But we'll talk about it again. We can also think about histogram. We can also think about a density plot. And then we can also talk about a ridgeline plot using a density or histogram. Okay, so again, I'm reopening this issue. And I'm going to leave that comment. And we'll then return to our terminal. I'll go to my project route directory. And I can then do get checkout issue 29, because we already created that. And again, if I do get status, I'll see that we're all up to date. And we're good to go. I'll go ahead and open up my Schloss Rproj. And I see that we're in our exploratory file, directory, and then also that we're in our console at our project route directory. And I don't know that it's necessary to keep running getwd, but that's what I do to reassure myself that I'm in the right working directory. I have not been disappointed yet. All right, so I'm going to go ahead and come back to this RMD file from October 5. And if I scroll back up to the top here, you'll see that in the beginning, we read in our metadata in our amplicon sequence variance data, and we then merge those together to make metadata ASV. We went through all this in previous episodes. And then in this section, we talked about plotting the number of our RN copies per taxonomic rank. And what we did, you'll recall, is that we grouped by speech group by genome ID. But we also used all those taxonomic ranks. So I could get the total number of copies of the 16 s gene per genome. And then we also grouped by the species so that you could get the average for the species, because some of the species had many, many genomes represented, whereas others maybe only had one. Of course, there's others that have zero. So, you know, we're only dealing with what's already been sequenced. And then we we pivoted that to then use our group by and summarize to get the average number of copies across species within each of the taxonomic groupings for each taxonomic rank. And if we run each of these code chunks, we can then see what the plot looks like. And again, down here in the bottom right is the wind is the plot that we wound up with at the end of the previous episode. And again, I think this looks pretty nice. One challenge with it, of course, is that there's a lot of over plotting going on. And you know, it's hard to get a sense of like how many points really are right in here for species versus family or whatever, right? Do the distributions really change a whole lot? It's hard to say, right? There's definitely more species than there are families. So is the distribution really meaningfully different? We'd like to see what that looked like. And recall, we played with things like, you know, changing the alpha, the transparency of the points, we also used a box plot. And I'll come back to that here in a minute. So what I'm going to do is I'm going to create another our code chunk. And again, we can do that with the three back ticks, and then introducing the brace with an R in it. And I'm going to for now, I'm going to copy down this code for building the plots. And maybe I'll just kind of do some, I don't know what you call it scrapbooking, or just, you know, doodling, doodling, if you will, with data visualizations. And again, this was what our strip chart looks like without all the extra ornamentation added to it. Maybe I'll add some extra space to the bottom there. Okay, so the first thing I want to play with is going back to the box plot. And again, x is going to be the rank, y the mean number of our N copies. And I'm not going to need any of this stuff. And so if we run and build these box plots, we do get a sense of the distribution, right? So like this horizontal line is the median. This upper line is the 75th percentile. This is the 25th percentile. This line, though, extends to be, I believe it's one and a half times the difference between the 75th and 25th percentile. So it's not really representing anything about the distribution, right? It's not a 95% confidence interval. It's something else, right? The other problem I see is that there's a lot of points up here. And there's, if we've jittered them like we did back here, we see there's a lot of points that are kind of being collapsed to fall on top of each other. So that's not ideal either. So what we want to do today is we want to think about how we can draw the distribution of these points for each taxonomic rank. Where we'll start, we'll be with geohm histogram. And histogram takes different aesthetics. So the x isn't going to be the rank. It's going to be the mean RRNs, right? So we'll have across the x, basically the y-axis here. And as I said earlier, it's kind of like flipping it on its side. And the y-axis, rather, will be the number of times that we see that many RRN copies. And then let's go ahead and we'll make this color. So we'll color each rank to be a different rank, right? So kingdom and species will be a different color. And if we run this, then we see a couple things. So first of all, I always do this. I say color when I really mean fill. Color is the color of the line. Fill is the fill, the fill in color, right? So that looks a little bit better. The other thing we see that I don't like is this kind of sawtooth structure to the data. And as you'll see that it's using bins equals 30. And so it's setting 30 bins across our 20 copy numbers, right? So this is going to be a case where like a species or a different taxonomic group had an average say between like, you know, three and four copies. What we can do, we could either say bins equals like 20. Or what would be better would be to do say bin width. So let me show you bins equals 20. That looks a little bit better. But you kind of see a little bit of that sawtooth here for this group between probably nine and 10. What I'd rather do is bin width equals one. So each bin has a width of one. And this will give us a more smooth shape to our distribution. The thing that kind of sucks about this also, however, is that the data are stacked on top of each other, right? So at the very top, you see phylum with this brown color. And you can't even see the red of kingdom because there's only one of each kingdom, right? I'd rather see the distributions kind of on a common basis, right? So they all go from zero up to whatever their count is. So that's not ideal. Something else we could do is we could say position equals say dodge. And that dodges the columns to be next to each other. This also kind of sucks. I'm not a big fan of that either. And so this I think, I think where histogram would work well, would be if we were to say filter, what would it be? Rank equals species, rank equals equals species. And then pipe that in. And I'll leave all this the same way. And so we can kind of see that this looks a bit better. Again, of course, we wouldn't need to put fill equals rank in here because there's only one value there. But again, I think the histogram probably works best. We only have one category where you're trying to make a comparison. Okay, so this isn't what we want to do. An alternative might be to do geom density. I'm going to go ahead and remove this filter rank, and remove all these arguments. And here we get instead of stacking, we get a layering, right? And so it's kind of hard to see what's going on behind these. So I'll do is say alpha equals a point two. And again, you can kind of see the data stacked on top of each other. It's a little bit hard to see what's going if because there's there's really no separation kind of into the screen, right? The other thing I kind of see is the sawtooth look to the curve. And so what geom density does is it basically takes what we previously saw with the histogram and fits a smooth curve through it. And there are ways to manipulate what that curve looks like. So if we do question mark geom density, there is an argument that was called, and why am I forgetting it? Pretty sure it's adjust. Where to go down here? Yeah, adjust. So it's a multiplicative bandwidth adjustment that makes it possible to adjust the bandwidth while still using a bandwidth estimator. So good. So what we can, we can say is say adjust equals, let's say 0.5, 0.5. This makes it look even more jagged. All right, if we do one, I think that was the default. If we do two, that makes it look really smooth, right? And then this 1.5 makes it look a little bit more jagged, right? And so again, we can kind of see the data that these distributions really do tend to fall kind of right on top of each other. If you've seen me talk about data visualization though, you know that having these seven colors that really to me like kingdom and species don't look that far apart, it's hard to know which curve in here is which. And so while the shapes largely look very similar to each other, it's really hard to distinguish what's going on for each of those taxonomic ranks. So I'm not such a fan of this. Now, so this again, what I mentioned was kind of like a 1D histogram or density plot, right? Where there's no separation through the screen or to the back of the screen, right? Everything is just layered on top of each other. There's a package called ggriges, which I already have installed and I would encourage you to install as well, ggriges. And what that allows us to do would be geom density ridges. And the arguments are an X. So we're going to put me and RNs on the X and then the Y axis will put the rank. And I'll show you what this does here in just a second. And so again, what we get is we get a density plot for each of our taxonomic ranks. And you can hopefully see how this kind of looks like that album cover from the Joy Division band. It doesn't show Kingdom, I think, because there's only two levels within Kingdom or Kia and bacteria. And so it probably can't fit the data very well. So instead of the density, we can use stat equals bin line. And this will show the size of the different bins, right? And so it's much more of a histogram look. Again, we see this sawtooth look and it tells us that we can use bins or bin width. So let's go ahead and do bin width equals one. And that looks a lot better. One thing you'll notice is that the histograms kind of overlap on top of each other. And we can change that with the scale argument to geom density ridges. So if I do scale equals one, this makes it so that they don't overlap at all that the top it's scaled basically so that the tallest Kingdom is the same height for all of the other taxonomic ranks. And so they don't overlap on each other, right? So if I did scale equals two, then they all overlap on top of each other. That's much more of like what that album cover looked like. If I did say 0.5, then there's more separation. Maybe I'll do 0.9. So there's a little bit of separation across them. And again, I can see the shapes of the distributions. And they really do look pretty similar to each other. I can now go ahead. So I like this. I'm a fan. We can go ahead and clean it up a bit. And I'm going to steal some of our styling from up above here and repurpose it. And I'll go ahead and add the theme classic x. So it's going to be y equals null x is going to be the mean number of copies per genome. Let's see title I'm going to change to be the distribution of our n copies per genome is pretty consistent across ranks. Okay. And then each point represents a sequel taxon within that rank numbers based on an average species copy number. So I think that's fine. And instead of scale x discrete, we want scale y discrete. So go ahead and set that. So that looks pretty good. And again, when we render it as a knitter document, the plot will look better, we hope. And what I'd like to do is go ahead and add a point to indicate where the mean is. And actually, you know, the mean is really only relevant if we are drawing, if we're describing a normally distributed data set, these are not normally distributed. I think what would be better would be the median. And so for my median of means, I will make that median of means. And then this will be a median mean are our ends. And this will be the median rather than means, right? And so again, we're going to use n row median of medians. And this will then be, that's fine. This only came in when we were putting up a summary value like we had in this GM segment. So this would be median means I think that is good. This will be median. That also. All right. And okay, so let's see, I think this all works well. And like I said, I'm going to add GM point. And my data is going to be median, median of means. And as so x is going to be median, I forget the column name already. And I actually never loaded it. So let me go ahead and reload this code chunk. And that all ran without an error. So I updated everything appropriately. And again, my median of means has a median mean are ends. And so that's going to be my x. And then my y, I'm going to set to be one, to n ranks, right? And we'll see how that works. Yeah, so we get a little dot on our plot, don't necessarily want it right on the x axis, maybe I want it up a little bit. So I'll do y plus say point two, it's not too high. And that looks pretty good. Maybe I'll add to my title. Or my subtitle, the dot represents the median for the rank. So that looks good. So I'll go ahead and knit this. It's complaining. So I'm getting an error that it can't add this to a theme object. And I'm kind of not sure what that means. But I do notice I'm missing a plus sign here. So maybe adding that will make it go away. Did. And so again, we get our nice themes, our nice taxonomic ranks on the y axis, that looks pretty nice. I'll go ahead and retry to try to re knit that. That works. I'm going to go ahead and open up is exploratory 2020 10 on project copy number, trying to find the actual image. So it was in files. Just keep hitting tab eventually we'll get there. And so I think it's chunk three dot one three dash one. And so we see again, what the plot looks like. And that looks that looks pretty nice. So of course, we have, we have two things going on here, two different types of plots. I think what I'll do is maybe I'll move this code chunk up up here. And that way we'll have two figures. And I'll say here's another way of looking at the data. Save that, knit it. And we're good. And I want to again, we want to double check that this works with make make nothing to be done for that, which makes me worried that I don't actually have a make file make rule for this. And if I open up my make file and come back down to the bottom. Yeah. So I don't have I don't have a rule. So let's remember how we make the rule for that. And I'm going to copy and paste some names here. So this is the target. This rmd file is the dependency. And then the data that we need to build this, I think is going to be the same as these, right? So I'll go ahead and copy these. And remember, if we go over multiple lines, then we need to use the backslashes. And I'm going to copy this recipe to build the rule. And we'll change the name to be this. And if we wanted there's now that we've got several rules here, there's probably a way that we could clean this up a little bit to make it cleaner. But I think it's okay for now. Good make that and it tells us it's up to date. And we're in good shape. So I'll go ahead and get status. And I will go ahead and get add my make file and all my exploratory stuff for 2020 10 five. And that's all good. And we'll do get commit dash m, and we'll say create a ridgeline plot of average number of our n copies for species. I think we're now closing it closes number 29, for sure. Get checkout master get merge issue 29. And we can get push. And we see it closes it out again. Let's come back and look at what this looks like in our exploratory directory. So we know that everything is good. We'll look at the MD file. It's really nice that markdown embeds this stuff into the web page. So it's easier to look at. And sure enough, we see it here. That looks really nice. You know, something we might think about doing would be to perhaps trim this at like 15. But we do have some species that come out here to 21 or so. So anyway, I hope this has been interesting and gives you another way to think about how you can look at the distribution of your data across different levels of a categorical variable. This is something that comes up a lot. And I think it's particularly useful for cases like ours where we don't have normally distributed data that we've got this long skew out to the right. And so it really helps us to see what the data looks like in a way that I don't think our box plot did a very good job of. And perhaps our our jitter plot, maybe overemphasized what was going on out in the tail. Anyway, I think this is an attractive way of looking at the data and and tells a pretty convincing story that the distributions really don't change much for at least, you know, species through family or order level, which is mainly where we're going to probably be looking at those region, those levels for most of our analysis. So keep practicing with this. Again, I'd love to see what you're coming up with. How might you use a ridgeline plot in your data analysis? And if you have any questions, feel free to put them down below in the comments. And until next time, we'll see keep practicing and we'll see you for another episode of Code Club.