Hey, welcome back for another episode of Code Club. I'm your host, Pat Schloss. In my experience, the gateway to learning more about the tidyverse in R is learning how to make plots using the ggplot2 package. The syntax with ggplot is relatively straightforward and, dare I say, intuitive. If you don't believe me, try making the plots that you currently make without ggplot. The generic syntax for building a plot is to have a ggplot line where you define your data frame and your aesthetics. This line is followed by a line with a geom, indicating how you want the data represented in your plot. But did you know that you can use multiple geoms to plot data on top of other data? Not only can you use multiple geoms, but you can also use multiple data frames to generate your plot. In today's episode of Code Club, I'm going to show you how you can display your data using multiple geoms and multiple data frames. We'll see how this plays out as we revisit a plot that I made in the last episode, which plotted the number of copies of the 16S rRNA gene for each genome in each taxonomic group at any taxonomic level. Even if you have no idea what taxonomic ranks or taxonomy are, and you're only watching this video to learn more about R, I'm sure you'll get a lot out of today's episode. Please take the time to follow along on your own computer. If you haven't been following along but would like to, welcome! Please be sure to check out the blog post that accompanies this video, where you'll get information on catching up, reference notes, and other materials that you'll find useful. The link to the blog post for today's episode is below in the notes.
So I've already created an issue branch in my terminal. Again, I can do git status and see we're all good, we're all up to date. I'll go ahead and open up RStudio, and I'll double check that I'm in the correct working directory, and that's where I want to be.
Again, if I look at files, I see in my exploratory data analysis that the file we were working on last time was from October 5th. I'm going to keep working on that and see if I can't make this window a little bit smaller. Okay, so to bring us back to where we were, let me go ahead and rerun the initial code chunk at the top, and I can do that in RStudio by hitting this play button. It loads and runs that code chunk, and then if I go ahead and run this here, we see a plot in the lower right. You know what? I'm going to try to live a little and turn off this editor option, Chunk Output in Console, and see what happens if we do it from within the window here. And for some reason it's not putting it there. Maybe I screwed something up. Anyway, we can see what it looks like down here in the bottom right. And what I'd like to do, again, is have something on top of this.
And up here in this part of the code chunk, let me just come back. So up here we read in our metadata. This had our genome ID and taxonomy information, the taxonomic information for every genome we had. We also read in the ASV file, so we know which ASVs correspond to each genome and how many times they show up in each of those genomes. We then joined those together here on line 25. Then we went through a series of steps so that we got a representative average number of copies per species. And we then made our wide data frame tidy to create a series of columns where, again, if we look at rank_taxon_rrns, we see that it has three columns indicating the rank, the taxon within that rank, and the mean number of copies there. And what you'll also notice is that rank is a factor. And so the nice thing is that we see on our x-axis that our categories are plotted in the order of that factor. So that's great. And so again, as I was looking at this, the points over here get kind of muddled. I'm using geom_jitter with an alpha of 0.3.
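As a rough sketch of where that recap leaves things: the real data come from the files read in earlier in the analysis, so a small simulated data frame stands in for them here. The names rank_taxon_rrns and mean_rrns are assumed spellings reconstructed from the spoken names, and the taxon labels and copy numbers are made up for illustration.

```r
library(ggplot2)

# Toy stand-in for the episode's data: one row per taxon within each rank.
# The values are simulated, not real rRNA copy numbers.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)

rank_taxon_rrns <- data.frame(
  # Making rank a factor with explicit levels fixes the left-to-right
  # order of the categories on the x-axis.
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  taxon = paste0("taxon_", seq_len(sum(n_taxa))),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)

# One ggplot() line with the data and aesthetics, followed by one geom.
p <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = 0.3, alpha = 0.3)
p
```

Because rank is a factor, reordering its levels is all it takes to reorder the columns of points.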
And you'll see that if you have basically three points on top of each other, that's going to be a solid black point. I could go to something like 0.1, but that's going to make everything else really faint. And it doesn't really help me see where the average is, especially for species or genus. So I'll go ahead and put that back to 0.3. And what I'd like to do is show you again that we can put geoms on top of geoms. So what we'll do is we'll do geom_boxplot. And we'll keep that with no arguments. And what this is going to do is use all the same aesthetics that geom_jitter had used. So if we go ahead and run this, we see now that we get our box plot on top of our geom_jitter. So there's a couple of things that aren't great here. First of all, what we notice is that, well, we don't have the same alpha, but it's also double plotting the points. So this solid point for phylum is an outlier in geom_boxplot that also shows up for geom_jitter. So we're double plotting points, which is not great. The other thing is that our box plot is on top of our jitter plot. So again, this is why it's really important to remember that we're looking at layers. If I move geom_boxplot ahead of geom_jitter in my pipeline here, now all my points will be on top of my geom_boxplot, to the point where I really can't see what's going on there. All right. So I think I would like to have my geom_boxplot on top. And so there's a couple of things. This is white, and that's because the fill color on the box is white. And so it's masking everything else. And also we've got these outlier points. So I always forget how to do this. But if I do ?geom_boxplot, it will tell me how I can turn off those outliers. And so we see there's all these arguments for setting characteristics or attributes of the outliers. And down here, if we look at the section on those arguments, we'll see that sometimes it can be useful to hide the outliers, right?
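To make the point about layers concrete, here is a minimal sketch on simulated data (not the episode's data set; rank_taxon_rrns and mean_rrns are assumed spellings of the spoken names). The two plots differ only in the order in which the geoms are added, and whichever geom is added last is drawn on top.

```r
library(ggplot2)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)

base <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns))

# Box plot added last: the white boxes cover the points beneath them.
box_on_top <- base +
  geom_jitter(width = 0.3, alpha = 0.3) +
  geom_boxplot()

# Jitter added last: the points are drawn over the boxes.
points_on_top <- base +
  geom_boxplot() +
  geom_jitter(width = 0.3, alpha = 0.3)

box_on_top
points_on_top
```

Both geoms pick up x and y from the ggplot() line, which is why geom_boxplot() needs no arguments of its own here.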
So we want to use outlier.shape = NA. So again, if I come back and put outlier.shape = NA as an argument for geom_boxplot and we run this, we now see that we no longer have that double plotting going on. And those outliers for the box plot are gone. So the other thing that again stands out a little bit is that the fill color is white. And I would rather have fill = NA. And so that will say no fill, right? And so that looks pretty decent, although it's really hard to see what's going on. So one thing I might do with my geom_jitter, instead of alpha = 0.3, is to do color = "gray", right? So this looks fairly decent. I don't know that I really like it all that much. It does show the bounds on the distribution: the edges of the rectangle indicate the 25th and 75th percentiles. The line in the middle is the median. The whiskers extend as far as the most extreme point within one and a half times the difference between the 75th and 25th percentiles. So this works, this looks nice. It's one way to do it. I'm not a big fan of overplotting a box plot on top of a jitter plot. I think it's kind of messy, kind of noisy. So let's see what else we could do.
Okay. So again, this was showing how we could plot one geom on top of another geom. What I will do now is modify this. And what I'd like to do next is perhaps put a single dot on each column to indicate the average, right? So this is going to force us to think differently, because rank_taxon_rrns contains a value for every one of these points, each being a different taxon within that rank. So I need to summarize this to get kind of a mean of means. So I'm going to call this mean_of_means. And I'm going to remind you of what we did last time, where we grouped by, in this case, rank, and then summarized. And we'll do mean_of_means equals the mean of mean_rrns. Okay, maybe I'll call it mean_mean_rrns.
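Going back to the box plot overlay for a moment, the cleaned-up version might look something like the sketch below. The data are simulated stand-ins and the object names are assumed spellings of the spoken names; outlier.shape = NA and fill = NA are the two arguments discussed above.

```r
library(ggplot2)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)

p <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns)) +
  # Gray points sit in the background instead of using transparency.
  geom_jitter(width = 0.3, color = "gray") +
  # outlier.shape = NA hides the box plot's own outlier points, which would
  # otherwise duplicate points already drawn by geom_jitter().
  # fill = NA makes the boxes transparent so the points show through.
  geom_boxplot(outlier.shape = NA, fill = NA)
p
```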
Sometimes it's weird to have the data frame have the same name as one of the columns, but it is what it is, right? So let's see. Ah, it reminded me, because I forgot. We'll do .groups = "drop". And that message will go away. I mean, it's effectively doing the same thing, right? It's getting rid of that last layer of the onion, if you remember my metaphor from last time. So now if we look at mean_of_means, we see for each taxonomic rank the average number of copies that we're seeing across all the taxa there. So again, I would like a point at each of these values across my ranks. And I'm going to remove the geom_boxplot. I kind of like this look with the color gray, thinking of it sitting in the background. What we'll do is add geom_point, right? And if we ran geom_point with the current aes, the current aesthetic values from up here, then it's going to plot all those x and y values, which you see as a column of points, whereas geom_jitter is the same thing as geom_point, except that it jitters the x position. So I don't really want to do what I have here. I want the geom_point to come from my mean_of_means data frame. So one thing to note about how I have this written here is that these first two lines are really the same as doing this, right? Where I say data = rank_taxon_rrns, right? And again, we get the same output. I like to pipe this in, because sometimes I might want to modify that data frame before I start plotting it. And so it's easier to have it outside of that ggplot line. But I want you to look at the structure that we have here, because we can use this structure for any of our geoms to set new aesthetics and to set a new data frame. So what I can do now is say data = mean_of_means, and then aes, right? And x will be our rank, and y will be our mean_mean_rrns, right? And I think that looks good.
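Here is a minimal sketch of the two steps just described: summarizing to one mean per rank, then handing that second data frame to its own geom. The data are simulated, and rank_taxon_rrns, mean_rrns, mean_of_means, and mean_mean_rrns are assumed spellings reconstructed from the spoken names.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)

# One row per rank: the mean of the per-taxon means.
# .groups = "drop" returns an ungrouped data frame.
mean_of_means <- rank_taxon_rrns %>%
  group_by(rank) %>%
  summarize(mean_mean_rrns = mean(mean_rrns), .groups = "drop")

# The jitter layer uses the data piped into ggplot(); the point layer
# gets its own data frame and its own aesthetics.
p <- rank_taxon_rrns %>%
  ggplot(aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = 0.3, color = "gray") +
  geom_point(data = mean_of_means, aes(x = rank, y = mean_mean_rrns))
p
```

Piping the data frame in is equivalent to writing ggplot(data = rank_taxon_rrns, aes(...)); the piped form just leaves room to modify the data before plotting.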
And what we see now is that we get a black point in the middle of our columns, right? And we could do the same type of thing where, yeah, we can modify other attributes like color, right? And so here we modified the color of the jitter to be gray. Well, we can give a different color to our black point here and say color = "red", right? And so now we'll have a red dot. Let me break this up a little bit to put it on different lines. Perhaps we'd like to say shape = "star". I just learned recently that you can name the shape without actually having to give it a pch number. And so I wanted, I guess, maybe asterisk. Asterisk, asterisk. See how that works? So that kind of works. I think I really like the default solid circle. Maybe what I'll do is make size = 2 to make that point a little bit larger, okay? So that's another option. And again, what we're seeing in this example, of course, is that we can have a different data frame that we're plotting, along with a different set of aesthetics that we're going to be using here. And one thing I want to experiment with: if I remove that x = rank, it also works, right? Because it's going to take that x = rank from the previous aesthetic down to this one. Now, an argument that we could use, if we need it, would be to do that ignore.aes = TRUE. Let me look back at geom_point. And this is one of those arguments where you wonder, what does that even mean, right? So inherit.aes, okay. inherit.aes = FALSE. Right. And so what this means is, don't bring down the aes values from the ggplot line into this one. And so this should complain... inherit.aes. And so it complains, because it's missing the aesthetic x, right? So you could say x = rank, y = mean_mean_rrns, and we get that. So we don't really need this inherit.aes here. I wanted to show it here.
That way you know that perhaps you might set an aesthetic up here that you don't want brought down here, right? So maybe I could do, you know, size = mean_rrns. This is going to look really bad. So let me get rid of this. Hold on, this is going to look horrible. So the size of the points is related to the number of copies, right? And that actually doesn't seem to be affecting the size of my mean value too much. Oh, and that's because I have size = 2 here. So if I remove that, then I think my red circles should also grow. So now it complains, because it says object 'mean_rrns' not found, right? And so because I'm giving it a data frame here, data = mean_of_means, it doesn't have a column mean_rrns, right? So again, this would be a perfect place to put in inherit.aes = FALSE. Okay. Right. And so now my circle is, you know, the same size for all values of that average, right? Of course, this looks really ugly. Even I, with my bad taste, know that that looks bad. So I'll go ahead and remove that size, and run this. And it looks good. I'll go ahead and leave that inherit.aes = FALSE, because you never know what might happen down the road. It'd be good to turn that off, so we're not worried about getting those errors.
Okay. So again, that's putting a point there. What if we put a line across, right? So I'm going to show you that we could put three geoms together. Let's do geom_line, right? And we'll do... actually, I think it will inherit from geom_point. So if we run that, then that looks weird. So x = rank, y = mean_rrns. Yeah, it's not really doing what I expected. So I think what it's doing is going from the rank to the mean_rrns, but it's still using this global declaration at the top of ggplot. So we probably do need all this.
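The inherit.aes behavior can be sketched as follows, on simulated data with assumed spellings of the spoken object names. A size mapping set on the ggplot() line is inherited by every layer, which fails for a layer whose data frame has no such column; inherit.aes = FALSE stops the inheritance. The builds() helper is hypothetical, added here only to check whether a plot can be built without an error.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)
mean_of_means <- rank_taxon_rrns %>%
  group_by(rank) %>%
  summarize(mean_mean_rrns = mean(mean_rrns), .groups = "drop")

# size = mean_rrns on the ggplot() line applies to every layer by default.
base <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns, size = mean_rrns)) +
  geom_jitter(width = 0.3, color = "gray")

# This layer inherits x (fine) and size = mean_rrns (not fine, because
# mean_of_means has no mean_rrns column), so building it is expected to fail.
broken <- base +
  geom_point(data = mean_of_means, aes(y = mean_mean_rrns), color = "red")

# With inherit.aes = FALSE nothing comes down from ggplot(), so the layer
# must spell out both x and y itself.
fixed <- base +
  geom_point(
    data = mean_of_means,
    aes(x = rank, y = mean_mean_rrns),
    color = "red",
    inherit.aes = FALSE
  )

# Hypothetical helper: TRUE if the plot builds, FALSE if building errors.
builds <- function(plot) {
  tryCatch({
    ggplot_build(plot)
    TRUE
  }, error = function(e) FALSE)
}

builds(broken)
builds(fixed)
```

Setting size = 2 as a constant on the point layer would also mask the problem, since a constant set on a layer takes precedence over an inherited mapping for the same aesthetic.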
So if I copy this down as an argument to geom_line and run that, it's now complaining because it's like, you only have one thing in each group. And so what we can do to get rid of that would be to say group = 1, run that. And now we have a line connecting our red points on top of our jittered data, right? We have three geoms here. That's pretty slick. Let me go ahead and remove the geom_point though, because I don't know that I really like that. And so that again shows the line on top of our different taxonomic ranks. And again, we could do things like size = 2 to make that line a little bit thicker. What happened? I forgot a comma. And so we got a thicker line there, right?
I think what I'm ultimately going to like is what we'll do next, which will be to put a segment across the cloud to show the average value. So what we'll use is geom_segment. And geom_segment is going to require two sets of variables. It's going to require an x and an xend, and a y and a yend. And so the x is the starting point of the segment, and xend is the ending point of the segment. So I'll want something from here to here, and then something from here to there, right? And then with the y, it'll allow you to make, you know, a diagonal line or a horizontal line or a vertical line. So my yend is going to be the same as my mean_mean_rrns. And let me go ahead and put in some line breaks to make it easier to read. And for my x, let's see what happens if I put rank for both of them. Not much happens there, right? And that's because it's starting and ending at the same spot, right? And so we need to give it some separation. But I notice my width is 0.3, so if I do rank minus 0.3 and rank plus 0.3, then hopefully I can get that separation. So that doesn't work. It complains, because plus and minus aren't meaningful for factors. Rats. So what we want to give here instead is a number that corresponds to each taxonomic rank, right?
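A minimal sketch of the line layer, again on simulated data with assumed spellings of the spoken names. With a factor on the x-axis each rank forms its own group, so group = 1 is what tells geom_line to connect all seven means with a single line. The episode uses size = 2 for the thickness; this sketch uses linewidth, which assumes ggplot2 3.4.0 or later.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)
mean_of_means <- rank_taxon_rrns %>%
  group_by(rank) %>%
  summarize(mean_mean_rrns = mean(mean_rrns), .groups = "drop")

p <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = 0.3, color = "gray") +
  geom_line(
    data = mean_of_means,
    # group = 1 puts all seven ranks in one group so they get connected.
    aes(x = rank, y = mean_mean_rrns, group = 1),
    color = "red",
    linewidth = 2, # assumes ggplot2 >= 3.4.0; older versions use size
    inherit.aes = FALSE
  )
p
```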
So we have kingdom, phylum, class, family, order... kingdom, phylum, class, order, family, genus, species. So we have seven layers, right? So we can create a vector with 1:7. So if I come down to my console here and do 1:7, I get numbers from one to seven. If I do 1:7 - 0.3, I get starting positions. If I do 1:7 + 0.3, I get that spot plus 0.3. And so the axis is showing the names, but it's really 1, 2, 3, 4, 5, 6, 7, right? So we can do 1:7 - 0.3. And we can also do plus 0.3. And now we see that we have our red segment. Something that we can also do as an argument for geom_segment would be lineend. And we can say "round", because it has these squarish ends. And again, I forgot my comma. And so that gives a rounded shape to my line segment. I kind of like this output more than what we've been playing with previously.
A couple of things we might do to clean this up. So I'm going to go ahead and pull this out to pipe that in. Run it every time I change something to make sure it works. And, you know, if I want to play with this jitter, I could say jitter_width equals, say, 0.3. And then instead of width = 0.3 I could say width = jitter_width, and then minus jitter_width and plus jitter_width. I could also say n_ranks and get that number, so instead of being one to seven, I could get all the ranks, right? And so I could then say, well, this is really the number of rows in mean_of_means, right? So I could say n_ranks is nrow of mean_of_means. And so n_ranks is seven, right? And so if I do 1:n_ranks, I get my one through seven, right? So here I can do n_ranks. And the reason this would be relevant is because here I'm not including the strain level, right? Well, maybe I want to go back to the strain level. And instead of having to change the seven to eight and think about all these things, this will automatically get it for me.
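Putting the segment pieces together, a sketch along these lines should work. The data are simulated and the object names (including jitter_width and n_ranks) are assumed spellings of the spoken names. The thickness of the segment is not stated in the episode, so linewidth = 2 is an assumption made here to keep the rounded ends visible, and it requires ggplot2 3.4.0 or later.

```r
library(ggplot2)
library(dplyr)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)
mean_of_means <- rank_taxon_rrns %>%
  group_by(rank) %>%
  summarize(mean_mean_rrns = mean(mean_rrns), .groups = "drop")

# Defining these once keeps the jitter and the segments the same width
# and avoids hard-coding the number of ranks.
jitter_width <- 0.3
n_ranks <- nrow(mean_of_means)

p <- rank_taxon_rrns %>%
  ggplot(aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = jitter_width, color = "gray") +
  geom_segment(
    data = mean_of_means,
    # Factor levels sit at the integer positions 1, 2, ..., n_ranks, so the
    # segment ends are given as numbers rather than as rank +/- a width.
    aes(
      x = 1:n_ranks - jitter_width,
      xend = 1:n_ranks + jitter_width,
      y = mean_mean_rrns,
      yend = mean_mean_rrns
    ),
    color = "red",
    linewidth = 2, # assumed thickness; not stated in the episode
    lineend = "round",
    inherit.aes = FALSE
  )
p
```

Note that the colon operator binds more tightly than minus, so 1:n_ranks - jitter_width is the whole sequence shifted left, not a sequence ending at n_ranks - jitter_width.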
And if I want to do, say, 0.2 and rerun this whole chunk, then it'll be narrower and my segment will be the right width. And if I do, say, 0.5, it'll be wider, and my segment will be the same width, right? So I think 0.3 was really what we liked. And I'll go ahead and build that out. And so that looks good. Excellent. So one last thing I want to do, because this kind of annoys me, is these labels across the bottom. So I can do scale_x_discrete. I always kind of forget the actual arguments. I think it's breaks and labels. Let's try that. Breaks equals kingdom, phylum, class, order, family, genus, species, right? And my labels, I'll capitalize them, right? So it's the little things that annoy me, like capitalization. And there's probably an easier way to do this title case, probably with a function of some type. But I think this will serve us pretty well. And so we see that, yeah, I guess the arguments are right. Breaks and labels for scale_x_discrete change the names that I have here on my x-axis so that they're all capitalized. Good. I'm going to go ahead and save this.
Tell me what you think down in the comments below. Which of the four different ways did you like better? So overplotting with the box plot was option one. Option two was putting a dot on top of the cloud to show the average. Option three that we looked at was using a line, or a line with a point, perhaps. And option four is what we see here with geom_segment. So tell me down below what you think looked better. Next time, we'll talk about a different approach to looking at these distributions. Because we don't really have a sense, well, I guess we did from the box plot, of the shape of the distribution. So next time, we'll talk more about how we can look at the shapes of these distributions beyond just the average. For now, I think we're in good shape. I'll do git status. I'll add our exploratory 2020 5. Yep. And I forgot to knit it actually.
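The relabeling of the x-axis described above can be sketched like this, on simulated data with assumed object names. The labels are typed out by hand in the episode; building them with toupper() and substring() is just one possible way to do the capitalization with a function, not necessarily the one hinted at.

```r
library(ggplot2)

# Simulated stand-in data, for illustration only.
set.seed(1)
ranks <- c("kingdom", "phylum", "class", "order", "family", "genus", "species")
n_taxa <- c(2, 5, 8, 12, 20, 40, 80)
rank_taxon_rrns <- data.frame(
  rank = factor(rep(ranks, times = n_taxa), levels = ranks),
  mean_rrns = rpois(sum(n_taxa), lambda = 4) + 1
)

# Capitalize the first letter of each rank name.
rank_labels <- paste0(toupper(substring(ranks, 1, 1)), substring(ranks, 2))

p <- ggplot(rank_taxon_rrns, aes(x = rank, y = mean_rrns)) +
  geom_jitter(width = 0.3, color = "gray") +
  # breaks are the values as they appear in the data;
  # labels are what gets printed on the axis, in the same order.
  scale_x_discrete(breaks = ranks, labels = rank_labels)
p
```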
So let's knit our whole document. That works. We'll go ahead and close that. And if I come back here, let me Control-C out of that, git status. Yeah, so I should have noticed that before. All I had changed was my Rmd document, but now I've updated my image as well as my markdown file. And I can then go ahead and git add on my exploratory 2020 1005 plus star so I can get all those files, and then git commit, "add line segment to cloud of points to indicate average, closes #29". And perhaps I could also have improved my subtitle a little bit to show that the red horizontal bar indicates the average, but I think that's good enough. Again, we're still doing exploratory data analysis. We don't want to go overboard with making our plots look too pretty, right? So I'll go ahead and git checkout master, git merge issue 29, and push. And then we'll look up here on GitHub. And we'll see pretty soon here that this issue has closed. And we're in good shape. And we can come back to our code, to exploratory. And if we then look at that, the code is still a jumbled mess. But this is what our plot looks like on the website. And I think that looks really nice. One thing I noticed is that the resolution here is kind of crummy. But again, for exploratory data analysis, I think this is great. And again, iterating through these different ideas gives us a sense of different ways that we might want to represent the data when we go to write a paper. The other thing that we see is that iterating through these different ideas really quickly shows the power of ggplot: here in just, you know, 20 or 30 minutes, I was able to generate four or five different plots really quickly to get a sense of how things worked. So again, feel free in the comments down below to tell me what version you like the best. This is what I prefer, but that doesn't mean it's the best way or the only way.
See whether, in the work you're doing, there are places where you would want to layer different geoms on top of each other, or, like we did in this example, layer different data frames on top of each other with different geoms. So hopefully this exposed you to something a little bit different from what you've seen in the past. And again, I'd love to hear what you're doing with the material that we're covering here in Code Club. Keep practicing, tell your friends about it, and be sure that you like this video and subscribe to the channel, so you know when the next episode is released. Talk to you next time.