 Hey folks, I'm Pat Schloss and this is Code Club. In recent episodes, we've been looking at different strategies for depicting the relative abundance of various microbial populations in different disease status groups from data that was in a paper that I published a number of years ago. I don't quite remember when. Anyway, we've been using these data to look at things like stacked bar charts, pie charts, heat maps. Well, I want to go back to those stacked bar charts because two of the critiques that I had of stacked bar charts is that you can't see the variation in the data for those individual taxa. Again, this is a problem with all of these approaches that we've looked at so far. The other problem is that you don't have a good sense of the number of observations that are being aggregated together if you're depicting the average in a stacked bar. Again, it's a common problem also for what we saw with the pie charts as well as the heat maps. So in today's episode, we are going to take on those two critiques and see if that gets us any closer to being fans of stacked bar charts. To do that, we're going to use geom call from ggplot2 as well as facet wrap and facet grid from ggplot2 so that we can see different panels for each of our disease status groups with individual subjects arrayed across the x-axis and the relative abundance on the y-axis. I strongly encourage you to follow along with me today so you can get the most out of this material. I don't think I'm that entertaining that you just want to watch me code. I think you want to engage the material yourself. So if you want the code that I'm starting with down below in the description is a link to a blog post that's associated with today's video where you can get the code that I'm starting with. Also across the top here is a link to a video where I show how you can install our studio, get the data, install the tidyverse and I think that's about it. Also, if you like this type of material, be sure to check out the link down below where you'll see information on some workshops that I teach that use tutorial materials that are at the riffamonus.org website called general r and minimal r. The minimal r uses microbial ecology data like this, whereas the general r deals with data that are not microbial. And know that I use those materials as the basis for three-day workshops and I have a number of those coming up over the course of the summer and would love to have you join me. Digging into this r script that I have, you'll see that we are loading a variety of libraries from the tidyverse to get all that great functionality available to us. We're reading in metadata information about the sample idea or the subject, as well as their disease status. These are data that are coming from a C. difficile study. So if you haven't already, you'll hear me talk about diarrhea way too much. Anyway, we then read in our O2U count so we know how many times we see each taxa in each sample. We know the taxonomy from the taxonomy file. And then we join all that together. And then finally, we do some styling to get data only at the phylum level, something we've seen in previous episodes. If we look at two finest scale sake of a genus level, then there's just too many little tiles in our stacked bar chart to make it decipherable. So we'll go about looking at the phylum level, aggregating to get the mean relative abundance, and then we do some styling to insert markdown so that when we use element markdown in our theme, we'll get that nice actualization. Next, we go ahead and we pool our data that comes from Fila, where the highest abundance or average abundance for each of the disease status groups is less than 3%. This is useful, again, for getting rid of kind of the shrapnel Fila, if you will, Fila that have facets that would just be way too thin to actually be able to see and just kind of adds to the number of colors in our color palette, making it harder to distinguish between different taxa. So this will give us four Fila, as well as an other category, and that's pretty manageable for us to decipher between those five different categories. Finally, we join this all together, and we then build out the stacked bar chart. Let me go ahead and build this so we know that it works. Got our old friend, our stacked bar chart, each of these three columns represents the average relative abundance of the Fila across the individuals who are healthy, those who have diarrhea but are C-deficial negative, and those who have diarrhea but are C-deficial positive. Again, we are masking the total number of individuals in each of these three columns, as well as the amount of variation in the relative abundance of each of these Fila. The first thing that I'm going to do as we kind of march towards making separate panels for each of the three disease status groups is to go ahead and refashion this so that across the x-axis we put our various subjects that are in the study. We'll come back later and then separate that back out again by disease status. Coming back to our code, we want to remove that aggregation step where we aggregated within each of the three disease status groups. So we'll go ahead for now and turn off the scale x discrete. I'm going to comment it out because we'll probably want some of that code later. I'm going to come all the way back up here and let's look at taxon relabund to see what that looks like. So we've got our disease status, our taxon, and the mean relative abundance. So here is where we are losing our sample ID. I will go ahead back up here and kind of step through this. So I see that we've got sample ID here as a group by variable that we then calculate the total relative abundance for all the otus, all the individual tax that are part of a phylum. Let's go ahead and run this and see what the output is giving us. And so we see that we have the disease status, the sample ID, the taxon, and the relative abundance. This next step is where we aggregated by disease status and taxon, and then we got the mean relative abundance across all those. So I'm going to remove these two lines and put my hundred times up here to get my percent relative abundance. And so now if I look at taxon relabund, I now see that I've got my disease status, my sample ID, my phyla, and the relative abundance. Turning now to the pool step, let's go ahead and run this to see what kind of output we get. And it's complaining input pool mean relabund not found. And that's because we defined relabund above instead of the mean relabund, because again, we're looking at the individual level in taxon relabund. So we'll look at this and we see now that we've got our phyla, the taxon column, pool, whether or not things should be pooled, and then the mean. And what we're finding is that something like tinobacteria, where the mean is quite less than 3%, 0.659% is not going to get pooled. And again, that's because it is saying, do any of the samples have a relative abundance of a tinobacteria greater than 3% rather than do any of the disease status groups have an average greater than 3%. So let's come back up here. And again, we'll look at taxon relabund, where we've got our disease status or sample ID of taxon and relative abundance, we can then group by our tax on but also our disease stat tax on. And I'm going to hold on to that summarize statement for a moment. And we'll do summarize. And I will do mean equals mean on relabund. And if we run those lines, we now see that we've got our disease status, our tax on and our mean. See that we see there's 39 total rows. And that's because there's 13 phyla and three disease status groups. So that's good. I'll go ahead and add dot groups equals drop. We can then group this by tax on to get the relative abundance across all samples within each of the three disease status groups. So we will then summarize pool equals max of mean not relabund. So again, if the maximum mean value within each of the three disease status groups is less than 3%, then we're going to pool that phylum with all the other file that are pretty rare. And then the mean is going to be the mean of the mean, we'll use mean to determine the order of the tiles in our stacked bar plot. And so we'll do that based on the average across all three disease status groups. Again, looking at tax on pool, we see that we have our taxa, whether or not they're going to be pooled, and then the average relative abundance across the three disease status groups. And then we've got 13 phyla there. So that checks out with what we've seen in previous episodes. So again, we'll load tax on pool, we'll then bring that into our inner join where we're going to join our tax on relevant at the individual level with the information about whether or not to pool, and the order of the different taxa in our in our stacked bars, we're going to join that by the tax on, we'll then mutate that so that if if it says pool, then we're going to rename it other. Otherwise, we're going to use the actual name that we defined up above. We're going to group by disease stat and tax on, we actually also want to group by sample ID, so that we're keeping things at an individual level. What we're doing in this group by actually is that we're grouping to aggregate together those new taxa that were re labeled as other, and that we need to aggregate those together. And so we'll call that relevant, and we're going to sum the relative relevant column for those, those rows or those phyla that were too rare. And then we're going to retain our mean column by reporting back the mean and that should be the same, I think across those three disease status groups. And again, if we look at this output, we can see that we've got our sample ID disease status, the taxa, the relative abundance, and then the mean, again, for the phylum across all three disease status groups in this next mutate chunk, chunk, we can then set the order of the phyla according to that mean column. And we do this fct shift, so that we've got our fremicides at the bottom and the bactroid eddies at the top. Again, that doesn't really change much in the output. But we can now feed this into ggplot. And instead of on the x axis putting disease stat, we're going to put sample ID, y is going to be the the relevant, not mean relevant, we change the name. And we're going to use geom call. And we're going to use all our other colors. I'm just kind of quickly look through here and make sure everything looks good. So we'll go ahead and run this. And voila, we have all of our samples across the x axis, each individual is a separate column, it's impossible to see here. So what we want to do now is kind of pull this apart into the different disease statuses. Before we move on to that, though, I want to clean up this x axis, you can't see it down here because there's like 350 sample IDs, but there's 350 sample IDs smushed together here, as well as 350 or so tick marks. So let's go ahead and turn those off, we can modify the axis back up here and axis text x, instead of element markdown, I'll do element blank. And that means don't show anything on the x axis. And then we'll do also axis dot tick, ticks dot x, I'll do element blank as well. Yeah, so now we have that clean x axis edge, we don't have all those sample IDs. Another thing that you can kind of see in here are random striations in the, in the columns. If you were to zoom really far in, you would notice that there's actually space between each of the columns in our stacked bar plot, we saw that earlier in the one where we only had three, right, there was natural spacing in there. And that's what we're seeing in these random vertical white lines. I can turn that off by coming back here up here to GM call and do width equals one. So now we got rid of those vertical white lines, things look a lot better, although they kind of still look crummy. We've got our sample IDs arrayed across the x axis, but I'd like to pull things apart by disease status. We can do that by adding facet wrap. And so I'll do facet wrap tilde disease stat, and row equals one. So this will make one row with separate facets for each of the three disease status groups. Try add my plus sign there. And now what we've got is we've got three facets, but the x axis for each of the three facets is the same, right? It goes from, you know, subject one to subject 350 for each of the three facets, I can turn that off. However, back up here in facet wrap by doing scale equals free underscore x. And so what that means is let the x axis be free to only include the data or the range of the data that's needed to represent the data. Now what we see is that we have three facets, we don't have extra white space in the three facets because we've we use that scale equals free x argument. Something I do notice, however, is that I'm pretty confident that I have a lot more subjects in this non diureal control healthy disease status group than I do in either of the diureal control or cases. What facet wrap is doing is it is showing the facets as the same width. I can change that by using facet grid. And using an argument from facet grid. So we'll turn off and row because facet grid doesn't use and row, because by default facet grid put something in the columns and something else in the row. So here we're putting disease status in columns, but we don't we only have one row. So we don't need that. And then we can also do space equals free. What this does is it allows each of the facets to be proportional to the number of samples or the you know, the range of the x axis in each of the three facets. And sure enough, we now see that our nine diureal control is about the same width as the diureal control in cases together. And so we can now see that we have more dot non diureal control samples than we do in the other. So we've kind of taken care of that n issue that I was worried about earlier. Something else I'd like to do is clean up the order of my subjects. They're not random here, but they're in order of their subject ID, which is basically random. What I'd like to do is let's sort them by the relative abundance of the formicities. To do that, we'll come back up to the top here and create a new data frame that we will build, where we show the order of each sample, each subject by the amount of formicities that they have. So I will do tax on relevant bund filter tax on equals and then star for micutes. So remember, we put stars around the filenames so we can get that italization in the legend with element markdown. And what we see then is that we've got our disease status or sample ID, tax on and relabund. I now want to arrange this data frame by relabund. And I'm going to do a descending sort. So DESC on relabund. And so now what we've got is our samples, sample IDs, our taxa and the relative abundance, as well as their disease status, which I don't really care about. Next, I'm going to mutate to add a column to indicate the order. And so we'll do one colon n row. And then in parentheses the periods that n row period will say go from one to 350 or however or 338, the number of rows in the data frame. So now we'll get a column that has the order that we want from one to 338. So I'll call the sample order. And I will also clean this up a little bit because all I need out of this is the sample ID and the order. I need to select. So sample order again has our sample ID and the order were in good shape. Okay, let's go ahead then and we're going to do another join to bring that data in. I'll do that down here. After we've already pulled together all our file, so I'll do inner join period sample order by equals sample ID. And then we see that we've got our sample, our disease status taxa relative abundance mean and the order. So that's all good. We then need to modify that to make it a factor that we then order by the order column. And we can then do mutate sample ID to be a factor on sample ID. Okay, and we can then do sample ID, FCT reorder on sample ID by order. And that'll be good. And then pipe that into ggplot and everything else should work. Excellent. Now we have our individual facets. The samples within each of those is ordered and decreasing order of abundance by the relative abundance of the formicities, showing you the amount of variation that we have here. Does this look good? We'll come back to that in a moment. The last thing I want to do with this figure is to bring these panels down to the bottom and make them x axis labels. To do that, I'm going to come back up to facet grid and do switch equals x and quotes. And that means switch the position of the x axis facet label. So the columns, right. And so now we see we have our facet titles across the x axis on the bottom. In our theming, I'm going to do strip dot background element blank. This gets rid of the border that we had around our facet. But one thing you might wonder like why are these underlined? Those aren't underlines. That's actually the x axis. And so what we have is we have the facet labels on the inside of the plot. We need to change the position to be the outside of the plot. We can change that placement or position with strip dot placement equals outside. And now we've got our label on the outside below kind of looking like an x axis label looking pretty nice. And what we now need to do is clean up and make our labels look pretty. And I'm going to come back up here and make a vector that I'll call pretty so that we can use that with our label or function that we saw in a previous episode. And I had code down here that I commented out. So I'm going to grab this and bring that back up to pretty. Because I'm going to kind of recycle some text here. And we've got all this. And we're going to create a named vector. And so non-diarrheal control will equal healthy. Diarrheal control will equal diarheal, diarrhea, cdiff negative. And then case will equal diarheal, cdiff positive. All right. And clean up our code a little bit here. Make sure that pretty is loaded. And then down here in facet grid, we need an argument that we'll call labeler equals function labeler. And then we'll say disease stat equals pretty. And we now have our labels in but we now need to format our label and we can format our label by doing strip dot text dot x equals element markdown. And voila, we have our nice pretty labels across the x axis. I think this looks pretty nice. I mean, about as nice as we're going to get it, I think we do now have our individuals for each of the three disease status groups kind of indicating the end. There's a lot here. And you can kind of tell there's a lot there because the individual columns, the individual stack bars are pretty narrow. And we now see, again, because of the relative proportions that there's about twice as many healthy subjects as there are for these two diarheal groups, we can also get a sense of the variation in the data. Now, again, challenge with this depiction of the data is that I can really only mentally gauge to phyla here. I can't really get a sense of the proteobacteria. And frankly, I really can't get a sense of the bacteria deadies in these two diarheal groups, because, you know, there's some that have lots of proteos, and some that have lots of bacteria deadies to counterbalance what's in here for the formicities. That's a challenge, right? If we were to go down to like the genus level would be even worse, because it'd be far more wedges, or far more tiles per stack here, and it would just get a lot more complicated. Also, you're kind of asking your audience to do a lot of interpolation across all the data. And I don't know how well I can do that. I can see there's a lot more variation, I think consistent variation in formicities among people that have diarrhea than there is perhaps among people that are healthy. Sure, there's some people that have a lot more than, you know, more than 50%. But that's only, you know, maybe 10%, 5% of all the healthy subjects. Whereas, you know, perhaps half the people with diarrhea have more than 50% formicities. So, I think we have gained something in terms of the depiction of the number of subjects, as well as the variation. But, you know, what more can we say than kind of an individual phylum level, you know, basically formicities? I don't know, right? We're kind of stuck here. I think we've gotten a lot out of this, we've learned hopefully a little bit about how we can use facet wrap and facet grid to make a pseudo x axis label, if you will, and to kind of create three different chunks of three different facets of these stacked bars, where we can we can represent the individual subjects. So I'm happy with this how this looks. I'm not really happy about the visual itself. It wouldn't be my preference. But we're getting there, we're getting there, I will show you in the next episode, how we are stepping episode by episode to what I consider to be the best approach. We also need to remind ourselves frequently that there is no one perfect way to visualize data. There are constraints that are placed on us, right? The type of data we have, the quality of data we have, what our boss wants, what our audience wants, those are all constraints that will impact what you do, right? That my choice sometimes doesn't matter, right? The other thing to keep in mind is that the question varies, right? And that there might be questions where a faceted stacked bar chart like this could be what you need to display the data. And perhaps you only have two filers. Well, this would be perfect for that then, right? That could that could be really nice. I don't know if it'd be perfect, but it would be great, right? So again, we've got to keep all these different factors in mind, as we're evaluating critiquing different visuals. And so thinking about the question of, you know, how does phylum level diversity vary across these three disease status groups? This doesn't help me very much, right? I can do it at the fermicities level, but not really at Bacteri Dettis or the Proteobacteria so well. I really hope you enjoyed this and got something out of this. Even if you don't like the actual visual, I think you can learn a lot, right? About faceting, about factoring, about labeling. Again, we've seen this several times now in different contexts, these different kind of muscle flexes, if you will. And so hopefully you're learning that and getting more experienced with things like factors, factor reorder, factor shift, all these different things we can do to change the order of things as well as how we can use and navigate between facet wrap and facet grid to get different types of effects. Anyway, I'll leave it there. Please tell your friends what we're doing here in Code Club. Also, do check out those workshop materials that I have listed down in the description. I'd love to have you check them out. You're free to go through them on your own. People also get a lot out of coming to the virtual workshops that I do via Zoom. Hopefully my banter is a lot like this. And we do have fun with those and people tell me they get a lot out of them. Anyway, definitely check those out and we'll see you next time for another episode of Code Club.