I absolutely loathe stacked bar charts, for many reasons. Sometimes, though, you don't really have a choice, because your boss or your audience expects you to make them. Is there a way to make them better and overcome some of the limitations of the format? Definitely. And we'll do just that in today's episode of Code Club.
Hey folks, I'm Pat Schloss, and yeah, I really don't like stacked bar charts. But you know what? Sometimes you don't have a choice. Maybe your boss or your PI expects a stacked bar chart because that's a format they like for whatever reason. Or perhaps your audience expects one. In the microbiome field, everybody seems to love using stacked bar charts. Why? I don't know. I wish they would watch these videos so they could see why it's not such a great format.
There are a few reasons why stacked bar charts aren't a great format. Looking at the figure that we made in the last episode of Code Club, there are two things that I really want to highlight in today's episode. First, there are 13 different phyla being represented, and it's hard to keep track of which color corresponds to which phylum. With 13 colors arrayed across the rainbow, it's also really difficult to pick colors that are distinct enough from each other that you can tell them apart in the bar chart. Another big problem with stacked bar charts is that you often don't have a common basis for comparison. I call this an anchor on the y-axis. Yeah, the bottom group does have an anchor at zero, but that might not be a group that we're all that interested in. Is there a way that we can limit the number of colors, limit the number of groups, and provide more than one anchor, so our audience has something to compare the size of those bars against? Well, let's go over to RStudio and see what we can do about all this.
Here we see a chunk of code that you can get a copy of down below in the description, where there's a link to a blog post that accompanies today's episode. I strongly, strongly, strongly encourage, maybe insist, that you get a copy of this code so that you can work alongside me in today's episode. There is no substitute for working with the code on your own when you're learning this material. You can watch all these videos that I'm producing and really not learn much at all unless, unless, unless you work with the code. Anyway, the code link is down below. Also, up above, there's a link where you can get instructions on how to install RStudio, get the tidyverse package, and get the raw data that I'm working with here. It's really critical so that you can follow along and get the most out of these episodes.
All right, so we have this chunk of code. We load our libraries, including ggtext so that we can get that nice italicization of our phylum names. We read in the data and join it all together, which we did a few episodes back when we talked about dplyr. In this chunk of code, we take our otu_rel_abund data and filter to the phylum-level taxonomy. We aggregate those relative abundances so that we know the average representation of each phylum in each disease status group. We've got three disease status groups, if you haven't been following along: people who are healthy, people who have diarrhea, and people who have diarrhea and also an infection with Clostridioides difficile. We then clean this all up and build a stacked bar chart, using some styling to fix the labels and get our nice italicization in the legend. Let's go ahead and run all this to make sure it works. The code runs and produces the figure that we're expecting.
There are a couple of things that I would like to do in today's episode. First, I'd like to further aggregate my phyla using a threshold, so that if a phylum is not over that threshold in any of the three disease statuses, I pool its relative abundance into a separate category that we'll call Other. I think a lot of these phyla are really rare, so we can't even see them in the figure. That will also solve another problem by giving us more distinct separation between the colors in our palette. The second thing I'd like to do is think about which phyla we put at the bottom, along the zero of the y-axis, and which we put at the top, along the 100% line.
The first thing I'm going to do is break up my pipeline. I'm going to take the ggplot off the end of this. We'll save it for later, but for right now I want to work with otu_rel_abund. Looking at the output, we have our disease status, our formatted taxon, and the mean relative abundance. Let's go ahead and save this as taxon_rel_abund. That is good. And of course we can then take taxon_rel_abund and feed it into ggplot to get the same plot that we had earlier.
That's not what we want to do, though, because we want to pool the phyla that are more rare. Let's first see what the distribution of our different phyla looks like. I'll group by taxon so that we look across the three disease status groups. Then I'll do a summarize, because what I want to see is the maximum relative abundance of each phylum across the three groups. So I'll do max = max(mean_rel_abund). We get that output, but it's a little bit jumbled, so I like to look at it with arrange(desc(max)). That shows us our phyla ordered by their maximum relative abundance across the three disease status groups.
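Here is a minimal sketch of that group-and-summarize step. It assumes a toy taxon_rel_abund table: the numbers and the disease_stat labels are invented for illustration, and taxon_max is just a hypothetical name for the result.

```r
library(dplyr)

# Toy stand-in for taxon_rel_abund; every number and the disease_stat
# labels are invented for illustration, not taken from the real data.
taxa <- c("*Firmicutes*", "*Bacteroidetes*", "*Proteobacteria*",
          "*Verrucomicrobia*", "*Fusobacteria*", "*Actinobacteria*")

taxon_rel_abund <- tibble(
  disease_stat   = rep(c("healthy", "diarrhea_cdiff_neg", "diarrhea_cdiff_pos"),
                       each = 6),
  taxon          = rep(taxa, times = 3),
  mean_rel_abund = c(50, 40,  4, 3.3, 1.5, 1.2,
                     45, 35, 15, 2.5, 1.5, 1.0,
                     48, 30, 18, 2.0, 1.0, 1.0)
)

# Largest mean relative abundance each phylum reaches in any disease group,
# sorted from most to least abundant
taxon_max <- taxon_rel_abund %>%
  group_by(taxon) %>%
  summarize(max = max(mean_rel_abund)) %>%
  arrange(desc(max))

taxon_max
```

Sorting by the maximum, not the mean, keeps a phylum out of Other if it is abundant in even one group.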
Proteobacteria, Firmicutes, and Bacteroidetes are the three most abundant phyla in our cohort. I'm tempted to go down to Verrucomicrobia, though I'm a little worried that its 3.3% might be too small of a rectangle to see. I'm pretty confident that Fusobacteria is too small. I'm going to require that everything has an abundance over 3%, and we can adjust later if that rectangle is just too small to see.
I'm going to modify my summarize to make a pool column: I will pool if the max relative abundance is less than three. I'll also add .groups = "drop". Let's look at that output. We see that we're pooling things like Actinobacteria, Campilobacterota, chloroplasts, and so forth. So we're in good shape, and I will save this as taxon_pool.
Now let's get some space here and put everything on one screen. We can do an inner_join of taxon_rel_abund and taxon_pool, joining by the taxon column. We now have our disease status, our taxon, the mean relative abundance, and whether or not we want to pool it. I'm going to use this pool column to indicate whether the taxon column should be changed to Other. Let's do that by piping this into a mutate, where we say taxon = if_else(pool, "Other", taxon). So if pool is TRUE we put Other; otherwise we leave the taxon as it is. Everything else is as before. Again, these names are already styled with our stars, so they should be italicized in our legend. Next, we'll go ahead and group by taxon, right?
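The flag-join-relabel steps just described can be sketched like this. It is a sketch under assumptions: the toy table uses made-up values and hypothetical disease_stat labels, and relabeled is a name introduced here for illustration.

```r
library(dplyr)

# Toy stand-in for taxon_rel_abund (made-up values, hypothetical labels)
taxa <- c("*Firmicutes*", "*Bacteroidetes*", "*Proteobacteria*",
          "*Verrucomicrobia*", "*Fusobacteria*", "*Actinobacteria*")

taxon_rel_abund <- tibble(
  disease_stat   = rep(c("healthy", "diarrhea_cdiff_neg", "diarrhea_cdiff_pos"),
                       each = 6),
  taxon          = rep(taxa, times = 3),
  mean_rel_abund = c(50, 40,  4, 3.3, 1.5, 1.2,
                     45, 35, 15, 2.5, 1.5, 1.0,
                     48, 30, 18, 2.0, 1.0, 1.0)
)

# Flag a phylum for pooling when it never reaches 3% in any disease group
taxon_pool <- taxon_rel_abund %>%
  group_by(taxon) %>%
  summarize(pool = max(mean_rel_abund) < 3, .groups = "drop")

# Attach the flag to every row, then overwrite the flagged names with "Other"
relabeled <- inner_join(taxon_rel_abund, taxon_pool, by = "taxon") %>%
  mutate(taxon = if_else(pool, "Other", taxon))

relabeled
```

At this point the table still has one row per original phylum; the rows labeled Other have only been renamed, not yet summed.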
Well, this is where we're actually going to pool, or aggregate, the phyla that we flagged for pooling. Sorry, we'll group by disease_stat and taxon, because otherwise we're going to be pooling across all of the disease statuses, and we want to keep each disease status separate. Then we do mean_rel_abund = sum(mean_rel_abund). We now have our three disease status groups, the four phyla plus the Other category, and their mean relative abundances. We can see that the Other category is 1.7, 1.76, and 2.11, so it's really small, and we're not really missing much by pooling all those other phyla into that category. Now we can pipe this into ggplot. Excellent. We've gone from 13 phyla to four phyla plus an Other category.
One thing that I like about this is that, by chance, Bacteroidetes is anchored along the 100% line of our plotting window. This makes it very easy to compare the relative abundance of Bacteroidetes across the three disease status groups. You could imagine putting something along the zero position on the y-axis as well, so it'd be easier to compare, say, two different phyla in the same figure. Perhaps we could put Bacteroidetes up here and Firmicutes down here, or vice versa. That way we'd at least be able to make easy comparisons for those two more abundant phyla.
Something to know about how ggplot creates this figure is that, for each column of stacked bars, it converts our taxon column to a factor and uses the alphabetical order of the names to set the order of the bars. You might be asking, well, how does Other come after Proteobacteria and Verrucomicrobia? Remember that these aren't really Bacteroidetes or Firmicutes. They're *Bacteroidetes*, *Firmicutes*, *Proteobacteria*, *Verrucomicrobia*, and then plain vanilla Other. The O comes after the star, so Other sorts last. We can reorder the factor back in our code.
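To see why the grouping has to include the disease status, here is a small sketch of the pooling step. The relabeled table is a hypothetical stand-in with made-up values and only two disease groups, and pooled, bar_totals, and across_groups are names chosen for illustration.

```r
library(dplyr)

# Hypothetical table after rare phyla have been relabeled as "Other"
relabeled <- tibble(
  disease_stat   = rep(c("healthy", "diarrhea"), each = 4),
  taxon          = rep(c("*Firmicutes*", "*Bacteroidetes*", "Other", "Other"),
                       times = 2),
  mean_rel_abund = c(55, 42.3, 1.5, 1.2,
                     60, 37.5, 1.5, 1.0)
)

# Pool within each disease group: the two "Other" rows collapse into one
pooled <- relabeled %>%
  group_by(disease_stat, taxon) %>%
  summarize(mean_rel_abund = sum(mean_rel_abund), .groups = "drop")

# Sanity check: every stacked bar should still add up to 100%
bar_totals <- pooled %>%
  group_by(disease_stat) %>%
  summarize(total = sum(mean_rel_abund))

# Grouping by taxon alone would also add the disease groups together,
# which is the mistake corrected in the narration
across_groups <- relabeled %>%
  group_by(taxon) %>%
  summarize(mean_rel_abund = sum(mean_rel_abund))

pooled
bar_totals
```

Checking that the bars still total 100 is a cheap way to confirm that pooling moved abundance into Other without dropping any of it.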
Again, I'm going to break up my pipeline here and do a mutate on taxon, setting taxon = factor(taxon). That applies alphabetical ordering, and we now see that our taxon column is of type factor. I'm also going to drop the grouping, because I don't want weird things to happen. We can then set the order of that factor by some other variable, so we could say taxon = fct_reorder(...). fct_reorder is a function that comes to us from the forcats package. Factors are notoriously painful to work with in R. They're a way of turning categorical data into ordinal data, which is what we're trying to do here: we're trying to set an order for our categorical variable. So we give it taxon and then some other column to order by. Well, I don't have a column to order on yet.
To get the order that I want for fct_reorder, I'm going to revisit my summarize up here for taxon_pool. I'll add mean = mean(mean_rel_abund), which gives me the mean of the average relative abundance for each phylum across the three disease status groups. Let's see what this looks like. We've got our phyla, whether or not to pool each one, and the mean. We then feed that into inner_join. We're getting errors, of course, because I ran too far down the pipeline. If we run just the join, we see that we have the mean. We then mutate where we pool, so now we've got Other, and then we group by. When we come to the summarize, we're going to lose that column, I believe. Yeah, so we need another line in the summarize that follows the inner_join. Let's do mean = min(mean), because the min is going to be the same as the mean for those big four phyla, and for Other it's going to be very low. And so then we see that we have Firmicutes, which was the most abundant of the four.
Then come Bacteroidetes, Proteobacteria, and Verrucomicrobia, and that repeats for each disease status. So we're in good shape. We can then give fct_reorder the mean column, and that should reorder everything according to the mean. We can check the order by piping this to pull(taxon) and then levels(), which gives us the levels in our taxon column. We've got Other, Verrucomicrobia, Proteobacteria, Bacteroidetes, Firmicutes. That's actually the opposite of what we want. To fix the order, we can give fct_reorder .desc = TRUE, which gives a descending sort of the factor levels. Now we've got Firmicutes, Bacteroidetes, Proteobacteria, Verrucomicrobia, and Other, so we're in good shape. We can pipe this into our ggplot, and we see Firmicutes, Bacteroidetes, Proteobacteria, Verrucomicrobia, and Other in that order.
If we wanted Firmicutes on the bottom, followed by Bacteroidetes, then we could have left off .desc. Let me show you what that would look like. We could set .desc = FALSE, which is the default, and now we get Firmicutes on the bottom, followed by Bacteroidetes and so forth, up to Other.
I'm going to go back and set .desc to TRUE. I'm also going to use another function from forcats, fct_shift. We give it taxon and n = 1. Again, we start with Firmicutes, Bacteroidetes, and so forth, but shifting with n = 1 moves the order along so that Firmicutes becomes the last level. That should put Firmicutes on the bottom and Bacteroidetes on the top. Voila, we now have Firmicutes on the bottom and Bacteroidetes on the top. Again, the nice thing about this is that we now have an anchor for our two most abundant phyla: Firmicutes along the bottom and Bacteroidetes across the top.
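The reordering described above can be sketched as follows, assuming a hypothetical relabeled table that already carries the across-group mean column from the join. All values are made up, only two disease groups are shown, and by_mean and shifted are names introduced for illustration.

```r
library(dplyr)
library(forcats)

# Hypothetical joined table: rare phyla already relabeled "Other", and each
# row carrying its original phylum's mean across disease groups (made-up values)
relabeled <- tibble(
  disease_stat   = rep(c("healthy", "diarrhea"), each = 5),
  taxon          = rep(c("*Firmicutes*", "*Bacteroidetes*", "*Proteobacteria*",
                         "Other", "Other"), times = 2),
  mean_rel_abund = c(50, 40,  7.3, 1.5, 1.2,
                     45, 35, 17.5, 1.5, 1.0),
  mean           = rep(c(47.5, 37.5, 12.4, 1.5, 1.1), times = 2)
)

# Pool within each disease group and carry the ordering column through;
# min(mean) is unchanged for unpooled phyla and small for "Other"
pooled <- relabeled %>%
  group_by(disease_stat, taxon) %>%
  summarize(mean_rel_abund = sum(mean_rel_abund),
            mean = min(mean),
            .groups = "drop")

# Order the factor levels from most to least abundant
by_mean <- pooled %>%
  mutate(taxon = factor(taxon),
         taxon = fct_reorder(taxon, mean, .desc = TRUE))

# Rotate the levels one position so the first level moves to the end
shifted <- by_mean %>%
  mutate(taxon = fct_shift(taxon, n = 1))

levels(by_mean$taxon)
levels(shifted$taxon)
```

In a stacked bar, ggplot draws the first factor level at the top of the stack and the last at the bottom, which is why rotating the most abundant phylum to the end moves it to the zero line.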
We now have a very easy way to compare the relative abundance of these phyla using a common position on the y-axis. It's not so easy to compare the Proteobacteria, though, which is the next largest group. It's easy to see that the Proteobacteria are more abundant in people with diarrhea, regardless of C. diff status, than in people who are healthy. But between the C. diff positive and C. diff negative groups, it's difficult to see the difference in abundance. I'm okay with Verrucomicrobia being in here. It's kind of lost, but it's okay. We'll come back and deal with the colors in a moment. So we've taken on two problems now: we've shrunk the number of phyla we're looking at, and we've done our best to deal with anchoring on the y-axis, so we have a common basis for comparison for at least two of the phyla.
There are a few other things that I'd like to do to clean this figure up and make it look a little bit nicer. First, I'd like to put my legend in alphabetical order with Other at the bottom. That way it'd be easier to interpret and see what's going on. I don't like having Other in the middle, and I don't like having Firmicutes at the bottom. This default order is the order of the factor levels, because we shifted Firmicutes to the end. We can alter that by changing our scale_fill_discrete to be scale_manual_discrete. We'll then add breaks. The vector for the breaks argument is the order that we want in the legend. So I'll do *Bacteroidetes*, *Firmicutes*, and then, on the next line so we're not running off the side of the screen, *Proteobacteria* and *Verrucomicrobia*, each with its stars, and then Other. Then we need values. Here I'm just going to make stuff up, so this isn't where we're going to end, but let's do red, green, blue, orange.
And then the Other, I know I want to be gray. So I have an error: I typed scale_manual_discrete, and I wanted scale_fill_manual. That still doesn't look good. I think I forgot the Firmicutes. No, I misspelled Firmicutes, because that one is missing. Oh, it's U-T-E-S, not T-U-S. Okay. And it appears I misspelled green. Beautiful. Look at those colors. Aren't they just hideous?
Let's fix these colors to make them look a little bit nicer. To do that, I'm going to load the RColorBrewer library. Then I'm going to replace these four colors with a function call to the RColorBrewer package. I'm going to leave the gray because I like the idea of Other being kind of grayed out. There's not really any information in there, except that there's a couple of percent that we don't know what to do with. So I will do brewer.pal with 4 and the palette name that I like, Dark2. That looks pretty reasonable. Again, you could do all sorts of customization with colors. This one is nice because I know that it is red-green colorblind safe. We have our phyla, with Other last. We have Bacteroidetes anchored at the top and Firmicutes anchored at the bottom. Life is good. I'm pretty happy about that.
One other thing that frequently comes up when I show bar plots like this is that people get a little weirded out by the space between zero and the x-axis. We can come back up to our scales and add scale_y_continuous with an argument called expand, which we give two numbers. I'm going to do c(0, 0), which means the y-axis won't be padded beyond the range of the data. Be sure to put a plus sign at the end of the line. Now we see that the bars go all the way down to zero and up to 100, and that looks pretty clean and pretty nice. I think this looks about as good as we can make a stacked bar chart look. We've done a couple of things in this episode, so let me recap.
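Before the recap, here is a rough sketch of how those last plotting layers fit together. The plot_data table is a hypothetical stand-in with made-up values and two disease groups, the axis label text is an assumption, and theme_classic() is a choice made for this sketch, not necessarily the episode's theme.

```r
library(ggplot2)
library(RColorBrewer)
library(ggtext)

# Factor levels in stacking order: first level on top, last on the bottom
stack_order <- c("*Bacteroidetes*", "*Proteobacteria*", "*Verrucomicrobia*",
                 "Other", "*Firmicutes*")

# Hypothetical pooled data (made-up values; each bar sums to 100)
plot_data <- data.frame(
  disease_stat   = rep(c("healthy", "diarrhea"), each = 5),
  taxon          = factor(rep(stack_order, times = 2), levels = stack_order),
  mean_rel_abund = c(40,  4, 3.3, 2.7, 50,
                     35, 15, 2.5, 2.5, 45)
)

p <- ggplot(plot_data, aes(x = disease_stat, y = mean_rel_abund, fill = taxon)) +
  geom_col() +
  # breaks sets the legend order; the unnamed values are matched to the
  # breaks in the order given, so "Other" gets the gray
  scale_fill_manual(
    name   = NULL,
    breaks = c("*Bacteroidetes*", "*Firmicutes*", "*Proteobacteria*",
               "*Verrucomicrobia*", "Other"),
    values = c(brewer.pal(4, "Dark2"), "gray")
  ) +
  # no padding below 0 or above 100 on the y-axis
  scale_y_continuous(expand = c(0, 0)) +
  labs(x = NULL, y = "Mean relative abundance (%)") +
  theme_classic() +
  # render the starred names as italics in the legend
  theme(legend.text = element_markdown())

p
```

The legend order comes from breaks while the stacking order comes from the factor levels, so the two can be set independently.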
To remind you, we started with 13 different phyla and 13 different colors, and that was just way too much to keep track of in your brain. Those colors also didn't have very good separation between them, so sometimes you were left wondering, is this that phylum or that phylum? Who knows. By going down to four phyla plus an Other category, we now only have to keep track of four groups, and the colors have much better separation. You know, I could perhaps have done without the Verrucomicrobia in this example. That being said, I don't think it's that hard to follow, and this pinkish color makes it easier to remember, oh, that's the Verrucomicrobia. For some reason that color just stands out for me, kind of like the gray stands out for me as Other. Who knows.
The other thing that we did was to orient the different bars in our stacked bar chart so that they have a common basis of comparison. We can put the Bacteroidetes at the top with the 100% line as the basis of comparison, and we can put the Firmicutes on the bottom with the 0% line as the basis of comparison. Yeah, we can't do that with the Proteobacteria, but we have our limitations in the medium. If you stay tuned for future episodes of Code Club, I'll show you a better way of looking at relative abundance data so that we don't have that problem and every phylum can have a common basis of comparison.
Something I also want to point out is that we kind of got lucky here, in that we have so few phyla and only three treatment groups. I have seen cases where people have stacked bar charts over time, and perhaps they're not looking at phyla but at genera, where there are dozens of different genera. There it's difficult, next to impossible, to get things down to four or five groups or to have a common basis of comparison that's the same across all time points.
So again, the data are working with us here a little bit to make them easier to depict. There are still a couple of problems that we just can't quite solve. The first, which I guess we probably could solve, would be to indicate the N. We talked before about how one of the problems with bar charts, stacked or otherwise, is that you don't get a sense of the number of observations that go into calculating the value represented by the bar. Sure, below each of these x-axis labels we could put, in parentheses, N equals something. I'll leave that for you to do as homework.
The other problem that we can't get away from with stacked bar charts is how we've summarized the data statistically. We are using a mean, right? If we used the median instead, the bars would not add up to 100%, and that would cause all sorts of problems in visualizing the data and making fair comparisons across the columns. Related to that, the data are highly skewed: there are many things that are rare and a few things that are abundant, perhaps pulling up the mean. And so we don't get any sense of the variation in the data. I think whenever we report a mean, or even a median, without a sense of the variation in the data, we tell people that there is one healthy microbiome, one diarrhea C. diff negative microbiome, and so forth, which is just not the case.
We will talk about those issues in future episodes of Code Club. If you want to be sure that you hear my thoughts on those other ways of visualizing the data, so that you can have a common basis of comparison for all of your phyla or genera or whatever, be sure you subscribe to the channel and click the bell icon so you're notified. Also, please be sure that you hit that thumbs up button so I know that you appreciate what I'm doing here and that you're getting some value out of this.
If you want to get the most value out of these episodes, please, please, please go back, download the data, and work through the code with me in parallel so that you can work with the data yourself. That is the number one best way to learn how to work with R or any programming language. If you like this stuff, wonderful. Also down below in the description are links to tutorials that I have developed that step you through analyzing microbiome data with tools from the tidyverse. I teach three-day workshops that I'd love to have you participate in, and there's information down there as well about how to register. Till next time, keep practicing, tell your friends about what we're doing here, and we'll see you for another episode of Code Club.