Can you publish a microbiome study without including a stacked bar chart? Hmm. How would you create those stacked bar charts? Should you generate those stacked bar charts? And why don't I like publishing stacked bar charts? Stay tuned for this episode and I'll tell you.
Hey, folks, I'm Pat Schloss and this is Code Club. I have strong opinions about stacked bar charts, pie charts, and other visualizations in the microbiome field. In this and in subsequent episodes, we're going to talk about those strong feelings, why I like or don't like these plots, and what alternatives I would suggest that we use. Even if we generate plots that we don't like or that aren't very good, we can still learn a lot about our tooling and about using ggplot, dplyr, and other elements of the tidyverse along the way. Also, by figuring out what we don't like, perhaps we'll understand why we don't like it, so we can avoid those practices in the future. And then we can think about what other types of visualizations would be better to overcome the limitations of the kind of crummy, crappy visual we're making.
Anyway, I'm over here now in RStudio. And this is a script that we created in the last episode, where I did a whirlwind tour of dplyr. If you would like to get this code, down below in the description there is a link to a blog post that includes this starting R code. Also, across the top of the screen, you will see a video that you can go watch to see how I got everything set up with RStudio, the tidyverse, and my data so that you can follow along. And perhaps along the way, you can tweak things, modify things, make it your own, and experiment with the code that I'm generating here.
In the code that we generated in the last episode, we read in the metadata so that we have subject IDs and their disease status. We have an OTU counts table. This tells us the number of times we saw each operational taxonomic unit in each subject.
And we have a taxonomy file that tells us the taxonomy for each of our OTUs at the kingdom through genus levels. Now, if you're not a microbiome wonk, don't worry, please stay tuned, keep going. Even though you might not be really into microbial ecology and the things that I'm talking about, it's really an example for learning more about ggplot, other elements of the tidyverse, and data visualization in general. Finally, we joined those three data frames together using two inner_join statements, we then calculated a relative abundance for each of the taxa in each of the samples, and then we made it tidy so that we have a column for the taxonomic level and the taxonomic name.
We have a data frame now with five columns: the sample ID, the disease status, the relative abundance, the level, and the taxon. And so, for example, patient DA 0006 was a diarrheal control, so they had diarrhea, unfortunately, but fortunately they were not infected with Clostridioides difficile. At the OTU level, they did not have OTU 1 because its relative abundance was zero. They had this Bacteroidetes, though, at a very low relative abundance. Now, you can see that Bacteroidetes is represented multiple times for patient DA 0006. One thing we might like to do is to pool each of the taxa at each level.
And so what I want to do first is create a stacked bar chart for each of my three disease statuses. I have people that do not have diarrhea, people with diarrhea but that do not have Clostridioides difficile, and then people that have diarrhea and have C. difficile. Those are what we call the cases. I can generate these stacked bar charts now at any taxonomic level that I'd like. And so we'll go about doing that. So let's start with otu_rel_abund, and let's start at the phylum level.
So I'll do filter(level == "phylum"). And again, let's look and see what we've got. So now we see all of the data is at the level of phylum, which is great. The next thing that I want to do is group by taxon. And I also want to group by the sample ID, because within each subject or each sample, I want to pool that taxon so I can get the total abundance of each of those phyla. And so we will then pipe that into summarize, and I'll define rel_abund as the sum of the rel_abund column, right? And so if Bacteroidetes is represented 50 times within that one subject, which would mean that there are 50 OTUs that are Bacteroidetes, I'm going to add their relative abundances together. This data frame contains the subject ID, the phylum, and then the relative abundance of that phylum in that subject. What we can see here is that this subject had very low relative abundance of Bacteroidetes and relatively high relative abundance of Proteobacteria, with the balance basically being Firmicutes, and not much else in their community.
Something that I noticed we lost, actually, was the disease status. Also, we see that our data are still being grouped by the sample ID. So there are a couple of ways we could get the disease status back. First of all, I could take the output of this and then join it to metadata. You'll remember that I did that actually up here on my line 22. So we didn't really need to join metadata here. We could have done it down here below. It doesn't really matter. The way we can get it back without doing another join would be group_by(disease_stat, sample_id, taxon), and we run that. And now we see that we've got our disease status back. We still have grouping, now by disease_stat and sample_id, so we've got two grouping levels. And that's because we started with three grouping levels.
And so when you summarize, it peels off the final grouping level, taxon. If I want to get rid of all of those grouping levels, which I do, I can add .groups = "drop", with drop in quotes. And now I see that I have no grouping levels, and I have these columns: disease_stat, sample_id, taxon, and rel_abund. Excellent. Again, what I'm going for is three stacked bar charts, one for each of the three different disease status groups. And to do that, I need to summarize the data further. And so I'm going to do another group_by. I didn't strictly need that .groups = "drop", but again, my preference is to be explicit in how I'm describing my groupings. So I'll do group_by(disease_stat, taxon). I no longer want to group by the subject because I want to average across all of the subjects. And then I'll do summarize with mean_rel_abund = mean(rel_abund), so that's going to be the mean of the relative abundance. Okay. And so now what we see is that we no longer have the sample ID, but we have the disease status, the taxon, and the mean relative abundance. And again, I could then add that .groups = "drop" and pipe that to a filter on taxon == "Firmicutes". And we now get back Firmicutes and the three mean relative abundances for those three different disease statuses. And so we can see that the diarrheal controls and the cases on average have a higher relative abundance of Firmicutes than otherwise healthy individuals do.
Hopefully you're keying in on a few problems that we identified in a previous episode talking about bar plots: we really have no sense of the amount of variation here, and we have no sense of the number of subjects in those three different groups. But hold on to that and we'll come back to it in a moment. Again, this was showing just the Firmicutes, but we want all of the data, all of the phyla.
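
As a rough sketch of the two summarize steps just described, the code below runs them on a small made-up data frame. The object and column names (otu_rel_abund, sample_id, disease_stat, rel_abund, level, taxon) are reconstructed from what is said in the episode, and every value is invented for illustration, so treat this as an approximation rather than the script shown on screen.

```r
library(dplyr)

# Toy stand-in for the tidy data frame described in the episode.
# Column names are assumptions; all values are made up.
otu_rel_abund <- tibble(
  sample_id    = rep(c("s1", "s2", "s3"), each = 3),
  disease_stat = rep(c("Case", "Case", "DiarrhealControl"), each = 3),
  level        = "phylum",
  taxon        = rep(c("Firmicutes", "Firmicutes", "Bacteroidetes"), times = 3),
  rel_abund    = c(0.50, 0.25, 0.25,
                   0.25, 0.25, 0.50,
                   0.125, 0.125, 0.75)
)

# Without .groups = "drop", summarize() peels off only the last grouping
# variable (taxon), so the result is still grouped by the other two.
still_grouped <- otu_rel_abund %>%
  filter(level == "phylum") %>%
  group_by(disease_stat, sample_id, taxon) %>%
  summarize(rel_abund = sum(rel_abund))
group_vars(still_grouped)

# Step 1: pool OTUs that share a phylum within each subject.
# Grouping by disease_stat as well keeps that column in the output.
phylum_by_sample <- otu_rel_abund %>%
  filter(level == "phylum") %>%
  group_by(disease_stat, sample_id, taxon) %>%
  summarize(rel_abund = sum(rel_abund), .groups = "drop")

# Step 2: average across subjects within each disease status.
phylum_means <- phylum_by_sample %>%
  group_by(disease_stat, taxon) %>%
  summarize(mean_rel_abund = mean(rel_abund), .groups = "drop")

phylum_means
```

Piping phylum_means into filter(taxon == "Firmicutes") would then return one row per disease status, which is the check made in the episode.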
And what we can then do is pipe this into ggplot. In our aes, on the x axis I'm going to put disease_stat, for y I'm going to put mean_rel_abund, and I'm going to do color = taxon. And I will then add geom_col. geom_col uses the actual relative abundances we have to generate the plot. I'll also save this with ggsave, and I'll call it schubert_stacked_bar.tiff with width = 5 and height = 4. And I see that I made a common mistake of mine, in that I used color rather than fill. So I'll come back to my code and instead of color here, I'll do fill. So that looks a lot better in terms of having the fill color rather than the color. Remember that the color aesthetic is the border of each of the bars, whereas the fill is the color inside those bars.
Okay, I'm going to come back and we're going to critique this in a bit. For now, let's go ahead and improve the appearance of our figure. I'm going to start with labs. I'll do x = NULL because I don't need to put disease_stat on the x axis; we're going to have labels for those three bars. For y, I'm going to put in quotes "Mean Relative Abundance" and then, in parentheses, a percent sign. And that reminds me that in my summarize, I need to multiply my relative abundance by 100 so that it's the percent relative abundance. Also, I will go ahead and add theme_classic. So we see we've got our nice classic look without the gray background. We have our three disease statuses on the x axis. We'll need to fix that up a little bit, but we no longer have that disease_stat label on the x axis. We also have mean relative abundance and the percent sign, along with our numbers scaled to percent as well.
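
The plotting steps described so far can be sketched roughly as follows. The data frame is a made-up stand-in (the values are already in percent, so the multiplication by 100 is skipped here), the column and level names are assumptions reconstructed from the narration, and the ggsave call is left commented out so the sketch has no side effects.

```r
library(dplyr)
library(ggplot2)

# Made-up mean relative abundances, already expressed in percent.
# Names and values are assumptions for illustration only.
phylum_means <- tibble(
  disease_stat   = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"), each = 3),
  taxon          = rep(c("Bacteroidetes", "Firmicutes", "Proteobacteria"), times = 3),
  mean_rel_abund = c(40, 50, 10,
                     20, 55, 25,
                     15, 55, 30)
)

stacked <- phylum_means %>%
  # fill colors the inside of each bar; color would only outline it
  ggplot(aes(x = disease_stat, y = mean_rel_abund, fill = taxon)) +
  # geom_col() draws bars whose heights are the y values, stacked by default
  geom_col() +
  labs(x = NULL, y = "Mean Relative Abundance (%)") +
  theme_classic()

stacked

# Uncomment to write the figure to disk with the dimensions mentioned above.
# ggsave("schubert_stacked_bar.tiff", plot = stacked, width = 5, height = 4)
```

Swapping fill = taxon for color = taxon in this sketch reproduces the mistake mentioned above: gray bars with colored outlines.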
The next thing I'd like to take on is something we saw when we were previously doing those strip charts and other charts for the inverse Simpson index, which is to put our bars in order from non-diarrheal control to diarrheal control and then case. I'm going to come back up to my code for otu_rel_abund, and we're going to pipe that to a mutate on disease_stat. We're going to say factor(disease_stat), and I'm going to do levels = and then put in the names in the order that I want them to be. So that'll be non-diarrheal control, diarrheal control, and then case, which has got to be in quotes, and then I'll run all this. And so now I see I've got non-diarrheal control, diarrheal control, and case along the x axis, which looks nice. I can then clean that up by doing scale_x_discrete with breaks =, and I'm going to make that a vector, but I'm going to grab this from up above here, bring it down, and paste it in. And then I'm going to do labels =, and I'll do healthy, then diarrhea, C. difficile negative, and then diarrhea, C. difficile positive. Okay, so let's run this. I need that plus sign at the end there. And I see now I've got healthy; diarrhea, C. diff negative; and diarrhea, C. diff positive.
I'm going to go ahead and add library(ggtext) so I can italicize my C. difficile to make that look fancy, there and there. And then I will also put in line breaks here. I forget what I called all these on my strip charts. Something we need to add to get ggtext to work with our Markdown and HTML here is a theme call, where I'm going to do axis.text.x = element_markdown(). We now see that we have healthy; diarrhea, C. diff negative; and diarrhea, C. diff positive.
I want to turn my attention now to the legend. We've got this title of taxon, which doesn't do a whole lot for me. Also, the names here and the squares for the key are quite large.
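
Here is a minimal sketch of the bar ordering and axis relabeling just described. The exact spelling of the three level names, the label wording, and the toy values are assumptions; the sketch also assumes the ggtext package is installed.

```r
library(dplyr)
library(ggplot2)
library(ggtext)

# Level names as they are assumed to appear in the data (hypothetical spelling).
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")

# Markdown/HTML labels: *...* italicizes and <br> inserts a line break.
status_labels <- c("Healthy",
                   "Diarrhea,<br>*C. difficile* negative",
                   "Diarrhea,<br>*C. difficile* positive")

# Made-up values, deliberately listed out of order.
phylum_means <- tibble(
  disease_stat   = rep(c("Case", "DiarrhealControl", "NonDiarrhealControl"), each = 2),
  taxon          = rep(c("Bacteroidetes", "Firmicutes"), times = 3),
  mean_rel_abund = c(30, 70, 35, 65, 55, 45)
) %>%
  # A factor with explicit levels sets the left-to-right order of the bars.
  mutate(disease_stat = factor(disease_stat, levels = status_levels))

ordered_plot <- ggplot(phylum_means,
                       aes(x = disease_stat, y = mean_rel_abund, fill = taxon)) +
  geom_col() +
  scale_x_discrete(breaks = status_levels, labels = status_labels) +
  labs(x = NULL, y = "Mean Relative Abundance (%)") +
  theme_classic() +
  # element_markdown() makes ggtext render the markup in the axis labels.
  theme(axis.text.x = element_markdown())

ordered_plot
```

Without the theme(axis.text.x = element_markdown()) line, the stars and the <br> tags would be printed literally on the axis.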
So I want to get rid of that taxon title and I want to make the font smaller. Also, these are bacterial phyla, and so, at least according to ASM's journal standards, they should be italicized. Let's go ahead then and add scale_fill_discrete, and I'm going to do name = NULL, which should get rid of the name there. So now we've gotten rid of that name. Next, I'm going to come down into my theme, and I will do legend.text = element_text with face = "italic". We now have our phyla names all italicized, which looks good. One name that sticks out to me is this Bacteria_unclassified; we'll come back and deal with that in a moment. I want to make those phyla keys a little bit smaller. We'll do that also down here in the theme. So we'll do legend.key.size, and I'll use unit. Let's do 10 and "pt" for 10 point size. So I think the font size now looks a lot better. The legend isn't taking up the full height of the figure.
One thing that sticks out to me, and that always annoys me when I see it show up in papers, is that we've got Bacteria_unclassified. We don't need that underscore; that just looks unprofessional. And the unclassified shouldn't be italicized. Ideally, it would probably be Unclassified Bacteria. So let's see if we can do that. Also, you know what? Instead of using face = "italic", I think I'm going to use element_markdown to italicize my phyla names. But that means I also need to put stars on either side of my phyla names. So let's see if we can do that here quickly. I'll insert a mutate up here, mutating my taxon using str_replace. I'll do the string replace on taxon. For my pattern, I'll use a set of parentheses. Parentheses in a regular expression mean save the contents of what's matched inside the parentheses. So we'll do .* inside them to match everything leading up to _unclassified.
And then we will replace that with Unclassified, with a capital U, followed by *\\1*. And so that should put stars around Bacteria but leave Unclassified without stars. And so now I see I have Unclassified *Bacteria* with the stars down here. And I now need to put stars around everything else. So let's come back up here and add another string replace. So we'll do taxon = str_replace on taxon. And for the pattern, again we're going to put in parentheses, and I'm going to look for a string without spaces. I can do that with \\S. So \\s with a lowercase s means match a whitespace character, and \\S with a capital S means match a character that is not whitespace. I'm going to match the full thing, and I want this to start and end the string, so I want it to span the whole string length. We're going to then replace that with *\\1*. And now we should have a star at the beginning and end of all our phyla names that don't have spaces. The unclassified one that I have up here now does have a space, so we shouldn't get extra stars around that. So let's see what this does. We now have stars around all of our phyla names and we have that Unclassified, space, Bacteria. Coming back to my code, I'm going to turn off this face = "italic", and instead of element_text, I'm going to do element_markdown. And now we have italicized phyla names, but the Unclassified is not italicized. And so that warms my heart because it all looks nice and professional and clean. It also has the nice effect of putting this odd taxon of unclassified bacteria at the bottom.
Okay, let's critique this. Let's see what we think about this. A couple of problems from the get-go. First of all, there are 13 colors. That is way, way, way too many colors. I can't keep track in my head of 13 colors. Also, besides trying to keep track of those colors, it's very difficult to keep track of the variation in those colors, right?
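
Stepping back briefly to the two string replacements described before the critique, here is a small sketch of how they behave on a few example names. The patterns are reconstructed from the spoken description, with the ^ and $ anchors assumed to be how "start and end the string" was written, so this is an approximation of the code on screen.

```r
library(stringr)

# Example names; Bacteria_unclassified is the one called out in the episode.
taxon <- c("Firmicutes", "Bacteroidetes", "Bacteria_unclassified")

# Step 1: capture everything before "_unclassified", then emit the word
# "Unclassified" followed by the captured name wrapped in stars.
step1 <- str_replace(taxon, "(.*)_unclassified", "Unclassified *\\1*")

# Step 2: wrap names made up only of non-whitespace characters in stars.
# The ^ and $ anchors force the match to span the whole string, so the
# name that now contains a space is left untouched.
step2 <- str_replace(step1, "^(\\S*)$", "*\\1*")

step2
# → "*Firmicutes*" "*Bacteroidetes*" "Unclassified *Bacteria*"
```

The order matters: running the second replacement first would wrap Bacteria_unclassified in stars as a whole, because it has no spaces until the first replacement adds one.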
And so at the phylum level, we're doing okay, in that I can kind of see that this orange corresponds to, I think, the Bacteroidetes, and this lighter reddish color up at the top is Actinobacteria. I don't know that I can see the Campylobacterota or the Cyanobacteria. I then have these Firmicutes. I'm kind of wondering about this color here between the Firmicutes and what's probably the Proteobacteria: is that Fusobacteria or Lentisphaerae? It's too small and the shades are too close between those two phyla for me to really be able to differentiate them. And then these four phyla, really five phyla, from Spirochaetes on down are really difficult to distinguish. And I might even say I can't distinguish between Proteobacteria and Spirochaetes to know if this blue blob is Proteobacteria or Spirochaetes. That would be important to know, right? So again, when you get so many colors, it's difficult to keep track in your brain of what the colors map to. And there's not enough resolution between the different colors to really be able to tell them apart. So that's a big problem that I have with stacked bar plots.
Another problem with stacked bar plots is that for the most part, we don't have a common base to compare the size of the bars. Now, for Bacteroidetes here, we kind of do, because this top population, this top phylum, Actinobacteria, is really small. And so because we got lucky, these are all kind of anchored to a common point on the y-axis. And so I can see that healthy people clearly have a lot more Bacteroidetes than those with diarrhea, with or without C. difficile. But if I come and look at the Firmicutes here, I don't have a common base that I can anchor those three rectangles to, to say which is larger or smaller, or how much bigger or smaller they are in one group versus another group. Again, with the blue, which I'm pretty sure is Proteobacteria, we again have a common baseline here, but we got pretty lucky.
With other phyla and other communities, we might not be so lucky to have a common baseline for those three different groups. So the too many colors and the lack of a common baseline to compare the size of rectangles are two of the big challenges with stacked bar plots and why they really just don't work. Again, we're working at the phylum level where there are only 13 different taxa. If we were working at, say, the genus level, where there are dozens or hundreds of different genera, it's an even bigger mess. One of the other challenges that we've talked about in the past with bar plots, but that is particularly acute here with stacked bar plots, is that we have no sense of the variation in the data. I can't really put error bars on these individual rectangles, which just causes problems. Another challenge is that I don't know how many subjects are represented by these bars. We've talked about that before with bar plots. Yeah, I could put an N below in the labels here, but that's not really all that appealing.
So I am not a fan of stacked bar charts for these reasons. Again, the big reasons are that there are too many colors and there's a lack of a common anchor to compare the length of the bars. One of the biggest and most important pre-attentive attributes is position on the y or x-axis. Having that common anchor allows us to utilize that pre-attentive attribute and gives the viewer an easier time making comparisons. This is something that we'll come back to and try to remedy in future visuals that we develop.
So something that I'll leave for you to do as homework is to see if you can take this plot that I made at the phylum level and put it down at the genus or even OTU level and see what you think. How much worse can it possibly get? A lot worse. A lot, lot worse. Down below in the comments, please let me know what you think of my critique of stacked bar plots. They're very easy to make in R.
It's the default configuration if you use geom_col where you're filling by your taxon and on the x-axis you're putting something like a disease status or some type of treatment group. That works very well and it's very easy, but just because it's easy doesn't mean it's right. So let me know what you think of my critique down below in the comments.
Anyway, I hope you found some value in this. I know I'm leaving you hanging; this is kind of, like we say, putting lipstick on a pig, but we will come back and remedy the situation as we go forward in looking at other ways of representing the same data. Please tell your friends about these Code Clubs. Be sure to give me a like on this episode, and be sure you've subscribed and clicked that bell icon so you're notified when these future episodes are released, so you're not left thinking that there's no good way to represent relative abundance data using ggplot. There definitely is. Anyway, we'll see you next time for another episode.