 One of the most common ways of representing relative abundance data in microbiome studies is to use a stacked bar chart. In recent episodes, we've talked about the problems of stacked bar charts. Today, we'll overcome one of the most fundamental problems with stacked bar charts by using the position argument in the geom call function. Hey folks, I'm Patch Loss and this is Code Club. If you've been following along in recent episodes, you know that I've been stepping through a variety of different tools and approaches that we can use to represent relative abundance data for microbial ecology studies. We've talked about stacked bar charts. We've talked about pie charts. We've talked about heat maps. We've looked at all sorts of different combinations when we've talked about the strengths and weaknesses of each of these approaches to representing the data. One of the main problems with these tools, primarily the stacked bar charts and pie charts, is the lack of what I've been calling an anchor on the y-axis, from which to make a comparison across our different disease treatment groups. You can imagine that if there is a tile in the middle of a stacked bar, like we talked about previously, the proteobacteria, that it's difficult to make an apples-to-apples comparison across multiple different treatment groups because we don't have a common frame of reference on the y-axis with which to gauge the change in the relative abundance of, say, the proteobacteria. Whereas, say, something like the fermicides or the bactroid eddies, we're able to put across the zero axis point on the y-axis or the hundred percent, which then allowed us to make a much more direct comparison across the three treatment groups for those two phyla. But again, as you get more and more populations that you're trying to compare the relative abundances for, you lose those anchoring points. So in today's episode, we're going to use the position argument in GeoM call to anchor everything at the zero point on the y-axis across the x-axis. And we will talk about what is the optimal way to group our variables. Do we group by taxa or do we group by disease group? And why it probably depends on what your actual question is. But we'll deal with all that as well as some other issues that come up along the way and how we can use the great functionality from within ggplot2 to make an attractive publication quality figure of our data. Moving over to our studio, we have the code that we are starting with for today's episode. Again, if you want to get this code that I'm working with, so you can work in parallel with me or perhaps go off on a tangent and try something else out, you can get a copy of this code from a blog post that's linked down below in the description. I really encourage you to grab that, copy that, get that into our studio. Also, you can get the data that I'm working with at a link up across the top here, as well as see how to install our our studio, get the tidyverse package and these other packages that I've been using. In this code, you'll see that I am loading four different libraries, they are all installed actually with the tidyverse. So some of these, though, like readxl, ggtext are a color brewer, you have to you have to load the package into the library because it's not loaded automatically when you load the tidyverse. Anyway, we then get in our metadata, our otu counts, our taxonomy information, we then use all that information and join it together into a big data frame, where we are also calculating the relative abundances. We have data at the kingdom through otu levels. And then we set a factor of non diureal control diureal control in case. Oh, yeah, these are data from a study looking at Clostridioides difficile infections in three cohorts of people, people that are healthy, people that are hospitalized but have diarrhea, and people who are hospitalized with diarrhea and also who have C difficile. In this next code junk, we're creating a data frame called taxon relabund, where we take that big joined data frame otu relabund, we're filtering it at the phylum level, and then aggregating together all the relative abundances for the otus that make up each of those phyla. Then what we do is that we take all of the subjects within those three disease status groups. And within each of those three disease status groups, we calculate the average relative abundance for each phylum. We multiply it by 100. So we get a percent. And then we use string replace to do some modifications on those phylum names. So they look nice and we can italicize them using the functionality that we loaded from that gg text library, we then create a variable taxon pool, where we're looking at what phyla occur at a level below 3% in all three of the disease status groups. If if none of the phyla have an average relative abundance above 3% in any of the three disease status groups, we then pull them together. And that's because in a lot of these visualizations, we can get these really thin slices, I call them shrapnel, taxonomic shrapnel, that you can't see, but creates tons of colors in your legend and just makes it really difficult to distinguish between your different taxonomic groups. Finally, we do a join between our taxon relabund data frame and our taxon pool data frame, and then use that information from the pool to tell a group by and summarize, which phyla to pool together within each of the three disease status groups. What we do before we plot is we define taxon as a factor. And we are ordering that factor in terms of decreasing relative abundance. And this was the tool that we use so that we could put the firmicities on the bottom and the bactrodeities up at the top of our stacked bar charts. And then we have a bunch of styling here to go and make the plot. So here's the set bar chart that the code generates. We can see that we again have the firmicities arrayed across the bottom at the 0% y-axis anchor, and the bactrodeities again arrayed across the three disease status groups anchored at that 100% on the y-axis. And as I said, the challenge then is for the proteobacteria, where it's not totally clear if they're the same relative abundance in these two diarrheal categories, or again, if the verucomicrobia have the same relative abundance. And the difficulty there is that we don't have a common anchor position on the y-axis to compare those different groups. So what we're going to do today is to pull apart these stacked bars. We're going to unstack them. To achieve that, I'm going to come back up here to geome call. And I will do position equals quote dodge. We have our deconstructed, if you will, stacked bar chart, where we have our three disease status groups across the x-axis. And then we have separate bars for each of the four phyla, as well as the other category. We're going to we're going to work more with this, don't worry to make it look even better. But before we do, I want to come back up to my geome call function and talk more about this position dodge, as well as different arguments that we can use in geome call. One other value that we can use instead of dodge is position underscore dodge. And as we go through future episodes, we're going to look at more ways of plotting this type of data. And we're going to use different position arguments in there. So I kind of prefer to use position underscore dodge, because that works well with other arguments or other values for the position argument. So we'll see something like position jitter dodge, or position jitter. And again, this framework works well for me. It's easier for me to keep track of believe it or not, than putting in quotes dodge. We get the same result, right? And if I come back up to position dodge, I could do something like width equals 0.5. And so what we see then is that the dodge is half the width of the column. And so this kind of gives an interesting look, right, where we've got bars in front of each other, kind of a cool look, generally not what we want. And so to get away from that, what you could also do is we could add the argument width equals 0.5. The result of that then is that our columns now are half the width that they previously were. And because our dodge was 0.5, everything is dodged by half a unit. The other benefit of doing that is now we see we have more of a gap between our three disease status groups, which is a nice look in some cases where, you know, if you've got too many groupings, the distance between each of the three groupings kind of gets hard to resolve. And so being able to play with the dodge width, as well as the width of the columns is really nice. Now you don't really need to set the width argument in both position dodge, as well as for geome call. If I leave out width equals 0.5, but put in 0.5 as an aesthetic for geome call, I'll get that same result, right? So those narrow columns with a good amount of separation between the bars, these bars are a little too thin for me, I'm going to go back to the default and I'll remove the width equals 0.5. They're wider and perhaps not as much separation as we might like, but we're going to do some tweaking here and we'll reassess that column width and and dodge width here in a moment. Before I do that, though, I want to talk about how we have our columns grouped. The grouping of these columns is not ideal. What I'd prefer to have is have a grouping for Bacteroid Eddies, a grouping for Formicities, a grouping for Proteobacteria, Vruicos, and other, right? So if I have three bars for Bacteroid Eddies, then it's much easier for me to compare the average relative abundance of the Bacteroid Eddies across those three disease status groups. To change the grouping, I'm going to come back up here to my ggplotline and instead of x equals to disease status, I'm going to set the fill color to vary by the disease status. Also, instead of fill equals tax on, I want my taxa across the x-axis. So I'll say x equals tax on. And what we get back is a blank plot. Why is that? Well, that's because we set scale x discrete to have these breaks of non-diarrheal control, diarrheal control case. And those are our disease statuses, which are now the fill, rather than the x-axis position. So I'll go ahead and comment that out. So I'm not going to delete this code just yet because I'll probably want to reuse it because we'll want to shift scale fill manual and scale x discrete. And so I'll probably want to recycle this code later. Now we see that we have our taxa arrayed across the x-axis. This is a much better way to array the data to make comparisons of the different fila across the three disease status groups. So for example, we can very easily see now that the Bacteroid Eddies are much higher relative abundance than in the healthy individuals and they are in people with diarrhea, whereas the Protea Bacteria are much higher in people with diarrhea than they are among people that are healthy. And we see that the difference in relative abundance between people with diarrhea who are seat of negative and seat of positive really isn't that big of a difference. Couple things about this that I'd like to change. First, I want to move that Formicities back to the left side. I have in my code a shift function that was what we use to put the Formicities on the bottom and the Bacteroid Eddies on the top. So we'll do that. And we'll also go about cleaning up our legend and maybe making our figure a little bit wider and see if that helps with the overlap in our legend, in our in our x-axis names. So let's first change our width to be 6.5 inches in height before. And we see now that we have better separation of our data. So back up here, we have this taxon equals factor shift tax on n equals one, which meant to shift our our our fila. So that again was what, you know, put our Formicities on the bottom and Bacteroid Eddies on the top. I'll go ahead and remove this. So that undid the unshifting and now we have Formicities on the left, Bacteroid Eddies, Proteo, Vruco and others. Very good. Let's go ahead now and see about moving this legend. And we can do that inside our theme function, we can do legend dot position. And we can define the position using a C function to define a two element vector. So the first spot is x and the second is y and it's relative to the plot. So I'm going to start with 0.8 0.8. And we see that that brings it inside. And now we lose kind of that snugness, if you will, between the Proteobacteria and Vruco Microbial. And I think this looks a lot better. Next, I'd like to fix our legend and coloring. I'm going to go ahead and remove this code for scale fill manual and uncomment the scale x discrete. And so I'm going to make this scale fill manual. I'm going to use the same breaks and labels, going to name equals null that will get rid of that disease underscore stat column or legend name. And then we also need values. And so the values are going to be the colors that we want. So for healthy, we've been using gray for diarrhea, but seed of negative, we were using blue. And then for diarrhea with seed of positive, we're using red. And it seems like a misspelled manual. And now we see that we have our three colors, keeping with kind of the color scheme that we were using several episodes ago, when we were looking at variation in community diversity for these same samples. So that looks pretty nice. We also have our stylized legend with those names plugged in, not the variable names. A couple problems that we have a few line breaks in here for our diarrhea names. So I'll come back up to my scale fill manual code. And I'm going to remove this second line break. So that maybe these labels will go over two lines instead of three. And so it looks okay. I'm noticing that the key height is the same as the line height or the height of the label. I can come back into a legend key size. And for now, let me go ahead and comment that out and see what happens. So I think this looks pretty good. I like that we have our coloring scheme that we've seen before, that we can make easy comparisons across our three different disease status groups for each of the phyla. It's much easier to see the relative differences between the three disease status groups. Now, I can easily see that the Vruca Microbi are kind of increasing as we go across these three disease status groups, whether or not that's statistically significant, I don't know. But again, it's easier to see, right? So that's really nice. Again, this is looking at the phyla where any one phyla has a relative abundance over 3% on average. And so that's where we get that other pools, that these are the phyla that none of them are above that 3% threshold. And you can imagine that, you know, if we looked at a lower threshold here, it would just be really hard even to see those wedges, because they'd be so small, like it's tiles, right? So I think this looks good at the phylum level. What if we were to modify the code to now look at the genus level, what we would do is let's come up to where we did our filter. And so we did filter level equals phylum, and I'm going to modify this and say genus. Excellent. So for the most part, this worked pretty well, right? We have sets of stacked bars for each of the genera. Of course, I can't read the names of the genera across the x axis, I could maybe shrink the size of the font, but then it'd be too small. So we'll come back and deal with that here in a moment. The only thing I noticed is that my other category is really tall. And so that is the most abundant group of genera in this data set. So again, we're using that 3% threshold, where we're requiring that on average, within each of the three disease status groups, at least one of the disease status groups has to be above 3%. So I might go ahead and reduce that down to 1%. But again, we'll come back to that moment. First, I want to deal with these with these titles across the x axis back up here in axis text x, I can add angle equals 90. And we talked about this in a previous episode, but that we can, again, angle it. So I guess, when you're looking at it, crane your neck this way to read those titles, which is not really satisfying, I'd say, right? So let's see if we can do a few things to clean this up. So first of all, these unclassified names are long. And so I'd like to put in a line break between the unclassified and that name. And so coming back up here, we had this code, where we did the unclassified, and I'll put in a br after unclassified. Also, in this next line, where we are putting stars around phyla names or genera names that don't have unclassified in it, we're looking for names that don't have spaces. So what I want to do now instead, because I've removed that space, is look for things that contain kind of those greater than less than signs. And to do that, again, this goes back to the idea of regular expressions, which I encourage you to go learn about. I will do the carrot less than. And what that means is match any character that doesn't contain a less than sign. Okay, so this square braces, square brackets, is kind of me creating my own meta character, if you will, so that back back capital S meant match characters that don't contain spaces that aren't spaces. Here, we're matching characters that aren't the less than sign. And now what we see is that we still get our titleization, and we've put that line break in. But one thing I'm noticing now is that everything is shifted to the right a notch, right? So you can kind of tell this vectorities doesn't line up on the tick mark. To fix that, I'm going to come back to access text, text x, we'll go ahead and do V just equals 0.5. And that'll be a vertical justification. So that this is weird to think about, because we're looking at a 90. But if we think about it, angle of zero, where it's, you know, parallel to the x axis, the V just would mean kind of vertically center things. So sure enough, that did the trick. We now have everything centered on the tick. I would also like to justify things to be up at the tick. And so I then will do H just equals one, which should then right justify again, think about things without that angle. And so now we see we have everything slammed up against the x axis. And that's easier to read, I would say, and a little bit cleaner. Again, I'm still craning my neck, turning to look at it. So one of the things we talked about in a previous episode, is that we could take this and pull pot and shift at 90 degrees. And so we saw that we could use chord flip before. But as I showed in that previous episode, things get kind of confusing, because you're not sure what's x, what's y, and so forth. And I think if you're, you know, if you're kind of experimenting, chord flip is a nice way to go. I'm going to go ahead and switch my x's and y's. So I'm going to put my taxa, my taxon variable on the y axis, my mean real bond on the x axis. So I'm looking through here now for anything else that I might need to change. So my scale x, y continuous should be scale x continuous. My x should be y and my mean relative abundance should be x. My axis text x should now be axis text y. And I think we're in good shape. Again, we've got our angle on. So we've got that the same problem, but in the opposite direction now. I'm going to come back down to my theme and turn off maybe all this stuff right now. So we see we have our genera on the y axis and our relative abundances going off to the right. I would prefer to have the bechrodedis and inner caucus up at the top and the other at the bottom. So we can change that back up here where we set the factor factory order and we did descending true. Let's do false and that should then flip the order. And sure enough now we have bechroedis at the top inner caucus and so forth with and with other at the bottom. Again, things are pretty tight. So I'm going to make this a little bit of a taller figure. And maybe I'll make it more narrow. So I'll go with the five height of six. And so now we see we have our genera across our y axis, our relative abundance across the x axis. I think that looks pretty nice. The bars are maybe a little bit thick. And so I might like might like to get a little bit more separation. So if you remember how we did that, well, we can come back up here to geom call. And I can do width equals let's do 0.6. And that gives us more separation between our different genera. So another thing I'm noticing is that our disease status groups are flipped from what I had originally. I would like to have healthy first, C diff negative diarrhea and then C diff positive diarrhea. So let's go back in and change that ordering as well. And we can do that way back up here, where we set that levels for disease stat. So I will take case and put it up here and put non diaryl control down at the third level. So again, we now see that our three bars for each of the genera are ordered in the order of our disease progression. That looks great. One thing I'd like to change is let's go ahead and move our legend. So it's not overlapping with our enterococcus bar there. So I'm going to nudge that down just a smidge. And so again, X, Y. So let's go ahead and put this up to say 0.6. That looks great. Our legend is no longer overlapping any of our bars. Let's go ahead and reduce the threshold where we're pooling data together. So kind of maybe see if we can make that other pool a little bit smaller. I'll come back up to where I was doing setting the tax on pool. So I said less than three. Let's go less than one. So we now see we have more taxa here, which is good and that our other pool is much smaller. I first noticed that my y-axis labels are perhaps a little bit too big. So let's go ahead and shrink that font a little bit. Another thing I'm noticing is Clostridium sensu stricto has those underscores. I hate those underscores in figures. When those slip through, I'm always just like, ah, people, people, format your labels. Anyway, so my understanding is that sensu stricto should be italicized, but we don't want those underscores. So we'll go ahead and change that as well as the size of the font coming back up to where we use that string replace backup here, we'll do tax on equals str replace all. So we want to replace all occurrences of that underscore, we'll do tax on, and we will then do quote underscore and then space. Next, we'll come back down to our axis text y, and I'll do size equals six and give this all a run. And now we see we have better separation of those labels and that our Clostridium sensu stricto is no longer has those underscores. Okay, good. So one last thing that I'd like to do is I'm generally not a big fan of grid lines. I think they kind of add clutter to a figure without really providing much information. In this case, however, our figure is going to be quite tall. And if I'm over here looking at acrimacia, it might be difficult for me to look down and say, is that to the 5% threshold? It's just not quite clear to me. So I'd like to do is maybe add vertical grid lines to our plot, something subtle, you know, a light gray, to make it easier to see kind of where are we in mean relative abundance space. So back here in our legend, we can do panel dot grid dot major dot x, because they only want the x grid line, I don't want the y element line, and I'll do color equals light gray. And I'll do size equals say 0.25. So I think this figure is pretty attractive for looking at genus level data. I like having those vertical grid lines to give me a reference point. Again, I'm not such a big fan of grid lines in general. But for a tall figure like this, I think it does provide reference points to know where those 5, 10, 15, 25% thresholds are for relative abundance. Of course, this plot still has some problems. We don't know the end for any of the three disease groups. I could, I guess, go ahead and put an n equals something for healthy, diarrhea, seed of negative, diarrhea, seed of positive, that could work. Another problem with bar charts in general, is that we don't get a sense of the variation in the data. Yeah, I could go ahead and put one plus or minus the standard error in here, but spoiler, these data are not normally distributed. And so putting in the standard error or standard deviation makes no sense. So as we come back to this data in future episodes, we'll see more and more how we can add that type of information. Finally, we don't know if any of these differences in any of the genera are significantly different across our three disease status groups. So that's something that we'll come back to in future episodes of Code Club. So with that being said, please be sure that you subscribe to the channel and click the bell icon so you know when a notification comes along. Please play with these data, play around with changing different ways that the data look. Definitely try this with your own data. Do you have a stacked bar chart that you've been working on? Well, what happens if you transition this to a bar plot like we've shown here? Again, a bar plot isn't where we're going to end up. But I think this is a lot better than a stacked bar plot. I've gotten feedback from people telling me my PI loves stacked bar plots. And so we have to use stacked bar plots. I feel your pain. I would encourage you to maybe share these videos with your PI can show them this relative to a stacked bar plot and ask them what do you think works better? These are the reasons I like a bar plot and see if those arguments hold any sway with your PI. Hopefully your PI can buy into that. And hopefully, maybe you can get your PI to watch these videos as well. And as we'll see in future episodes, we'll do a better job of dealing with representing the variation of the data, as well as the number of individuals represented by these bars. Anyway, keep practicing with all this. Let me know what your PI says. And we'll see you next time for another episode of Code Club.