 Microbiome relative abundance data, whether it's human or environmental or wherever, is really difficult to visualize. Why is it difficult to visualize? Well, each individual sample that you're looking at can have a large number of taxa, perhaps dozens of different taxa. Also, those taxa are generally not found in every individual or every replicate from the group that you're looking at. All of these things combine to make it really challenging to analyze and compare microbiome data. In today's episode of Code Club, I think we're gonna get really close to a solution for this challenge that we've been working through in the recent episodes of Code Club. Hey, folks, I'm Pat Schloss. This is just a challenge that I have been struggling with for as long as I've been working in microbiome data, and I know that I am not alone, that nearly everybody that works in microbiome data really has this struggle of how do we best represent relative abundance data? If you've been following along in recent episodes, you know that we've talked about stacked bar charts, we've talked about pie charts, we've talked about heat maps, we've talked about faceting stacked bar charts, we've done so many different things. And I think in the last episode, we got really close to the right answer, and today we're gonna flesh that out further. You'll recall that in the last episode, we made side-by-side bar charts where we grouped the bars by the different taxa, and then for each taxa, we had three different disease status groups, and this made it much better than what we'd been seeing with, say, stacked bar charts because we could put everything on the same basis of comparison. Everything was pegged at a 0% relative abundance at the bottom of the plot. That way, then, we can look within each taxa to see how does that taxa vary across the three different disease status groups. That's just something wasn't possible really with pie charts or stacked bar charts. The problem, though, with those bar plots is that we didn't have a sense of how many subjects or samples were in each of the bars. We also didn't have a sense of the amount of variation in the data. In today's episode, we're gonna go back to an approach that we saw when we were talking about diversity data, and we're gonna build a strip chart or jitter plot depending on how you wanna call it. We'll do some nice stylizing with it, and we'll also put a line across each of our clouds of points to indicate where the median is to make it easier to see where the central tendency is in the data, the variation in the data depicted by that cloud, as well as the number of individuals that go into each cloud by being able to see each individual point. Come along with me into our studio now and we'll see how we can get this done. As always, I strongly encourage you to get the code that I am working with down below in the description. There's a link to a blog post that's associated with this video. You can get the code that I'm starting with. Also, up across the top, I'll put a link to a video that describes where you can get the data, how you can download the data and put it in the right place, how you can get R, set up our studio and just get everything you need so you can follow along. I really, really, really encourage you to follow along because along the way, I'm not gonna be able to get to everything that you might imagine doing. So I'm gonna say, hey, why don't you go work on this? And based on principles that you've learned in other episodes of Code Club, see if you can get it to work with a strip chart, okay? So today we're working through this code. We go ahead and we read in the metadata that tells us the disease status, the OTU data that counts for each OTU, the taxonomy information. We then join this all together and get the relative abundance for each OTU for each subject. We then look at taxon relabund as a data frame we're generating where we're filtering at the phylum level. We're then aggregating together all those OTUs from each subject that belong to each phyla. We get the relative abundance and then we identify those taxa, those phyla that we can pool because their maximum abundance on average across all subjects from one disease status group is less than 3%. We then pool those phyla together by joining that taxon pool data frame with our taxon relabund data frame. And then we throw it all together to make a side by side bar chart. And so again, we see in this figure, we have the average or the mean, which really only makes sense if your data are normally distributed. We have no sense of the variation in the data. I'm just gonna tell you right up that the data are not normally distributed. They're highly skewed oftentimes. And so the mean doesn't make sense. So we'll use the median. We have no sense of the variation in the data. So it didn't make sense to put on like a plus or minus standard error, standard deviation. And again, we don't see an end. So by the end of this episode, we're gonna convert this side by side bar chart into a strip chart where I think we will overcome those problems. But unfortunately, you might identify one or two other problems because, hey, no visual is perfect. So the first thing that we need to do is break apart the data so that we're not aggregating things together by disease status, but that we're keeping things separate at the subject level. We've done this before in a previous episode, but just to remind ourselves, if we look at O2 relabund, that does keep the subject sample ID. Now down here in taxon relabund, we go ahead and we group, we filter at the phylum level to pull out the phylum information. We then group by disease stat, sample ID and taxon. And then we then group things together, group O2 used together that are from the same phylum. And then down here in lines 43 and 44, that's where we aggregated across the disease status groups. So I'm gonna go ahead and remove those two lines, but I'm gonna add back in the 100 times my sum relabund so that I can report my relative abundance as a percentage. Let's go ahead and run taxon relabund, and we see that we have our disease status group, our sample ID, the taxa, the phylum that it's coming from, as well as the relative abundance. In this next code chunk, we're creating a data frame called taxon pool that we will use to tell our which phyla or which taxa in general, we wanna pool together because within all three of the disease status groups, there isn't a disease status where the relative abundance on average is greater than 3%. Now, when we did this before, used this before, taxon relabund was already aggregated at the disease status level. We still have things separated at the subject level, so we need to modify this code a bit. Again, we have taxon relabund down here, where we have the disease status, the sample ID, taxon, and relabund. So I'm gonna group by my taxon. I also wanna group by my disease status, and I am then going to go ahead and get the mean for the disease status and for that taxa. So we'll do summarize, and we'll do mean equals mean relabund. And again, we're not going to plot this mean, we're using this to figure out which phyla to pool. And so again, we now have the mean relative abundance. I'll go ahead and do dot groups equals drop. And I will now pipe this into another group by, where now I'll group by taxon, and I'm gonna summarize to say pool if the max mean value is less than 3%. So again, if the mean for any given phyla across all three disease status groups is less than 3%, we wanna say pool equals true, because we wanna pool it. Also then to sort our phyla in terms of relative abundance, we also need then the mean across all three disease status groups. So this then gives us taxon pool, which we can now see has the taxon name, whether or not to pool it, and the mean relative abundance across all three disease status groups. Next, we're gonna go ahead and bring this into our inner join with taxon relabund. We then mutate taxon so that if pool is true, so if we should pool it, then we're gonna call it other. Otherwise, we're gonna use the taxon name, we'll then group by disease stat and taxon. Actually, we also wanna add in here sample ID because we wanna join together those tax of relative abundances for each person separately. We're also leaving the disease stat in here because then when we finish the summarize, we'll still have the disease status for each of the subjects. And so here then, we're gonna do some on relabund and I will also call this relabund. And then the mean, it will be the smallest value of the mean. Again, this is used for sorting our different phyla. And then we've got all this good stuff for ordering our factor of the taxon, ordering our taxa using factor with FCT reorder using that mean column to determine the order of the data. And we're gonna do a descending sort from most abundant to least abundant. Good. Now, we're getting into the good stuff of making this strip chart. So on the X axis, again, we want our taxa, the Y axis, we want relabund, not mean relabund. And then our fill will put disease stat. I'm gonna switch this to color because with the strip chart or jitter plot, we're gonna use those circular PCH symbols to represent the data. And so that uses color as the color of the symbol. So for geom call instead, I'm gonna do geom jitter. And let's go ahead and use that and see how far this gets us. I'm also gonna use scale color manual. And so that'll be good. And yeah, let's give this a shot and see what we get. Cool. So it's a strip chart. Unfortunately, we have our three clouds of points on top of each other. So we need to pull those apart. And so again, this is very similar to what geom call does on its own. So geom call with no arguments stacks the data on top of each other. Well, we're stacking the data on top of each other here. So do you remember from that episode, the last episode, how we pulled that apart? Well, we use a position argument. So we said position equals position dodge. Here we're gonna use position equals position jitter dodge to pull those clouds apart. So we'll come back up here to geom jitter and we'll do position equals position jitter dodge. Say that a bunch of times real fast. And we see that we now have our three clouds of points for healthy diarrhea, seed of negative and diarrhea, seed of positive in red. That looks good. Before we go too far into talking about position, jitter dodge, one thing I wanna come back to is we notice that some of our points that are at zero, we only kind of see half the circle and those that are up towards a hundred, we only see half the circle. So that is because up here in Scale Y Continuous, we used expand C zero to zero. And again, that was what we used with the bar charts to have the bars start at zero and at a hundred and have the axis come up to the zero point. And now we see we have a little bit of breathing room around zero and a hundred so that we're not truncating the shape of those points. Let's dig into position jitter dodge a little bit. Know that position jitter dodge has two arguments that we generally use. So the first is jitter dot width and we can say 0.2. And then we can also say dodge dot width equals, let's say 0.4. So what jitter width does is it affects the width of the jitter. How much jitter there is from left to right. And so remember that the x-axis jitter is not adding information. It is a random placement of the points on the x-axis. So if we make it a really narrow width, then if we went to like zero, then we'd have a vertical column of points which wouldn't be too helpful. We could make it wider, but then again, the problem that we run into is that our various columns or sets of columns for the different taxa might start to overlap with each other. The dodge width then says how far apart should the different clouds within a grouping be? I'll go ahead and let's see what happens if we expand the jitter width to say 0.4 and let's do dodge width of 0.6. So now we see we have a wider jitter, but also you can kind of see that the points are falling on top of each other between the three different disease status groups. And again, that's because our dodge width wasn't as wide as it we might have liked. So if I put dodge width to be 0.8, now we see that we have a little bit separation and I kind of like this because we have enough separation also between our different disease status groups. And know that I often have to play around with the position jitter dodge arguments to get the right look for what I think looks good. So again, rarely do I come upon a value for jitter width and dodge width that I'm ultimately happy with. So anyway, just know that you have to do a little bit of tweaking on that. You should also be aware that position jitter dodge has a jitter dot height argument and that affects the jitter in the vertical dimension on like the y-axis. So I would strongly discourage using that for a plot like this because you're going to be moving things, basically adding information to the data that's not really there. Finally, one last thing to be aware of is that as we run this multiple times, the points are gonna move around. And so one way that we can cure that problem so to speak is by setting a seed. That position again in the x-axis is random. So up here after my library calls, I like to do set dot seed. And then I'll use 1976, 06, 20. That is my birth date in ISO standard notation, June 20th, 1976. Use one, use your favorite number, use your birth date, whatever. Don't pick a number to give you the results that look the way you want them to look. Our results in general should be robust differences in the value of the seed value. And again, what this is gonna do is every time I run this script, as long as I run that set seed, my points, even though they're in a random position on the x-axis will be at the same random position on the x-axis. And so that helps with reproducibility. So now when I run my entire script, I should get the same appearance of these points run after run. Something that concerns me a little bit about this plot is my legend. It looks fine, right? But one thing that often worries me is that if I'm showing all the data, is there maybe a point behind the legend? I don't know. We should look. Also, it can become a little bit confusing to your viewer if they see points in your legend. And I think, oh, is that an extension of these points somehow? So what I'd like to do is confirm that there's nothing there under the legend. And then I'd like to maybe move the legend kind of more out of the way of the data and put a border around it, put a box around it. So it's clear that the things within this boundary are not data. It's part of the legend. So down in my theme function here, I do have the argument legend.position. This was how I set the position of the legend to be inside the plot rather than in the right margin. Let's go ahead and run this and we'll see if there was anything actually under the legend. And sure enough, there was. There was a diarrhea seed of positive red point there that was no doubt hiding behind the legend. So what I'd like to do is let's go ahead and move that back, but let's lift it up and make sure that wherever we put it, we can see that red point. So again, with legend position, I'll go ahead and the 0.8 is the X position relatively speaking and the 0.6 is the Y position. So let's go ahead and put that up to 0.9 and we now see that that is moved up and out of the way of that red point. Let's go ahead now and put a box around our legend and we can do that also with legend dot background equals element rectangle, rect, and we can then do color equals black and that should give us a black border around our legend. The other thing I can do is fill equals NA and what that will do is that that will give us a transparent legend box. And so there we go. We have that nice black border around our legend. One thing I noticed is that I don't have the top line on my legend. I think that's because for some reason it's leaving a little bit of space in there for the title of the legend. I can fix that by doing legend dot margin equals and then I can say margin and I'll do T equals, I'm actually gonna do say minus 10. I don't know that this will necessarily work but I see from the defaults of margin that the default margin is zero and so let's go ahead and if I set the minus 10 then that should bring it back not up and sure enough that worked pretty well, maybe a little too well. So let's go ahead and add a little bit of margin space and so instead of minus 10, let's do five. That looks pretty good. Now I need some extra space on the right side and the bottom so on the right I'll do three bottom equals three and so that looks pretty good. I'm pretty happy with the spacing of that border around my legend. Let me show you what this fill NA does if we actually put it back down at 0.6 and so there you can see with fill equals NA it makes it transparent. By default the fill is white and so we can see that there's that red point between my healthy and diarrhea there because I've made that transparent. Let's go ahead and put this back up to 0.9. So I like how this looks. I see the points, I see the variation of the data, I see the number of subjects that we have but I don't have an idea of like where the central tendency is in the data. I'd like to put a line segment across my cloud indicating where the median is for the relative abundances of these different filo. To do that, we're gonna come back up and after GM Jitter, I'm gonna use stat.summary. So we talked about stat.summary many episodes ago. I was really excited about this new thing. It had become like my new favorite thing and so we can get segments by doing fun equals median so that's function equals median and then GM equals crossbar which gives you the median line for a box and whisker plot and let's go ahead and add that and let's see where we are after running this. Very cool, we now have our crossbars and they correspond to the three colors of the three disease status groups and like we have seen with that position argument it is stacking them on top of each other. So if we come back up to stat.summary, I can do position equals position dodge like we saw with those bar charts and so now we see that we've got our bars. They're a little bit hard to see because they're the same color as everything else and something I noticed is that the width of my bar is not the same as the width of my cloud and so let's go ahead and see if we can fix that a little bit and so again in position dodge let's go ahead and try width equals 0.4 and so that's bringing those bars in. I think maybe we wanna do something more like 0.8. I'm trying to match my width in position dodge with a value from jitter width dodge width. Maybe I should have just started with dodge width in the first place because that's what that's the value we used with dodge width. So that looks pretty nice. The width is maybe a little bit too wide so let's go ahead and we can then add width as a argument on its own and that will be the width of that segment. So if we do width equals 0.4 I think I made that too small. So again, let's do 0.8. As I said, I generally play with these parameters until I get something close to what I like and so now that segment is the same width as my cloud and it overlaps the cloud nicely. So a challenge with this though is that because the bar is the same color as the point it's kind of hard to see, right? Like this, especially with the gray colors it's hard to see my bar indicating the median. So I'm gonna do a trick. In GG plot, we cannot use the same aesthetic for two types of information. So I can't make the bar black when I'm using color as the plotting symbol color, okay? So what I'm gonna do is up here in geom jitter I'm gonna do pch equals 21. So pch 21 is in that series of plotting symbols where you have a circle for 21 with a border. The border is set by the parameter by the aesthetic color and the color of the circle is set by the parameter fill the aesthetic fill. So what that means then is that if I do fill equals disease stat, then my fill color will be gray, blue, red and my border will be black. My color will be black. And so again, my stat summary that should be black and I'll do scale fill manual instead of color manual. And sure enough, we now see that we have our circles with that border. If you like that black border, awesome. I'm gonna show you what we could do to maybe get rid of that black border while not losing the black bar. What we can do is we can come back up to geom jitter and I'm gonna do stroke equals zero. Stroke is the width of that black border. And now you'll see that we got rid of that border but we still have the black bar for our median. That does make the points appear a little bit smaller. So again, we can come back up. We can do size equals say 1.8 and that gets our points back to being a decent size. So a couple other styling things. I feel like the bar is maybe a little big, a little thick. If I wanna change the thickness of that bar again here in stat summary, we can do size equals 0.25 and that gives a more subtle indicator of the median which I'm pretty happy with. The other thing is that I'm not totally a fan of having the box and the line around my plotting symbols. So what I'll do here is I'll come back and on stat summary, I'll do comma show.legend equals false. And so there we get back to our individual points. And so that all looks good. So I'm pretty happy with the way this looks. This is my preferred way of visualizing relative abundance data from the microbiome. One challenge with this, however, is that we might be on the verge of having way too many points. I think there's 130 healthy points here and something like, I don't know, 100 and 100 of the two diarrhea categories. And so we might be on the verge of having too many points to do this. And so in the next episode, we'll look at other ways of depicting this data but there we're gonna sacrifice then not being able to show all the data. So it's a trade-off, right? The reason I like this again is because I can array everything across the x-axis or if I have a lot of different taxa, I can flip it and put all my taxa on the y-axis with relative abundance going out to the right. Let me know what you think of this representation of relative abundance data. I think this is vastly superior to stacked bar charts or pie charts. And as we go through future episodes, we'll look at other ways of representing data but in the same structure where we're gonna be able to see the variation in the data, again, binned by taxonomic grouping on one axis. Also, as we go along, I know that if we go to the genus level or OTU level, there might be hundreds of different genera or OTUs. We'll start talking about other ways that we can begin to think about filtering our data either by abundance or by significance so that we can only look at those taxa that are most important to us, however we define most important. See if you can take this and flip it 90 degrees so that we have the phyla on the y-axis and a relative abundance on the x-axis. If you're feeling ambitious, go ahead and try that also by using genus rather than phylum. Anyway, I hope you found this interesting. I hope you agree with me that this is a good way, compelling way to visualize relative abundance data. If you have a critique of this, let me know. I would love to work that into a future episode and perhaps we can begin to think about ways to improve this. I don't think it's perfect. I do think there's challenges with this, again, when we have as many points as we do with this. But if you're doing study where you maybe only have 10 samples in each of the groups, I think this would be perfect. Again, there's gonna be little other tweaks that we can do along the way to make it a little bit better. But again, I'm pretty happy with the way this looks. Again, please be sure to engage in those practice exercises that I gave you of different ways to manipulate this or to look at the genus data. I think you'll really hone your skills by doing these practice activities. Please be sure to share this with your PI if they're not watching or if you're not the PI and share it with your colleagues and be sure to kind of discuss these ideas with the people around you. And I would love the feedback again. I'm serious. I am happy to receive constructive criticism on these visuals because nothing is perfect. We can always make things better. So anyway, keep practicing and we'll see you next time for another episode of Code Club.