I love working with data from microbial communities, because microbial communities can be really diverse, with many different populations. As you increase the number of populations, the relative abundances get more and more skewed, so you might have some populations that are maybe around 10, 20, 30 percent of the relative abundance, but then a lot of other populations that are below 1 percent. So how do you visualize all those populations in one figure? It makes for a big headache. Stay tuned for today's episode and I'll show you one approach we might take to overcoming this problem. Hey folks, I'm Pat Schloss and this is Code Club. As microbiologists, we're generally used to looking at data on a log scale. If we're looking at bacterial growth in exponential phase, and we put the growth on a log scale on the y-axis and get a straight line, we say, well, they're exponentially growing, because they're in log growth. What we're going to do today is, instead of putting growth on a log scale, we're going to put relative abundance on a log scale. This will help us pull apart those populations that are down around 0.1, 1, maybe even 10 percent of the community. The problem with having a wide range, or skew, in the relative abundances of your populations in a single figure is that if some things are very abundant, the rare things get compressed against the zero position on the axis where you're representing the relative abundance data. By putting things on a log scale, we can pull things apart visually by making that transformation on the x-axis. There is a trade-off to looking at things on a log scale, though.
As scientists, like I said, we're generally used to looking at things on a log scale, but even as a scientist, I find myself having to pump the brakes in my brain and say, oh, right, this is not a linear scale, it's a log scale. These are not linear increments, they're log increments. So it might go 0.1 percent, 1 percent, 10 percent, 100 percent: tenfold changes across the x-axis rather than 10-percentage-point changes of 10, 20, 30, 40, 50, and so forth, right? I can count. Impressive, huh? Anyway, the task for today's episode is to think about how we can convert our relative abundance data from a linear scale to a log scale. There are a few tricks along the way that I've learned by putting together this episode. This is actually my third time recording it. The first time I had a mistake; the second time I think I forgot to actually press record. So here we are: the third time's the charm. I hope you like it; I've put a lot of work into this. All right, let's come over to RStudio. Here is the code that I'm starting with. You too can get this code by going to the link down below in the description for the blog post associated with today's video, where you can get the starting code chunk. Across the top I also have a link to a video that describes how to install R, RStudio, and the tidyverse, and how to get the data that I'm working with, so that as a first iteration you can replicate what I'm doing. Then you can say, hey, what happens if I do this, and go change something. And the awesome thing would then be to take your own data, plug it into this code, and see if you can get a similar plot for your data. That would be a win, right? That's what I'm really hoping for you from this episode. Okay.
So we've seen this code in a number of past episodes, where we've been building it out. We read in our libraries, our metadata, our OTU data, and our taxonomy data; we join it all together and get relative abundance data. We then do some cleaning up of the names of our taxa. I hate underscores in taxonomic names that then get published in figures; it's horrible. Clean up your underscores, clean up your capitalization. Anyway, we then look at the mean relative abundance for each of our taxa across three disease status groups. These data were originally taken from a study looking for microbial biomarkers to indicate the presence of Clostridioides difficile in human patients. So we have three groups of people: healthy people with no C. diff, people with diarrhea but no C. diff, and people with diarrhea and C. diff. We're looking across those three groups for the populations whose average relative abundance across all the subjects in a category is over 3 percent. If it's less than 3 percent, we pool them together. We did this pooling because if the relative abundance is less than 3 percent, it's really hard to see that it's less than 3 percent, right? It kind of looks like it's 0 percent. Maybe we'll come back and revisit this before the end of the episode, and certainly in future episodes. We then join all this together and make a pointrange figure using stat_summary. What pointrange does is give you a point to indicate the median, and then we're giving it a 50% confidence interval, the interquartile range between the 25th and 75th percentiles. We're also doing some cleanup of the visual to make it look nice. So let's run all this and I'll show you what we get. Here we go. Again, we've got our pointrange: the ball indicates the median.
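The pointrange-with-stat_summary approach described above can be sketched as follows. This is a minimal, self-contained example, not the exact code from the blog post: the column names (`taxon`, `rel_abund`, `disease_stat`) and the toy data are assumptions, and `median_hilow` requires the Hmisc package to be installed when the plot is drawn.

```r
# Hypothetical stand-in for the joined relative-abundance table; the real
# data come from the blog post linked in the description.
library(ggplot2)
library(tibble)

set.seed(1)
rel_abund_df <- tibble(
  taxon = rep(c("Bacteroides", "Blautia"), each = 30),
  disease_stat = rep(rep(c("Healthy", "Diarrhea", "C. difficile"), each = 10), 2),
  rel_abund = runif(60, 0, 30)
)

p <- ggplot(rel_abund_df, aes(x = rel_abund, y = taxon, color = disease_stat)) +
  stat_summary(fun.data = median_hilow,          # point = median (needs Hmisc)
               fun.args = list(conf.int = 0.5),  # range = 25th to 75th percentile
               geom = "pointrange",
               position = position_dodge(width = 0.6))
```

`conf.int = 0.5` is what turns the default 95% interval of `median_hilow` into the interquartile range.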
The range indicates the 50% confidence interval, from the 25th to the 75th percentile. And like I was saying, for the Bacteroides we have some balls that are maybe around 28 or 30 percent. But for most of our populations, from the Alistipes or maybe the unclassified Enterobacteriaceae on down, they're really rare, even though the average in one of the three disease groups for each of these taxa was over 3 percent. Note that I'm representing the median but pooling on the mean; we'll clean that up before the end of the episode. What I want to drive home here is that these populations are rare, but we can't see whether they're much different from zero, how different from zero they are, or really how different they are from each other, because they're compressed against the zero relative abundance point that anchors our x-axis. Coming to our ggplot code, something simple we can do to put things on a log scale is to add scale_x_log10. We'll add that and we get a warning message. The warning is because we have relative abundances that are zero, and the log of zero is negative infinity. If I come down to my console and do log(0), I get -Inf, right? And ggplot doesn't know what to do with that. So it's saying it removed the 1,571 rows where we had zero relative abundances. To solve this, what I'd like to do is replace all those zero values with a number smaller than our limit of detection, so that I can still represent them on the plot without all these problems, while keeping them as part of the calculations of our median and interquartile range. To do that, I'm going to come way back up to the top, where we have our OTU counts data frame. As you can see, we have columns in this data frame for the sample ID, the OTU, and the number of sequence counts for that OTU.
And what I want to know is how many reads we have per sample, because one divided by that number is going to be our limit of detection. So I'll pipe that to group_by on the sample ID, and then summarize with n equal to the sum of count, and .groups = "drop". Now I see that, at least for the first 10 or so samples, I have 1,507 sequences per sample. To make sure everything has the same number, I'll do a count on n, and I see that 338 samples had 1,507 sequences, which is a win. I'll then do pull on n, which takes the n column and turns it into a vector; we get 1,507. I'll name this vector, with one thing in it, n_seqs_per_sample. And to add a test, I can do stopifnot on length of n_seqs_per_sample equals one, which runs without any errors. So why might we get an error? Well, if we hadn't subsampled the data (I'm not going to talk about subsampling here; I did that upstream of this data set), then all of my samples would have different numbers of sequences, and the analysis I'm doing assumes that everything has the same number of sequences. So that's a little test, so that if we repurpose this with other data, we make sure all of our data has the same number of sequences. Okay. Now we can also define a limit of detection, lod, as one divided by n_seqs_per_sample. I'm going to multiply this by 100 because, down below, my relative abundances are percentages; multiplying by 100 gives me the LOD as a percent. So my limit of detection is 0.066 percent. Now I want to come back down into my code and replace zeros with a number smaller than my limit of detection. I'm going to do that in the final inner_join, where we're getting everything ready before we plot it, in the first part of the chunk where we join everything together and plot.
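The limit-of-detection calculation narrated above might look like this. The toy `otu_counts` data frame and its column names are assumptions based on the narration; in the real data every sample was subsampled to 1,507 reads.

```r
library(dplyr)
library(tibble)

# Toy stand-in for the otu_counts data frame: every sample sums to 1,507 reads
otu_counts <- tibble(
  sample_id = rep(c("A", "B", "C"), each = 2),
  otu = rep(c("OTU_1", "OTU_2"), times = 3),
  count = rep(c(1000, 507), times = 3)
)

n_seqs_per_sample <- otu_counts %>%
  group_by(sample_id) %>%
  summarize(n = sum(count), .groups = "drop") %>%  # total reads per sample
  count(n) %>%                                     # one row per distinct total
  pull(n)

# Guard: the downstream analysis assumes equal sequencing depth everywhere
stopifnot(length(n_seqs_per_sample) == 1)

# Limit of detection, expressed as a percent (100 / 1507, about 0.066)
lod <- 100 * 1 / n_seqs_per_sample
```

Because `count()` is applied to a column already named `n`, dplyr stores the tallies in a new column and `pull(n)` returns the distinct per-sample totals, so the `stopifnot()` fails if depths differ.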
I'm pooling together those taxa where the average relative abundance is below 3 percent in all three treatment groups, keeping any taxon that's above 3 percent in at least one group. So I'm going to add another mutate, setting rel_abund with an if_else: if rel_abund equals zero, then replace it with two-thirds times my limit of detection; otherwise, just stick with rel_abund. Okay. Now we can add this in, and if we run it, we shouldn't get any error messages. No error messages, and we get our plot, and our x-axis is now on a log10 scale, which looks really nice. We see a lot more separation of those low abundance populations. Those populations that were at or below 1 percent all looked like they were stacked on top of each other; now that things are on a log10 scale, it's easier to see the separation, and to see that they're distinct from our limit of detection. Something I want to call your attention to, however, is that we used scale_x_log10. There's another option we might take, which is coord_trans with x = "log10". That gives the same output here; the x-axis styling is a little different, but the output is the same. I'm going to encourage you to use coord_trans. Remember I said I recorded this three times? Well, the first time was because I wasn't aware of coord_trans. I only learned about stat_summary a few episodes ago, when someone in my lab led a Code Club for our research group about stat_summary, and since then it's been my new favorite thing. I didn't know, however, that stat_summary operates after scale_x_log10. What that means is that if you're calculating the mean, you're doing it after ggplot takes the log10 of all your values. So let's look down here in our console. Say I have a vector that I'll call x, containing one through five.
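The zero-replacement step can be sketched like this. The column name `rel_abund` and the toy values are assumptions; the choice of two-thirds of the LOD is simply a placeholder value below the limit of detection so that zeros survive the log transform.

```r
library(dplyr)
library(tibble)

lod <- 100 / 1507   # limit of detection as a percent, computed earlier

# Hypothetical relative abundances, including a zero
taxa <- tibble(rel_abund = c(0, 0.5, 12))

# Replace zeros with a value just below the LOD so they can be log-scaled
taxa <- taxa %>%
  mutate(rel_abund = if_else(rel_abund == 0, 2/3 * lod, rel_abund))
```

After the mutate, the zero becomes roughly 0.044 percent, which plots just below the 0.066 percent detection line.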
If I do mean(x), the mean is three, right? Well, if I do log10(x), I get all the logged values, and if I pipe that to mean, I get about 0.42. If I then raise 10 to that power, I get about 2.6, which is not what we should be getting; we should be getting three. But that series of operations, taking the log of all our numbers, then the mean, then raising 10 to that mean, is what happens if we use scale_x_log10 with stat_summary, because stat_summary operates after the scaling functions. In contrast, coord_trans works after stat_summary. It's an important distinction. It doesn't really matter for median_hilow, because if I do median on log10(x), I get 0.477, and raising 10 to that gives me three; and if I do median(x), I also get three. So the log scaling doesn't affect the median, but it does affect the mean. To give you a sense of what this looks like, I'm going to switch, for the time being, to mean_se, and I'll get rid of the confidence interval argument because the range will now be set by the mean plus or minus the standard error. The mean doesn't make sense for our data, because the distribution is so skewed, but it will help illustrate the point. I'll also make the scaling on the x-axis the same in both versions: we'll do scale_x_continuous with breaks at 0.1, 1, 10, and 100. Let's assign this to schubert_genus. This was my version with coord_trans. Now let's go back and look at what happens if we use scale_x_log10 instead, with the same breaks, so we're making an apples-to-apples comparison.
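The console demonstration above can be reproduced in base R. Averaging logged values and back-transforming gives the geometric mean, not the arithmetic mean, while the median is unaffected by the transform:

```r
x <- 1:5

mean(x)               # arithmetic mean: 3

# Mean computed on the log scale, then back-transformed: the geometric
# mean, about 2.61, not 3
10^mean(log10(x))

# The median commutes with the log transform, so both of these give 3
10^median(log10(x))
median(x)
```

This is exactly the discrepancy you get when `stat_summary` computes a mean after `scale_x_log10` has already logged the data.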
The alignment of the x-axes isn't perfect, but you can hopefully see there are really big shifts in the data when we use coord_trans versus scale_x_log10. You want to use coord_trans if you're using stat_summary. Again, the difference is that scale_x_log10 operates on the data before stat_summary runs, whereas coord_trans is applied after stat_summary. So I'm going to turn off the scale_x_log10, turn my coord_trans back on, delete the mean_se version, and convert this back to median_hilow. Again, because the median is robust to log scaling, we didn't see a difference using the median. But because I know people watch these episodes and apply them to other problems, I wanted to make sure you knew there is a difference if you're using the mean. If I'm using stat_summary in the future, I will continue to use coord_trans, because sometimes we switch back and forth between different summary statistics that we want to plot, and I think it's safest to use coord_trans with log10 rather than scale_x_log10 or scale_y_log10. We still have scale_x_continuous, which doesn't transform the data; it presents the data as they are on the x-axis, and we can use it to put in different styling. So again, this is what our plot looks like. As always, there are a number of things I'd like to change. One thing this has me thinking about, since we've been talking about means and medians, is that when I pooled the data way back up here with taxon_pool, I was looking at the mean relative abundance across individuals within each of the three disease status groups for each of my taxa. If I'm plotting the median, I should probably be using the median throughout, not switching back and forth between the mean and the median.
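The contrast between the two approaches can be sketched side by side. The toy data and column names are assumptions; `mean_se` is a built-in ggplot2 summary function.

```r
library(ggplot2)
library(tibble)

df <- tibble(taxon = "A", rel_abund = c(0.1, 1, 10, 100))

# scale_x_log10() transforms the data BEFORE stat_summary() runs, so the
# "mean" is the mean of logged values (a geometric mean)
p_scale <- ggplot(df, aes(x = rel_abund, y = taxon)) +
  stat_summary(fun.data = mean_se, geom = "pointrange") +
  scale_x_log10()

# coord_trans() transforms only the axis, AFTER stat_summary() has computed
# its summaries on the raw data
p_coord <- ggplot(df, aes(x = rel_abund, y = taxon)) +
  stat_summary(fun.data = mean_se, geom = "pointrange") +
  coord_trans(x = "log10")
```

For these four values the arithmetic mean is 27.775 while the geometric mean is 10^0.5, about 3.16, so the point lands in very different places on the two plots.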
So what I'll do is find and replace mean with median. And now we have fewer populations where the median for any of the three groups was greater than 3 percent. What I'll do now is come back up to my pooling step and, instead of 3 percent, use 1 percent. Now we have more populations depicted, where at least one of the three disease status groups has a median value greater than 1 percent. You can see that the Parabacteroides for healthy people is above 1 percent, the Alistipes is above 1 percent, and so forth. Okay, so we're in good shape there. The next thing I'd like to do, before I forget, is add a line to indicate the limit of detection. I'll do that back up here, after my stat_summary. I'll add a geom_vline; geom_vline draws a vertical line (geom_hline would draw a horizontal line). geom_vline's argument is xintercept, and I'll set that equal to my limit of detection. This gives me a strong vertical black line to indicate the limit of detection at 1 divided by 1,507, times 100 percent; I think it was about 0.066 percent. One thing I'm not totally thrilled about with this line is that it has the same appearance as my y-axis, so I'm going to make it a little thinner to distinguish it. Also, if you zoom in and look really closely, you'll notice that the black line sits on top of my range lines. I can change both of those things. First, I can change the drawing order by putting the geom_vline after ggplot() and before the stat_summary, so the vertical line is drawn first and the ranges are drawn on top of it. Second, I can set its size to 0.2. This gives me a thinner black line behind the ranges of my data, which looks pretty nice for indicating my limit of detection. In the legend for this figure, I'd probably want to state that this was my limit of detection.
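The limit-of-detection line can be sketched like this, with toy data and assumed column names. Listing `geom_vline` before `stat_summary` draws it first, so the pointranges sit on top of it; note that newer ggplot2 versions prefer `linewidth = 0.2` over `size = 0.2` for line thickness.

```r
library(ggplot2)
library(tibble)

lod <- 100 / 1507   # limit of detection as a percent

df <- tibble(taxon = rep(c("A", "B"), each = 5),
             rel_abund = c(0.1, 0.3, 0.5, 1, 2, 2, 4, 6, 8, 10))

p <- ggplot(df, aes(x = rel_abund, y = taxon)) +
  geom_vline(xintercept = lod, size = 0.2) +   # drawn first, sits behind
  stat_summary(fun.data = median_hilow,        # needs Hmisc when rendered
               fun.args = list(conf.int = 0.5),
               geom = "pointrange") +
  coord_trans(x = "log10")
```

Layer order in ggplot2 is simply the order the layers are added, which is why moving the `geom_vline` earlier pushes it behind the ranges.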
The limit of detection was 0.066 percent, or something like that. All right, the next thing I want to take on is our legend. It's like it's always in the way, and I've been bumping it around in recent episodes. I'm going to move it outside of the plot, finally, and I can do that by commenting out the legend.position argument in the theme function. I'm also going to make my figure a little wider, say seven inches. So now you can see that the legend is outside of the figure. One of the reasons I like to have the legend inside the figure is that now we have all this extra white space; it just takes up room, right? It's kind of dead space that I'm not totally thrilled about. Sometimes people put a horizontal legend across the top of the figure; I don't really like how those look either. I'm kind of picky, aren't I? Anyway, we'll leave it there for now, and I think that looks pretty decent. Okay. One other thing I don't like is that on the x-axis, the labels are 0.1, 1.0, 10.0, and, well, actually I don't have 100. I would like to have the 100, and I don't want the labels to read 1.0, 10.0, 100.0; I want 1, 10, 100, without the trailing .0. Again, I mentioned that with coord_trans we can use scale_x_continuous to stylize the appearance of our x-axis. So in addition to breaks, I'm going to put in labels; I'll have both breaks and labels. I also noticed that I didn't have the 100 position, and that's probably because I need to set my limits. My limits are going to go from NA to 100, and what that means is that I want 100 to be included; the NA means pick the smallest value that's appropriate for the data. We've done this in the past with, say, a bar plot, where we used limits of zero to NA.
There we wanted to ensure that zero was represented on the y-axis, but we wanted the upper bound picked based on the data. So again, NA says pick that bound, whether lower or upper, from the data; in this case, use 100 as the upper bound. And now our data do go out to 100, and we have nice, simple styling of our x-axis. I think this visual looks pretty solid. If you had the axes flipped, it would be the same syntax, but instead of x = "log10" in coord_trans you'd use y = "log10", and instead of scale_x_continuous you'd have scale_y_continuous. One of the things I like about this, again, is that it gives us more separation for the low abundance populations. If you go back and look at what things looked like with a linear x-axis scale, these populations were jammed against the left side of the plot. In fact, we can look at that by commenting out the coord_trans. See, a lot of things are jammed along the left, right? And of course I still have the axis going to 100, but you get the idea: for the Blautia, the Parabacteroides, the Alistipes, you don't know whether those values are zero or not. By including the coord_trans, we can put things on the log scale and see that, yes, the median for Alistipes in people with C. diff, and the median for Enterococcus in healthy people, are zero, but otherwise these median values are all greater than zero. And again, the difference between 100 percent and 0.1 percent is a few logs, right? A population at 0.1 percent might still be something like 10^6 cells in a sample, so it's still a considerable amount of bacterial load. Anyway, I think this works well for scientists.
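The axis styling described above can be sketched like this, with toy data and assumed column names. The key pieces are `limits = c(NA, 100)`, where NA lets ggplot2 pick the lower bound from the data while 100 is forced to be included, and `labels` that drop the trailing ".0":

```r
library(ggplot2)
library(tibble)

df <- tibble(taxon = rep(c("A", "B"), each = 3),
             rel_abund = c(0.2, 0.5, 1, 20, 40, 80))

p <- ggplot(df, aes(x = rel_abund, y = taxon)) +
  stat_summary(fun.data = median_hilow,        # needs Hmisc when rendered
               fun.args = list(conf.int = 0.5),
               geom = "pointrange") +
  coord_trans(x = "log10") +
  scale_x_continuous(limits = c(NA, 100),            # NA: lower bound from data
                     breaks = c(0.1, 1, 10, 100),
                     labels = c("0.1", "1", "10", "100"))
```

Because the transformation lives in `coord_trans` rather than in the scale, `scale_x_continuous` is free to handle only the cosmetic pieces: breaks, labels, and limits.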
As I was saying earlier, if we were to give this to the general public, who perhaps aren't used to looking at things on a log scale, this might be where we start to get some confusion. I can look at the Bacteroides and see that this difference isn't one unit; it's actually a tenfold difference in relative abundance, which is a much bigger difference, perhaps, than a viewer might otherwise think. The absolute difference from healthy to diarrhea for Bacteroides is much larger, perhaps, than the difference for Alistipes, but this isn't a linear difference. The difference between red, blue, and gray for Bacteroides is tenfold, whereas you might look at the x-axis and say, oh, that's only about a three percent difference for Alistipes. And that's true, but Alistipes is close to a hundredfold more abundant in healthy people than in people with diarrhea, whereas Bacteroides differs by maybe tenfold, right? Anyway, I hope you really are working in parallel with me through this code to build this figure, and that you go back and start tweaking things, experimenting with things. What would this look like if you did it at, say, the phylum level, or another taxonomic level? I'd really encourage you to go back and give that a shot. If you want to flip the axes to put the relative abundance on the y-axis and the names on the x-axis, that would also be a great exercise. And of course, the best thing you can do is take what we've talked about today and apply it to your own data, so that you can have a visual like this that is customized to your specific research question. That would make me feel so good, to know that you had done that. If you have other questions about how to work with data in R for some type of analysis you're doing or some type of visual, or if you say, hey, Pat, what if we were to do this with the figure?
I would love to hear that as well. Please be sure to like this video and subscribe to the channel. Don't keep it a secret: tell your PI about what we're doing here, tell other people in your research group about what's going on. I would love to have more people involved and more people interacting with me as I post these videos and solicit your feedback. Anyway, keep practicing, and we'll see you next time for another episode of Code Club.