It's the middle of May, but I live in southeastern Michigan and it still freezes at night. Let's warm things up a little bit today by talking about heat maps. Hey folks, I'm Pat Schloss and this is Code Club. Hopefully that will be the last dad joke of the episode, but you know what, I really can't help myself sometimes.

Anyway, if you've been following along in recent episodes of Code Club, you know that we've been marching our way through visualizing different types of data generated by the microbiome field. We've talked about ordinations, we've talked about alpha diversity metrics, and now we've been talking about relative abundance data from our microbial communities. In the most recent episodes, I've looked at stacked bar charts and pie charts. Both of those approaches leave a lot to be desired. Today we're going to talk about heat maps.

If we think back to those stacked bar charts, one of the struggles is that as you get more and more taxa, or more and more categories that you're trying to characterize as parts of a whole, those rectangles, or the slices on a pie chart, keep getting smaller and smaller. They also need to be depicted by more and more colors, and as you add colors, it gets a lot harder to discriminate between them. A heat map is one approach that we might take to solving those problems with stacked bar charts and pie charts. On the x-axis, we can put a variable; in our case, the disease status. On the y-axis, we can put our different taxa. The values that are plotted are rectangles, or tiles as we will call them. The fill will be set according to the relative abundance. So we might get a darker color, or a lighter color depending on your perspective, if you have a higher relative abundance.
And then the opposite if you have fewer of that type of bacteria present, or whatever the category is that you're representing. Another way of thinking about it is that we're shifting the fill color that we had for the stacked bar chart to the y-axis and using the y-axis from the stacked bar chart as the fill color for the heat map. So this solves that problem, like I said, of having too many wedges in a stacked bar chart or a pie chart, and the problem of having too many different colors making it hard to discriminate.

What's the problem with the heat map? Well, a big problem is that it becomes very difficult to distinguish between different colors along a continuous gradient, right? Discriminating, say, between 20 and 30 percent red is really challenging. Also, we perhaps remember from that dress, the blue and black, or was it gold and black, dress, that people perceive colors differently depending on the colors that surround them. So color really isn't a great pre-attentive attribute for encoding a quantitative variable. I think it works pretty well for a qualitative variable, but perhaps not so well for a quantitative variable.

Let's go over to RStudio, where I can show you how we can build these heat maps to overcome some of the problems that we saw with stacked bar charts and pie charts. We'll also look at some of the problems that we now have because we're mapping a quantitative variable to a color, and perhaps some of the other challenges that we still face with heat maps. You can get the code that I'm working with from a blog post associated with today's episode. The link for that blog post is down below. Also, if you want the same setup that I've got, there's a video episode that I'm linking to up above in the video. Go watch that video, get things set up, get the data, get the packages, everything you need to play along with me.
I really encourage people to get the data and work alongside me, either with my data or with your own; that's really the best way to learn this material.

Again, this is the code we're starting with, and it should be fairly familiar to those of you who have been following along in the past few episodes. We load our libraries, we get our metadata, our OTU counts, and our taxonomy information, and we join that all together and make it tidy. We then calculate the taxon relative abundance. We're doing that at the level of phylum. Perhaps by the end of the episode, we'll go down to the level of genus. We're doing some styling so that we can get taxonomic names to be italicized. And then we are pooling any phyla or taxa where the maximum abundance of that taxon in any of our treatment groups is less than 3%. We then join that all together.

What this code builds at the moment is a stacked bar chart. We're not going to build a stacked bar chart, but what I want to show you is that it's fairly straightforward to pivot from a stacked bar chart to a heat map. This geom call here is doing all the hard work of building that stacked bar chart. What we can put in here instead is geom_tile, and geom_tile will create the heat map. You can think of a heat map as a mosaic of rectangles, right? So we're going to change our aesthetics. x is still going to be our disease status. Our y, however, is not going to be mean_rel_abund. That's going to be our fill. So our fill will be mean_rel_abund, and our taxon will now be on the y-axis. I'm going to go ahead and turn off some of these scales, because I'm pretty sure they're all going to bonk on us, and also those labs, because they won't quite make sense. We'll come back and play with that later. And I will change my file name to be Schubert heat map. We're getting one of these set viewport errors. I think that comes in from our theme. So let me go ahead and comment that out for now along with that plus sign.
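For anyone following along without the episode's data, here is a minimal sketch of that swap from a stacked bar chart to a heat map. The data frame is a made-up stand-in for the summarized table, and the column names disease_stat, taxon, and mean_rel_abund are assumptions based on how they are spoken in the episode.

```r
library(ggplot2)

# Toy stand-in for the summarized table; the values are invented for illustration
taxon_rel_abund <- data.frame(
  disease_stat = rep(c("Healthy", "Diarrhea, C. diff negative",
                       "Diarrhea, C. diff positive"), each = 3),
  taxon = rep(c("Firmicutes", "Bacteroidetes", "Other"), times = 3),
  mean_rel_abund = c(45, 50, 5, 60, 30, 10, 55, 35, 10)
)

# Stacked bar chart: abundance on the y-axis, taxon as the fill color
bar_chart <- ggplot(taxon_rel_abund,
                    aes(x = disease_stat, y = mean_rel_abund, fill = taxon)) +
  geom_col()

# Heat map: taxon moves to the y-axis and abundance becomes the fill color
heat_map <- ggplot(taxon_rel_abund,
                   aes(x = disease_stat, y = taxon, fill = mean_rel_abund)) +
  geom_tile()
```

Printing heat_map in an interactive session draws the tiles with ggplot2's default gradient, where darker blue means a lower value.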
And here we go. We have our heat map. We have our taxa on the y-axis, the disease status on the x-axis, and the coloring by mean_rel_abund. This is opposite my intuition. I typically think of dark as being more, or being more complete, more full. This is the opposite. Where it's darkest, the dark blue here, the taxon is absent, and where it's lighter, there's more.

There are a few things that we need to clean up in here. Let's start with the text on our x- and y-axes. I'm pretty sure it isn't rendering properly because we turned off the theme function. So I'm going to go ahead and remove these comments so we can get that executed as part of our pipeline. We need axis.text.x = element_markdown() as well as axis.text.y = element_markdown(). And we now see that our taxa names are italicized and our disease status is also italicized. So that looks great.

The next thing I'd like to do is clean up our x- and y-axis labels as well as the name of our legend. Under labs, I'm going to go ahead and cut out the y value and put NULL. Then I'll add scale_fill_gradient, and I will put name equals mean relative abundance. Let's see what this looks like. And so, great. We no longer have that x- or y-axis label. Our title here is quite large. I'm going to go ahead and put mean relative abundance on separate lines. So I'll insert br, the HTML tag to add line breaks, and put that there. I'm also going to do legend.title = element_markdown(), and I'll do size = 8. And so that looks smaller. It's kind of funky in the way it's justified. So we could also do legend.title.align. And I misspelled legend, of course. We'll then say 0.5, so you give it a numerical value. And we now see mean relative abundance with a percent. It kind of looks funky with that percent there. I think what I'll do is make that relative abund percent. So this looks pretty good with that legend.title.
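The label and legend cleanup described above might look roughly like this. It is a sketch rather than the episode's exact code: the toy data, the asterisk-wrapped taxon names, and the exact legend wording are assumptions. It also uses hjust inside element_markdown() to center the legend title, because recent ggplot2 releases deprecate legend.title.align in favor of the element's own hjust.

```r
library(ggplot2)
library(ggtext)  # provides element_markdown()

# Toy data; taxon names are already wrapped in * so markdown renders them in italics
toy <- data.frame(
  disease_stat = rep(c("Healthy", "Diarrhea"), each = 2),
  taxon = rep(c("*Firmicutes*", "*Bacteroidetes*"), times = 2),
  mean_rel_abund = c(45, 50, 60, 30)
)

p <- ggplot(toy, aes(x = disease_stat, y = taxon, fill = mean_rel_abund)) +
  geom_tile() +
  # <br> breaks the legend title across lines once it is rendered as markdown
  scale_fill_gradient(name = "Mean<br>Relative<br>Abund (%)") +
  # NULL removes the axis titles entirely
  labs(x = NULL, y = NULL) +
  theme_classic() +
  theme(
    axis.text.x = element_markdown(),
    axis.text.y = element_markdown(),
    legend.title = element_markdown(size = 8, hjust = 0.5)
  )
```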
Something else that's a little out of whack here is the order of my phyla. You'll recall that in our code here where we're mutating the taxon, we ordered it as a factor in decreasing order of relative abundance so that we could then use fct_shift. That way, on a stacked bar chart, the phyla anchored at 100 and at 0 were the most abundant ones. And with the pie chart, we would have those two phyla at the top, on either side of the 12 o'clock to 6 o'clock axis of a clock face, so that they'd have a common anchor point. We don't need this fct_shift anymore, so we'll go ahead and remove that. I'm not sure whether that descending sort needs to be TRUE or FALSE, so let's run it and see what we get. And so we have Other at the top. I think I would prefer to have Firmicutes and Bacteroidetes up at the top because, again, as you come to this figure, you're going to start in the top left with your eyes and read back and forth in a Z pattern. So yeah, we want that .desc to be FALSE, which is the default value, by the way. And so now we have Firmicutes up at the top, then Bacteroidetes, Proteobacteria, Verruco, and Other.

The next thing that I want to take on in this figure is the color scheme. I am not a fan of having dark represent zero. I tend to think of zero as being empty, as not having anything. When I think empty I think white, and then as I add more and more abundance I want to fill in with more color. So I'm going to have it go from, say, white to red, which is the traditional heat map color for hot, right, for lots of stuff. We can do that back up here in scale_fill_gradient. I'll use low, and I could give it "white" for the lower bound, but I'm going to give it the hexadecimal, which is #FFFFFF. The first two characters are red, the next two are green, and the next two are blue. And then for high, because that first pair is red, I'll do #FF0000.
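Here is a small sketch of those two ideas: ordering the taxa so the most abundant one lands at the top of the y-axis, and running the fill from white to red. The episode's code appears to rely on forcats; base R's reorder() is swapped in here so the example needs only ggplot2, and the data are invented.

```r
library(ggplot2)

# Toy data; the values are invented for illustration
toy <- data.frame(
  disease_stat = rep(c("Healthy", "Diarrhea, C. diff negative",
                       "Diarrhea, C. diff positive"), each = 3),
  taxon = rep(c("Firmicutes", "Bacteroidetes", "Other"), times = 3),
  mean_rel_abund = c(45, 50, 5, 60, 30, 10, 55, 35, 10)
)

# Order levels by mean abundance, ascending. A discrete y-axis draws the first
# level at the bottom, so the most abundant taxon ends up at the top.
toy$taxon <- reorder(factor(toy$taxon), toy$mean_rel_abund, FUN = mean)
taxon_order <- levels(toy$taxon)

p <- ggplot(toy, aes(x = disease_stat, y = taxon, fill = mean_rel_abund)) +
  geom_tile() +
  # Hex codes are red, green, blue pairs: #FFFFFF is white, #FF0000 is pure red
  scale_fill_gradient(low = "#FFFFFF", high = "#FF0000")

# The hex codes and the named colors resolve to the same RGB values
same_white <- all(col2rgb("#FFFFFF") == col2rgb("white"))
same_red <- all(col2rgb("#FF0000") == col2rgb("red"))
```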
I like using the hexadecimal because it gives me a little bit more power in picking my colors. There are lots of really great apps out there for picking colors, and a lot of those apps aren't using named colors like thistle or sienna or burnt orange or whatever. They're using hexadecimal. So I prefer to work in hexadecimal. It's perhaps not as readable, but it's a little bit more powerful in allowing you to pick colors that you like.

You can see now that our heat map goes from white at zero to red at the highest value, probably these two squares for Firmicutes and Bacteroidetes. And you really quickly see, again, one of the challenges with a heat map. Of these four colors, I think this square here, the Bacteroidetes C. difficile positive, is darker than the other three around it, but I'm not so sure. I'm also not confident that it's that different of a value than I have for Firmicutes healthy. Certainly the Proteobacteria among healthy, and then the Verruco and Other among the three groups, are really difficult to distinguish. So perhaps this really only works best where you have big differences in relative abundance, right? Like, I know the Bacteroidetes in healthy is much higher than it seems to be for diarrhea C. diff negative and C. diff positive, and that the Firmicutes is lower for healthy than it is, I think, for these other diarrhea samples. But again, think about that perception issue: we perceive color depending on the other colors around it. That's something that we want to keep in the back of our heads as we go along, and we'll see a strategy for how we can address that before we're all done.

Let's go ahead and remove these axes. I can do that again by doing axis.line = element_blank(), and I'll also do axis.ticks = element_blank(). So that got rid of those, and that looks pretty nice. We could maybe tighten up the distance between the labels and the heat map by coming back up to our scale_fill_gradient.
And we could then do expand = c(0, 0). We saw that previously with the stacked bar charts. And then up here we could do the same thing with our scale_x_discrete: expand = c(0, 0). And so that makes it a little bit tighter of a visual. So this looks fairly decent.

One thing that I'm thinking about is that the squares in my heat map are pretty big. Another thing that I can add to this is coord_fixed, and in coord_fixed I can give it ratio. The default ratio for coord_fixed (that's coord, not core) is one, so that the increment of the y-axis is the same length as the increment of the x-axis. If I make it 2, then I believe I'm going to have tiles that are taller than they are wide. And yeah, we see now that our tiles are taller than they are wide. I think I'd rather have them wider than they are tall, so let's go ahead and do, say, 0.5. That's a little bit more truncated. Let's go down to, say, 0.3. And I'm pretty happy with the way that looks. I have a lot of extra space now around my heat map, so I'm going to go ahead and shorten the height to 2 inches. And yeah, we now see that's a bit tighter, and that looks pretty nice.

I would like to change the appearance of my legend so that it has a 0 mark. I can get that back up here again with scale_fill_gradient, where I can do limits = c(0, NA). And so now I have 0 at the bottom of my legend's range. One thing I would like to do is make the font on my legend a little bit smaller. It's pretty big and pretty pronounced, and aesthetically I just don't like that. I would also like to shrink the legend a little bit so it's not so tall and doesn't take up the full height of the figure. We'll come back here to legend.text and legend.key.size. I'm going to set legend.text to element_text, and I will then say size = 7. Instead of legend.key.size, I'm going to do legend.key.height = unit(10, "pt").
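The layout tweaks from this stretch of the episode can be collected into one sketch. The data are invented, and removing the padding with scale_y_discrete(expand = c(0, 0)) is an assumption about how the gap next to the taxon labels gets closed; the episode only names scale_fill_gradient and scale_x_discrete.

```r
library(ggplot2)

# Toy data; the values are invented for illustration
toy <- data.frame(
  disease_stat = rep(c("Healthy", "Diarrhea, C. diff negative",
                       "Diarrhea, C. diff positive"), each = 3),
  taxon = rep(c("Firmicutes", "Bacteroidetes", "Other"), times = 3),
  mean_rel_abund = c(45, 50, 5, 60, 30, 10, 55, 35, 10)
)

p <- ggplot(toy, aes(x = disease_stat, y = taxon, fill = mean_rel_abund)) +
  geom_tile() +
  # limits = c(0, NA) pins the legend at 0 and lets the data set the upper end
  scale_fill_gradient(low = "#FFFFFF", high = "#FF0000", limits = c(0, NA)) +
  # expand = c(0, 0) removes the padding between the tiles and the axis text
  scale_x_discrete(expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  # ratio < 1 makes each tile wider than it is tall
  coord_fixed(ratio = 0.3) +
  theme_classic() +
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    legend.text = element_text(size = 7),
    legend.key.height = unit(10, "pt")
  )
```

With tiles this flat, saving the figure at a shorter height, such as the 2 inches mentioned above, trims the empty space around the plot.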
And so now we see that we have our legend going from 0 up to 50 or so. We've got our title there, and that looks pretty nice. So this is, again, a fairly simple heat map.

One of the things that I questioned was whether or not these four tiles are all that different in relative abundance, and whether they're that different from the healthy Firmicutes. And again, thinking about the Verrucomicrobia and the Other, I can tell that the healthy is 0 or close to 0 for Verruco. But what about these others? Is there much variation there? One strategy that people have talked about is actually putting the number, the relative abundance, in the different squares. I'll come back up to geom_tile, and after that I'll put geom_text, and I will do aes, and for my label I will say mean_rel_abund. Let's add that to our pipeline. We now see that we've got our numbers in there, but there's a bazillion significant digits. So we need to do some formatting to clean that up. I will do that here with round on mean_rel_abund, rounding to one digit after the decimal point. I also need to wrap this in a format function, because if a number is rounded to, say, 3.0, then the round output will actually print as 3 and it won't have that one digit to the right of the decimal place. So we'll do nsmall = 1 there. Let's make sure we've got all our parentheses in a row, and we're good.

Now our heat map has our rounded values in it, and that's really nice. I'll call your attention to this 57.0: if I hadn't used format with nsmall = 1, this would have been 57 without the period and the zero. So we definitely wanted that nsmall = 1 with the format function. The need for the numbers highlights a problem that I have with heat maps, in that we don't do a great job of distinguishing between different colors.
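The round-then-format step is easy to check on its own in base R. The three input values below are invented; they are chosen so that one of them rounds to a whole number.

```r
# Invented abundances; the first one rounds to a whole number
x <- c(57.04, 27.13, 36.5)

# round() alone drops the trailing zero once the number is turned into text
plain <- as.character(round(x, 1))

# format() with nsmall = 1 keeps at least one digit after the decimal point
labelled <- format(round(x, 1), nsmall = 1)

print(plain)     # "57"   "27.1" "36.5"
print(labelled)  # "57.0" "27.1" "36.5"
```

In the plot, the same expression would go inside the label aesthetic, for example aes(label = format(round(mean_rel_abund, 1), nsmall = 1)). One thing to be aware of is that format() pads every value in a vector to a common width, so values with different numbers of digits come back with leading spaces.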
You know, I couldn't have told you that this tile, the 27.1, was 10 percentage points less than this 36.5 tile, or that it was 20 less than this 46.7 tile. Right? It's just not really possible to distinguish between the colors, and I don't know that I could tell the difference between these two darker reds at 54 percent and 57 percent. You might say, well, who cares about a difference of three percentage points? And yeah, you're right. That's why I go back and say that a heat map is probably most useful for qualitative comparisons, not quantitative comparisons. Also, if I were redesigning this and thinking deeper about it, I probably would pool together the Verrucomicrobia and the Other, because with the heat map on its own you can't distinguish what's going on with the Verrucomicrobia. The colors are just too faint.

Also, if you have these numbers in the heat map, then you don't need the legend, right? The legend is doing no work here. And if you've got the numbers, then why not put these x-axis labels up at the top? And then you know what you've got? You've got a table. So what's the point, right? I think this is a real struggle. I really want to like heat maps because they're kind of pretty and, who knows, right? But to make them work, you turn them into a table. So why, right?

Well, let me go ahead and shrink the size of those numbers so you can at least see what that would look like. I'll go ahead and add size = 2, and so those numbers are a bit smaller. Maybe that's too small, but again, you can play with the size of the font there by changing that number. Again, I'd prefer it without the geom_text, thinking about it as a qualitative figure more than a quantitative figure.

Okay, so this is looking at the four most abundant phyla with everything else pooled together into an Other category. We've only been looking at phyla in the previous episodes.
Let's go and see what things would look like if we actually looked at the genus level. To do that, I'm going to come back up to the top here where we filtered on level equals phylum, and I'll put genus, and we'll see if that all just works. So that mostly works. We might need to do some finagling with the size of the document, but we can see the makings of something here, right? And again, this Other category is pooling together those taxa where the highest relative abundance for any of the three disease status groups is less than three percent. Perhaps we would drop that down a bit, but this Clostridioides and Prevotella are already pretty faint, and if we went to a lower level it'd be even harder to see those. So again, considering that Enterococcus and Fokkei whatever are so much darker, we're not going to be able to see these rare populations very well.

To make this look a little bit better, let's go ahead and jack up our height to, say, five. That looks a lot better already; by changing the dimensions of the figure we stretch things out. Something you might think about doing to tidy up the labels along the y-axis is to make it Unclassified, carriage return, and then the name of the taxon. So back up here where we did Unclassified, we could add <br>. I'm going to leave that space in there because down here we're expecting no spaces in the names that we want to put italics around. So we'll go ahead and run that again.
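The reason for keeping that space can be seen in a small base R sketch. The two regular expressions are assumptions about what the labeling code looks like, the placement of the space before the <br> is a guess, and the taxon names are hypothetical examples.

```r
# Hypothetical taxon names: one classified genus, one unclassified group
taxon <- c("Bacteroides", "Lachnospiraceae_unclassified")

# Step 1: rewrite unclassified names with a line break before the italic name.
# The space after "Unclassified" is kept on purpose (see step 2).
taxon <- sub("(.*)_unclassified", "Unclassified <br>*\\1*", taxon, perl = TRUE)

# Step 2: italicize names that contain no whitespace at all. The label from
# step 1 contains a space, so it is skipped and not wrapped a second time.
taxon <- sub("^(\\S*)$", "*\\1*", taxon, perl = TRUE)

print(taxon)
```

Without the space, the second pattern would match the already formatted label and wrap it in another pair of asterisks.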
So this looks great. We've got our taxonomic names for our different genera along the y-axis, our disease status groups across the x-axis, and a fairly decent range for our relative abundance in the heat map. Something I'm very tempted to do is to move these labels to the top. Let's go back and do that. In scale_x_discrete, I'm going to go ahead and do position = "top". We see that our formatting isn't quite right. We can come back to axis.text.x, and I'm going to try axis.text.x.top. So now we've got our disease status groups across the top and our phyla, our genera rather, on the y-axis with the different colors. And again, if you wanted, you could go ahead and put in the numbers, but then what you really have is a table. Let me show you what that would look like. I will turn that geom_text back on, and I think I'll actually remove this size = 2 and keep those fonts bigger. The other thing I will do is show.legend = FALSE. And so there's my heat map, right? And again, it kind of looks like a table. One final thing that I'll do, just because I can't help myself, is in this axis.text.x.top I'm going to do vjust = 0.5. That way, healthy is vertically centered with the other two disease status groups.

So I think what we've got here is a pretty decent heat map, and we've done a pretty good job of showing the strengths and weaknesses of this data visualization approach. There are a couple of things that I want to highlight that we've been able to avoid in our visual here. First of all, we don't have a bazillion different taxa or a ton of different samples. A lot of heat maps that I see out there in papers have just a ton of rows or columns, which makes it really difficult to scan across a row and to understand what I'm supposed to be getting out of that heat map. I think by focusing in on, I don't know, is this maybe a dozen different genera, or four phyla in the other case, we're really focusing our attention. Also, we only have three
categories. To some extent that's also a downside of this medium, one that we also saw with stacked bar charts and pie charts: we can't really get a sense of the variation in the data, so we are collapsing a lot of information together. Maybe we'll come back to this later when we look at individual observations or subjects in each of the three disease status groups.

Another thing I want to call your attention to is that we have a monochromatic legend. We go from white to red, right? We're not going from blue to white to red. That's a bad habit that I think previous versions of heat map software really set people up to fall into. We want to be monochromatic and increasing. Previously, a lot of heat maps would go from the minimum being, say, blue to the maximum being red, with white in between. And so you'd see white and you'd think, oh, that's a low value, right? And then what are blue and red, and how do those relate? It just causes a lot of confusion, because I think instinctively we're used to seeing heat maps where we perhaps have negative values. We have a reduction in, say, gene expression or something like that, and that is blue, a cold color; and we have an increase in gene expression or an increase in relative amounts, and that's red, right? So to throw white in the middle there, when that white doesn't correspond to zero, creates big problems. There are ways to make those types of heat maps in R, but we're not going to go into that because it doesn't really fit with our example.

My advice in terms of picking a color palette for your heat maps is, again, to have it be monochromatic. I think we fell into this trap as well in the paper that these data were originally drawn from, in that we used a rainbow palette. We had red, orange, yellow, green, blue, purple, right? And who can keep track of what that order is? Again, keep things monochromatic: no rainbows, no blue to white to red across the zero to one continuum. Anyway, I hope you find this useful. I
really do want to like heat maps. I think generally they're pretty, but again, I think they're really only useful on a qualitative scale, not on a quantitative scale. Let me know what you think down below in the comments, whether you agree with me or not. I think they're a useful tool, but perhaps they're not as useful as a lot of the people who have been sending me messages think they are. So hopefully they'll respond and let me know why I'm wrong or what they think.

Anyway, we're going to keep progressing, and we'll see other ways of visualizing these data. Hopefully with each step you can appreciate the strengths and weaknesses of the different chart types and realize that for different contexts and different types of data a heat map might be excellent, right? But in other situations, like perhaps this one, it's not so valuable.

Again, let me know what you think down below in the comments if you like this type of material. Also down below in the description are some links to tutorials that I've created. We don't get into heat maps in those tutorials, but there's other stuff in there about dplyr, ggplot, things like that. Look for the link for minimalR. Those are the materials that I use for a three-day R workshop that I teach. I also teach a general R workshop that's three days and doesn't cover microbial ecology data but covers the same concepts using other data sets. So be sure to check those out if you're interested in learning more about R and the kind of thing we've been covering here.

Anyway, please tell your friends about Code Club, be sure that you're subscribed, and click that bell so you know when the next episode drops. Please click the thumbs up so I know that you appreciated this, even if maybe you don't agree with my conclusion, and we'll see you next time for another episode of Code Club.
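As a recap, here is a sketch of the table-like version of the heat map from the end of the episode: labels moved to the top, numbers in the tiles, and no legend. The data are invented, show.legend = FALSE is placed on geom_tile as an assumption, and element_text() stands in for the element_markdown() call that the episode's theme uses.

```r
library(ggplot2)

# Toy data; the values are invented for illustration
toy <- data.frame(
  disease_stat = rep(c("Healthy", "Diarrhea, C. diff negative",
                       "Diarrhea, C. diff positive"), each = 3),
  taxon = rep(c("Firmicutes", "Bacteroidetes", "Other"), times = 3),
  mean_rel_abund = c(45, 50, 5, 60, 30, 10, 55, 35, 10)
)

p <- ggplot(toy, aes(x = disease_stat, y = taxon, fill = mean_rel_abund)) +
  # The numbers make the legend redundant, so it is switched off here
  geom_tile(show.legend = FALSE) +
  geom_text(aes(label = format(round(mean_rel_abund, 1), nsmall = 1))) +
  # Single-hue gradient anchored at zero: white is empty, red is abundant
  scale_fill_gradient(low = "#FFFFFF", high = "#FF0000", limits = c(0, NA)) +
  # position = "top" moves the disease status labels above the tiles
  scale_x_discrete(position = "top", expand = c(0, 0)) +
  scale_y_discrete(expand = c(0, 0)) +
  labs(x = NULL, y = NULL) +
  theme_classic() +
  theme(
    axis.line = element_blank(),
    axis.ticks = element_blank(),
    axis.text.x.top = element_text(vjust = 0.5)
  )
```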