Have you ever done something silly like go back and read a journal that you wrote when you were in high school and cringed at how melodramatic you were or how poor of a writer you were? Well, have you ever looked at your old science papers and kind of cringed at the analysis you did or at the figures you made? You're judging what you did back then with what you know now. It's really not fair to, you know, five-years-ago you. It's also not fair to the people that join our lab now, because I frankly expect them to learn from all the stuff, all the mistakes that we've made over the past 15, 20 years, and not make them now. But of course, as we go forward 15 years, we'll look back on today and be like, man, what was Pat thinking? What were we thinking in that analysis? We should have known better. Well, we're always trying to get better. And that's why I tell people, don't compete with, you know, Joe Schmoe, don't compete with Pat Schloss, compete with yourself. Don't compete with me. Compete with yourself and always be trying to get better. Well, today we'll take a trip down memory lane and we'll look at an old figure from one of my papers and we'll critique it and we'll see if we can't make it better. Hey, folks, I'm Pat Schloss and this is Code Club. As you know, I'm a professor at the University of Michigan. I've been running this lab at U of M for the past 11 years. And before that, I was at UMass Amherst. And before that, I was a trainee and did all sorts of crazy things. Right. Well, over time we learn, right? We learn, we get better and we try not to repeat the mistakes of the past. But I think it's healthy to go back and look at what we did in the past and see how much we've grown and ask how we would do things differently today. As I mentioned, I really feel bad for the people that join my lab today because we are so much further ahead of where we were five, 10 years ago. And we were pretty awesome five or 10 years ago. Right.
But we're always getting better. What we've been doing in recent episodes of Code Club is looking at supplementary figure one from a paper that a former student in my lab, Alexandria Schubert, wrote. It was an ordination that we kind of just threw in the supplement because we didn't want to make it, but a reviewer asked for it. So we made it and we just kind of did whatever they asked. Right. Well, figure one is prime real estate in a paper and you'd hope you do a good job of it. Well, we've been trying to take that ordination in the supplement and add some of the information from figure one, namely the diversity information, to see if we could map diversity information onto different elements of an ordination diagram. And we found it didn't quite work. So what I'd like to do today is go back to figure one from Alex's paper to critique it and to see what works, what doesn't work and what we can do better. So here is figure one A from our paper, or figure one, the whole thing. Right. And what we are doing in this paper is looking for biomarkers that would allow us to differentiate between people that were healthy and had solid stools, people with diarrhea, and then people with diarrhea who also had C. difficile. You will not get tested for C. difficile unless you also have diarrhea. So there's a disease progression here. Right. And one of the things right off the bat that I really like about this figure is that it does one thing. Right. And I think it does it fairly well in that it shows that non-diarrheal controls, healthy people, have higher diversity in their fecal communities than people with diarrhea or people that have C. difficile. OK, so I like that. I like the simplicity and I really commend the past Schloss lab for getting that right. Now, what don't I like about this?
Well, kind of taking in the whole figure, we'll talk about these later, but these are receiver operating characteristic curves that show the ability to differentiate between these three groups of patients based on the structure of the gut microbiota. We don't need to go too far into this. But one thing that right off the bat you'll perhaps notice is that we did not follow the Z pattern, right? Reading across, we go A, C, B, D. Following A, B, C, D, that's like a backward N, right? That doesn't work, right? We should have flipped that, whatever. It's OK. Now, to figure one A. Right off the bat, I'm noticing that we don't have the bars in order of the disease progression. We go case, diarrhea, non-diarrhea. It would be better to go non-diarrhea, diarrhea, case because that's the order of the disease progression. At least in my current 2021 view of the world, that's what I think we should do. Another thing is that these bars don't allow us to see how much data are there, right? How many points are here? You know, perhaps you could put an N next to the case or diarrheal control or non-diarrheal control to say that there are so many patients represented in this bar. And the legend tells you that this is the mean and the standard error. As you recall from previous episodes where we've looked at the inverse Simpson diversity for these samples, there's huge variation in these data. And the standard errors make that variation look really, really small. There's actually a really nice PLOS paper that was published in PLOS Biology, I believe. I'll link to it down below in the notes. It looked at different distributions of data that will give you the same mean, the same standard error, perhaps the same statistics, and they're quite varied. And the point is that bar plots are actually a pretty crappy way of showing continuous data.
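To make that last point concrete, here is a minimal base R sketch. The numbers are made up for illustration and are not from the paper; the idea is just to rescale a skewed set of values so that it has the same mean and standard error as a symmetric one, which means a bar with an error bar would look identical for both.

```r
# Standard error of the mean: sd / sqrt(n)
se <- function(x) sd(x) / sqrt(length(x))

# Made-up values for illustration only (not the study data)
symmetric  <- c(1, 2, 3, 4, 5)
skewed_raw <- c(1, 1, 1, 1, 10)

# Rescale the skewed values so they share the mean and sd of `symmetric`
skewed <- mean(symmetric) +
  (skewed_raw - mean(skewed_raw)) / sd(skewed_raw) * sd(symmetric)

# Same mean and standard error...
c(mean = mean(symmetric), se = se(symmetric))
c(mean = mean(skewed), se = se(skewed))

# ...but the underlying points are spread out very differently
range(symmetric)
range(skewed)
```

A bar plot of either vector would draw the same bar height and the same error bar, even though four of the five skewed points sit below the mean.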
Perhaps they work well for counts. Like if I wanted to show how many people were in each of the three groups, I might have a bar for each of those three groups. But for data where we've got a bar representing dozens or hundreds of points or even, you know, five points, they really leave a lot to be desired. And so maybe not in this episode, but in future episodes, we're going to dig into how we can show more of the distribution of the data. And so I think that's kind of a fundamental flaw, actually, in this figure: we used bars plus or minus the standard error. The problem with the standard error as well is that it portrays the variation as being symmetrical, and I'm pretty sure it's not symmetrical. So I'll show you how to build this plot today. But going forward, I'll show you better ways of doing it. Some other problems with this: the axis labels are at an angle. So I have to crane my neck, and my neck is getting stiff looking at this. I guess to you, you turn it this way. Anyway, you know, putting this on an angle, I know what she's trying to do. We're trying to fit more information in there without having it be totally vertical, or be horizontal and have the titles run into each other. I think the better design choice would have been to break up diarrheal control perhaps onto two lines, non-diarrheal control onto two lines, or perhaps take the whole plot and rotate it 90 degrees so that you could then have those titles read left to right. Another challenge is that the colors that we actually used in the supplement were gray, blue and red. Here we have black, light gray, dark gray. The color scheme doesn't really work. Also, we put four stars in here, I think to indicate this is super significant. But really, significance is a yes, no thing. Is it less than 0.05 or not? If it's less than 0.05, we put a star. If it's not, as she did, we could mark it non-significant or just not even mention it.
OK, I think that's beating up on this plot enough. Let's go ahead and see if we can create this plot in R and then we'll critique it and see what we might do in the next iteration. I've got a starting chunk of code here. If you want to get this code so you can follow along, down below in the notes is a link to a blog post where you can grab it. Also, I have a previous video that explains how to install RStudio and the tidyverse. You'll also need to go ahead and grab ggtext. Otherwise, glue and readxl are installed as part of the tidyverse package. What we're doing here is reading in metadata, reading in information about alpha diversity and joining that all together. I also have a color scheme that I've been using in past episodes. I'll go ahead and load all that. What we need to do next is to take metadata_alpha and summarize the inverse Simpson diversity across our three disease statuses. So I'll take metadata_alpha and we'll group_by disease_stat. And we will do a summarize, and I will do mean, and this will be the mean of inverse Simpson. And let's go ahead and make sure that works. Oh, it's not happy because I didn't run the library calls. So I've got to run everything. So it's complaining because I have a stray greater-than sign. Let me go ahead and rerun that. There you go. So we now have the mean for non-diarrheal control, diarrheal control and case. Something I want to point out is that up here I defined a factor, and I've been using this in past episodes to make sure that my disease statuses are in order from non-diarrheal control to diarrheal control and case. So that should correct that ordering problem that we saw in the paper. All right. So the next thing I'd like to get is the standard error. And that is going to be sd, the standard deviation of inverse Simpson, divided by the square root, sqrt, of N. You can use the n function to count the number of observations in each category.
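The summary step described here can be sketched roughly as follows. This is a sketch under assumptions, not the exact code from the episode: the toy data frame stands in for metadata_alpha, its values are made up, and the column names `disease_stat` and `invsimpson` and the three level names are guesses at what the real data use. The `.groups`, `max`, `min` and `N` pieces come up a little later in the walkthrough and are included so the block is complete.

```r
library(dplyr)

# Hypothetical stand-in for metadata_alpha (made-up values, assumed names)
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")
metadata_alpha <- tibble(
  disease_stat = factor(rep(status_levels, each = 4), levels = status_levels),
  invsimpson   = c(12, 20, 8, 30, 6, 9, 3, 14, 4, 7, 1, 10)
)

invsimpson_summary <- metadata_alpha %>%
  group_by(disease_stat) %>%
  summarize(
    mean = mean(invsimpson),
    se   = sd(invsimpson) / sqrt(n()),  # standard error of the mean
    max  = mean + se,                   # top cap of the error bar
    min  = mean - se,                   # bottom cap of the error bar
    N    = n(),                         # observations per group
    .groups = "drop"                    # return an ungrouped tibble
  )

invsimpson_summary
```

Because `disease_stat` is a factor with explicit levels, the rows come back in disease-progression order rather than alphabetical order.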
And we can get rid of that annoying summarize message here by doing .groups equals drop. Run that and we're good. Now, because I want the caps to be a max and a min, I'm going to go ahead and do max equals mean plus se and min equals mean minus se. Run that and now we see that we've got those columns. So we will be in good shape and this is what we're going to plot. I'm going to call this inverse Simpson summary. And so we'll take inverse Simpson summary and I'll go ahead and pipe that to ggplot and aes, and then across the x-axis, again, that's our first pre-attentive attribute, the thing our eyes as readers are going to notice first, we're going to put disease_stat as x, and then y is going to be mean, the height of the column. And let's go ahead and do that. And then we will do geom_col to make a column plot or a bar plot. We'll do that. And let's go ahead and also do ggplot, or sorry, ggsave, and I will call this schubert_diversity.tiff and I'll do width equals 4.5, height equals 3.5. Again, I like to be specific in outputting my figures. I don't want to depend on RStudio's plot tab. So I've got my start. I've got my non-diarrheal control on the left, then diarrheal control, then case. So we've got the disease progression. Again, if I look at the Schubert figure one, it's the opposite, but those heights look pretty good, so I'm happy with that. And so let's go about fixing this up a bit. So first let's go ahead and change our theme and let's fix our labels. I don't need an x-axis label of disease_stat because it's clear that those are the disease statuses, and the y-axis is going to be the inverse Simpson index. So I can do labs, x equals NULL. And so that will not put anything on the x-axis as a title. y will be Inverse Simpson Index. Good. And we'll do theme_classic. And so now this is looking better already. We've got the clean background. We've gotten rid of that x-axis title and we have Inverse Simpson Index on the y-axis.
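As a rough sketch of where the plot stands at this point, assuming a small made-up summary table in place of the real one (the object name `invsimpson_summary`, the level names, and the numbers are all stand-ins):

```r
library(ggplot2)

# Hypothetical summary table; values are made up, not from the paper
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")
invsimpson_summary <- data.frame(
  disease_stat = factor(status_levels, levels = status_levels),
  mean = c(17.5, 8, 5.5)
)

p <- ggplot(invsimpson_summary, aes(x = disease_stat, y = mean)) +
  geom_col() +                                    # one column per disease status
  labs(x = NULL, y = "Inverse Simpson Index") +   # drop the x-axis title
  theme_classic()                                 # clean background

# Write the figure at a fixed size instead of relying on the plot pane
ggsave("schubert_diversity.tiff", plot = p, width = 4.5, height = 3.5)
```

Passing `plot = p` to `ggsave` is a small design choice: it saves a named object rather than whatever happened to be drawn last.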
Now these titles are not what we want. That is what we used in the paper, but it's not so clean or intuitive. We will do scale_x_discrete. And for our breaks, we're going to use those levels that we used to define the factor way up here. So I'm going to grab this vector and put this in as the value for breaks. And I think I can put this all on one line. Yeah. And then for my labels, I'm going to do healthy. And I will do diarrhea, and maybe I'll say without C. difficile, or diarrhea and C. difficile negative. Maybe that's better. And then we'll also do diarrhea and C. difficile positive. And I'm going to go ahead and put stars around C. difficile because we're going to use ggtext to make that italicized. And we'll also, I think, put in a line break here. Again, ggtext uses Markdown and HTML to render these. We have to tell it to, but we'll get it done. So br is the line break in HTML. So if we go ahead and run this, I screwed something up. I forgot a plus sign here. And so we see that our axis labels are all messed up. And so to fix that, we need to do theme. And then in theme, we'll do axis.text.x is element_markdown. And voila, we've got the line break and we've got the italicization of C. difficile. Looks great. I'm happy with that. OK, now I forgot we need to also put on the caps for the plus or minus the standard error. So we'll do a geom_errorbar. Now, geom_errorbar needs two aesthetics. It needs ymin and it needs ymax. And so here I'm going to put in min and max, the min and max columns from the data frame that I defined up here. And what we find is that our error bars are as wide as the column. They're also in front of the column, and I would like them to be behind the column. I don't really need to see the bottom cap of the error bar. So I can use the width parameter on geom_errorbar. So I'll do width equals 0.5. And I'm also going to move this ahead of geom_col.
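Here is a minimal sketch of those two changes together, the Markdown axis labels and the error bar layered before the columns. It assumes a made-up summary table with `min` and `max` columns; the level names and numbers are stand-ins, not the study data.

```r
library(ggplot2)
library(ggtext)

# Hypothetical summary table; values are made up for illustration
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")
invsimpson_summary <- data.frame(
  disease_stat = factor(status_levels, levels = status_levels),
  mean = c(17.5, 8, 5.5),
  se   = c(4.9, 2.3, 1.9)
)
invsimpson_summary$max <- invsimpson_summary$mean + invsimpson_summary$se
invsimpson_summary$min <- invsimpson_summary$mean - invsimpson_summary$se

p <- ggplot(invsimpson_summary, aes(x = disease_stat, y = mean)) +
  # Added first, so it is drawn underneath the column and only the
  # top cap shows; width = 0.5 makes it narrower than the column
  geom_errorbar(aes(ymin = min, ymax = max), width = 0.5) +
  geom_col() +
  scale_x_discrete(
    breaks = status_levels,
    labels = c("Healthy",
               "Diarrhea and<br>*C. difficile* negative",
               "Diarrhea and<br>*C. difficile* positive")
  ) +
  labs(x = NULL, y = "Inverse Simpson Index") +
  theme_classic() +
  # Without this, the <br> and the asterisks are printed literally
  theme(axis.text.x = element_markdown())
```

Layers are drawn in the order they are added, which is why moving `geom_errorbar` ahead of `geom_col` is enough to hide the bottom cap.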
So you can think of it as drawing the error bar before it draws the column. And so that makes it more narrow and it hides the bottom cap. And so if we look at what we published, this actually looks more in line with what we had. Again, the ordering is different. Let's go ahead and change the color. And similar to scale_x_discrete, we can do scale_color_manual. And we can do name equals NULL. And we will do breaks. And I'm going to just copy breaks and labels down here. And we'll go ahead and do values equals, and this is going to be a vector that will be healthy color, diarrhea color and then case color. Good. Let's give that a run. And of course, I forgot the plus sign here. So that didn't do anything. What happened? Ah, so it didn't do anything because I didn't map anything to color. And so I'll do color equals disease_stat. So some of you might be yelling at me because you'll notice I screwed something up. Color is the color of the border of the bar. What I meant to do was fill. And so I need this to be fill equals disease_stat. So I want the fill color of the bar, and then also here I want that to be scale_fill_manual. So now you'll see that we have our bars colored by disease status. We also have this big honkin' legend. I do not think we need a legend to say what the x-axis labels tell us. So I'm going to go ahead and get rid of that legend. And we can do that, as we've seen before, by doing show.legend equals FALSE. Let's take a few moments and critique this figure for what's good and what we still don't like. Just at the beginning, let me be clear that I would not publish this figure today. I would not show it in a talk, knowing what we know about bar graphs and how they hide information. They hide information about the number of subjects that go into each of these bars. They also hide information about the variation in the data.
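Before going further with the critique, here is a small sketch of the color versus fill fix from a moment ago. The three colors are placeholders based on the gray, blue and red mentioned earlier, and the variable names, level names and numbers are assumptions for illustration.

```r
library(ggplot2)

# Placeholder colors (assumed; the episode's actual values are not shown)
healthy_color  <- "gray"
diarrhea_color <- "blue"
case_color     <- "red"

# Hypothetical summary table; values are made up
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")
invsimpson_summary <- data.frame(
  disease_stat = factor(status_levels, levels = status_levels),
  mean = c(17.5, 8, 5.5)
)

p <- ggplot(invsimpson_summary,
            aes(x = disease_stat, y = mean, fill = disease_stat)) +
  # `fill` colors the inside of the bar; `color` would only change its border
  geom_col(show.legend = FALSE) +   # the x-axis labels already identify the bars
  scale_fill_manual(
    name   = NULL,
    breaks = status_levels,
    values = c(healthy_color, diarrhea_color, case_color)
  ) +
  theme_classic()

# Fill colors actually assigned to the three bars, left to right
layer_data(p, 1)$fill
```

The scale has to match the aesthetic: mapping `fill` while supplying `scale_color_manual` would leave the bars in the default palette.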
We saw before that for the healthy people, I think the maximum value we saw for inverse Simpson was actually 30, and the smallest values for some of these went down to one. So again, we're not seeing the true distribution or a true depiction of the data. So what do I like? Well, I do like now that we've got the order of the disease progression going left to right. I think it's very clear that the inverse Simpson diversity of healthy people is larger than it is for people with diarrhea and for people that are positive for C. difficile. Oop, I need to fix that negative to be positive. And I think the coloring works because it matches the color scheme that we had previously. Let me go ahead and fix a couple of things that maybe we could do to make this better. So first of all, let's make this positive. And something I might do is, up here in summarize, let's go ahead and put in a column for the N. And so I want to get a healthy N. And that will be inverse Simpson summary, filter on disease_stat equals, that was non-diarrheal control, right? So non-diarrheal control, and we'll then pull N. So double check that this works: healthy N, so 155. And what I want to do is create this for the three groups. So this will be diarrhea N and this will then be diarrheal control. And this will be case N and this will then be case. So let's make sure we've got these all loaded. And now we've got glue installed and loaded. Let's go ahead and, in our scale_x_discrete, for our labels, put in our Ns, right? So we'll do N in parentheses. I'll do N equals and then in curly braces, healthy N. Right. And maybe I'll copy this for the other two cases as well. And this was diarrhea N and then this was case N. And because we're working with glue, we can use those backslashes to make some extra line breaks for us to make it easier to see what's going on. So let me put glue around each of these separately. So glue here and here. All right.
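The N labels can be sketched along these lines. The summary table is a made-up stand-in (the Ns below are not the study's), and the names `healthy_n`, `diarrhea_n` and `case_n` are my guess at how the spoken names would be written. I've left out the backslash line continuation and kept each label on one line.

```r
library(dplyr)
library(glue)

# Hypothetical summary table with an N column; counts are made up
status_levels <- c("NonDiarrhealControl", "DiarrhealControl", "Case")
invsimpson_summary <- tibble(
  disease_stat = factor(status_levels, levels = status_levels),
  N = c(12L, 9L, 10L)
)

# Pull the group size for each disease status
healthy_n <- invsimpson_summary %>%
  filter(disease_stat == "NonDiarrhealControl") %>%
  pull(N)
diarrhea_n <- invsimpson_summary %>%
  filter(disease_stat == "DiarrhealControl") %>%
  pull(N)
case_n <- invsimpson_summary %>%
  filter(disease_stat == "Case") %>%
  pull(N)

# glue() fills in the values inside the curly braces; <br> and the
# asterisks are rendered later by ggtext's element_markdown()
x_labels <- c(
  glue("Healthy<br>(N={healthy_n})"),
  glue("Diarrhea and<br>*C. difficile* negative<br>(N={diarrhea_n})"),
  glue("Diarrhea and<br>*C. difficile* positive<br>(N={case_n})")
)

x_labels
```

The resulting vector would go into the `labels` argument of `scale_x_discrete`, in the same order as the breaks.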
And then we'll do glue there. I'm just kind of cleaning stuff up here a little bit, and then glue on that. And that, yeah. And so now we have an N under each of the bars. And so again, it's not perfect. The bars are still hiding the points, the distribution of the points, but at least by putting in the N below the title for each of the disease statuses, we get a sense of how many points are being represented by each bar. Again, I wouldn't use this plot today. I would use other approaches that we're going to be talking about in future episodes. So that's a reminder that you should be sure that you're subscribed to this channel so that you can see what comes next. Be sure you hit that bell icon so you're there when I post the next video and you can see what Pat suggests using to look at these data in a better way that shows the distribution of the data. Anyway, I hope you can play around with your own data. I hope you can look at your old figures, your own papers and maybe your old journals and writing and not cringe, but see that as an opportunity to learn and to grow and to find a way to be better. I think we're all trying to grow and we're all trying to do better. And that's really one of my goals here with the Code Club episodes and with trying to critique some of our old work. If you would like to have me help you grow by giving you a gentle critique of your visuals, by all means, let me know. I'd love to talk through with you some of your design choices, the constraints that you were under as you went about making your plot, and what you were trying to do, and see if I can't help you figure out how to make things better. Anyway, please tell your friends about these episodes of Code Club and we'll see you next time.