Have you had that experience where you're reading a scientific paper, you look at some of the figures, and you think, "Man, I wish they had done it this way," or "I wonder why they didn't do it this other way"? Or, "It's the same kind of figure in every paper. Why do they keep doing that?" Perhaps worse, you look at your own papers and think, "Why did I do that?" or "My skills just aren't good enough to do this other thing that I'd really like to do." Those reactions are all a critique that you're forming about the figures you're looking at. In today's episode of Code Club, we're going to critique a figure that I made in a paper I published a number of years ago, and we'll see if we can go about making it better.

Hey folks, I'm Pat Schloss. A number of years ago, a former graduate student of mine, Alex Schubert, and I put together a manuscript that we tried to get published in the journal mBio. It used statistical models to diagnose whether or not individuals had Clostridioides difficile infections based on the structure of the microbial community in a fecal sample. As we were going through the review process, there was a lot of statistical modeling, machine learning, and jargony, wonky stuff. The second reviewer, I'll always remember, said, "I don't want you to dumb it down, but could you perhaps put in another figure that would be a little bit more accessible to clinicians?" And we thought that was pretty funny.
And so we said, "Well, I guess we have to put in an ordination diagram." Ordinations, PCoAs and MDSs, are kind of the scatterplot of microbial ecology data, and it seems like you just have to have one of these in a microbiome paper these days. Anyway, we begrudgingly put one in the supplement. I've never really been happy with how that figure turned out. If we go to that manuscript here on the mBio website, it's supplemental figure 1, way down here. Let me leave the legend there for you to look at.

So here is what the figure looks like. It's an NMDS, and I don't know if anything strikes you about this. It's pretty nondescript. We see the non-diarrheal controls clustering on the left, whereas the cases and diarrheal controls cluster on the right. The way this works for diagnosing C. difficile infection is that if you have a normally formed stool sample, you cannot get it tested for C. difficile. You have to have diarrhea. So these non-diarrheal controls were people that had normally formed stool, whereas the cases and the diarrheal controls were both sets of people that had diarrhea. You can see that those red and blue points look really similar to each other, whereas the people with normally formed stool are on the left.

Okay, so that was a lot of description that I just went through to explain this plot to you, and maybe that's a problem with this plot. It's not immediately intuitive what story I'm trying to tell. And if you look at the caption down here at the bottom, it's not immediately clear what you should be getting from this figure either.
And so I think about this figure a lot, probably more than I should for a paper that's about seven years old now. I think about how we're always trying to get better at how we display our data and how we make our data accessible to other people. So I've been making a list of notes for a data visualization rubric that I would like to develop, one that I, people in my lab, or you could use to evaluate the data visualizations that we have in the scientific literature.

Now, it's not like you get a gold star for everything and you win. You're not going to be perfect on everything, because we have constraints. There are things that work or don't work with everything. For example, I might want this to be a 3D ordination to really show off the variation in the data, but it's going to be published in a PDF. That's a 2D medium, so I can't make it 3D. Those are limitations that are built in. There are also technical limitations: I might not know how to make a 3D plot, or I might not know how to make a plot like this.

So where I'm going with this is that over future episodes of Code Club, I want to work with you to develop this type of rubric, evaluate plots, use R to make the plot, and then use R to make the plot better. I'm going to go first with some of my plots from my research and some of my ideas, and maybe use plots that are emblematic of trends that I see in the literature. But hey, if you've got a figure from your own work that you'd like me to take a look at and have a discussion with you about, let me know. Put a comment down below or email us at riffomonas@gmail.com, and we'd love to get you in the queue to talk about your work. I promise I will be nurturing and supportive, and we will not rip people apart.
So anyway, let me tell you about this rubric that I've got so far. It's all handwritten because I'm not really ready to commit it to stone.

First, I want it to be an attractive plot. Is this attractive? Not really. Is the primary question and answer clear? Does it tell a story? Not really; we already talked about that. What is the level of effort required by the reader to understand what is going on? Well, I had to go through that couple-minute ramble describing what was going on here. Is there a declarative title? No, there's not a declarative title here. Does it have the audience in mind and show some empathy? I think I was trying to be empathetic when we threw the reviewer a bone by making this figure, but maybe we weren't as empathetic as we could be. There's more we could do to highlight what's going on in this figure.

Maximize the data-to-pixel ratio: I will talk more about this later, but I think we're doing pretty well here. It's a pretty minimalistic plot that conveys the information. Consistent fonts, clear typography, size, and bolding: I think it's pretty good. One thing that struck me when I opened this again for the first time in a while was that the axis labels are pretty big compared to the legend labels, so the proportions there seem a little bit off. Good composition of multiple panels: this isn't multi-paneled. Selection of pre-attentive attributes is optimized based on variable type: pre-what? Pre-attentive what? We'll talk about that later. Clear axis values and ranges: not so much, because we don't really bound the range on the y-axis or on the x-axis. Colorblind friendly: yeah, these colors are pretty safe. Fewer than five colors, shapes, or line types: definitely. We have three colors and one shape. Is there evidence that the tool overpowered the designer?
No, I don't think so. I think we knew exactly what we were doing in making kind of an abysmal plot. Is there evidence that the designer was limited by their tool, which is probably a sign that we're using the wrong plot type? No, I think we're good there. So again, this is my first draft of a rubric, and you can see my thoughts as I went through evaluating this plot.

Whenever we build a figure, we always have some type of constraints. One of the constraints that I think we'll come back to repeatedly with a diagram like this is that we're working within a two-dimensional medium. I could make a 3D plot, and maybe we'll do that in a future episode, but I can still only see it in two dimensions here. Another constraint is that it's going into a scientific publication, and scientific publications don't typically have a title across the top of the plot. If I were presenting this in a talk, I could have a title, and I might fashion this a little bit differently, because the constraints change between a paper and a talk. One other constraint that's imposed on us with these data is that they are really complicated. The distances between the points are described by some metric of community dissimilarity, which is an abstract, heady topic that's difficult to communicate to a lay audience and sometimes even to other scientists.
Okay, so that's a lot. We're not going to get through all of that today by any means, but we're going to take a bunch of episodes to go through some of these topics with this plot. Then, like I said, we'll try it with some other data from my lab and perhaps data that you want me to take a look at, and we can have a discussion about that.

The first thing I would like to do is use R to regenerate this plot. As you might be able to tell, this plot was generated in R, but it was generated using base R graphics. I'm not going to do it in base R because, to be honest, I've been using the tidyverse and ggplot2 for so long that I kind of forget how to make a scatter plot in base R. That's kind of sad to say, since I was such a long holdout for base R.

I have some code in here. Hopefully you saw the previous episode about getting set up in RStudio and getting the data files that we're going to be working with. I'll go ahead and highlight everything here and hit run to load everything, so that I now have this metadata_nmds data frame. What we see is that axis1 and axis2 are the two axes for our NMDS. Maybe in a future episode we'll talk about why NMDS and why not PCoA, but there's only so much we can do in one episode.

All right, so to build this plot, I'm going to do ggplot, then metadata_nmds, and then our aesthetics, which say what variables we want to map to different components of the plot. So x will be axis1, y will be axis2, and the color I'm going to map to disease_stat, this variable here. Then we'll do a geom_point. Let's go ahead and run those and see what we get. We get something that doesn't look very good. I'll make this a little bit bigger. It got gray, so I hit that refresh button, and everything is good now. Well, good-ish. Maybe I'll try to make this a little bit bigger.
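The first version of the plot described above can be sketched roughly as follows. The episode's data files are not reproduced here, so this sketch builds a small random stand-in for metadata_nmds; the column names (axis1, axis2, disease_stat) and the group codes are assumptions inferred from the narration, and the coordinates are made up.

```r
library(ggplot2)

# Hypothetical stand-in for the episode's metadata_nmds data frame.
# Column names and group codes are assumptions; the coordinates are random.
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

# Map the two NMDS axes to x and y and the diagnosis group to color,
# then draw one point per sample.
p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point()

p
```

With ggplot2's defaults this gives the gray panel background and the default color palette, both of which get changed in the steps that follow.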
That makes it easier to see. What we see is that we've got the non-diarrheal controls in blue, and they're at the bottom. Green are the cases, and red are the diarrheal controls. With NMDS the axes aren't really constrained. If you think about the other plot that we published, the grays are the non-diarrheal controls and they're over on the left. So basically, think about taking this plot that I just generated and turning it 90 degrees, and I think it'll look pretty close. There are a couple of differences between what I'm using now and what we used way back when. The distances I'm projecting into the NMDS now are Bray-Curtis distances; previously we used thetaYC. I think they're pretty close for our purposes.

Okay, good. Let's go ahead and clean this up a bit to see if we can get it looking more like that previous plot. One of the first things we can do is change the labels. So we do labs, and then x = "NMDS Axis 1" and y = "NMDS Axis 2". Previously we had Axis 1 and Axis 2. I think putting in NMDS Axis 1 and NMDS Axis 2 makes it clear what type of ordination we used. I'd also like to have a white background, a clear background. Aesthetically, again, there's that data-to-pixel ratio. To get that we can do theme_classic, and that cleans up the appearance. We don't have a full box around the plotting window, but that's okay.

Let's go ahead and change these colors, because the defaults from ggplot are not friendly to our friends who are red-green color deficient. I like to put this in between my labels and my theme. We'll do scale_color_manual, and then the name. I don't need to put disease_stat in here, so we'll do name = NULL. We'll do breaks equals, and this can be a vector. We could start with the case, or let's do it in the order that I'm thinking through things.
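The axis-label and background changes just described might look something like this. As before, the data frame is a random stand-in with assumed column names, and the exact capitalization of the axis titles is a guess from the narration.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  # Name the ordination method in the axis titles.
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  # White background with axis lines on the left and bottom only.
  theme_classic()

p
```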
So I will do the non-diarrheal control, and then we'll do the diarrheal control and the case. That's good. Then we need to add values, and those are going to be our colors. The non-diarrheal control will be gray. Alex made the cases red and the diarrheal controls blue, so this will be blue and this will be red. Good. Then for our labels, I'm going to grab this vector and copy it down. For now, I'm going to put in a hyphen there, lowercase the d, and put a space in, so: non-diarrheal control, diarrheal control, case. Let's get my parentheses straight, put a plus there, and bring up the theme_classic. Give that a rip and make sure everything works. And we're in good shape. Nice.

Again, I'm not totally crazy about these labels, because it's not immediately clear what we looked at. So maybe instead of non-diarrheal control we'll do "Healthy", then "Diarrheal, C. difficile negative", and instead of case let's do the same thing but say "C. difficile positive". Give that a rip, and we get the same idea. It's a really long name. Maybe what we could do is put in a backslash n for a line break. What does that look like? That's okay. Not ideal. Maybe that's something we'll think about as we iterate over future versions of this plot. Yeah, that doesn't look great either.

One of the things I'm not really liking is that I've got so much space on the right side of my plot devoted to this legend, and it's just not good. It'd be nice if there was a way to move this legend inside the plotting window, but we can't right now. Another thing I don't like is that the C. difficile is upright, in a normal font. It's not italicized, and it would be nice for that to be italicized. You know what? Maybe I will go ahead and remove the "Diarrheal". So I think what I'm going through here with you is that there are constraints. There are challenges that we're running into as we do this.
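The manual color scale built up above can be sketched like this. The group codes passed to breaks are assumptions about how the diagnoses are stored in the data; the colors and the relabeled legend entries follow the narration. When breaks are supplied, ggplot2 matches an unnamed values vector to them in order.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  scale_color_manual(
    name = NULL,                                   # no legend title
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    values = c("gray", "blue", "red"),             # matched to breaks in order
    labels = c("Healthy", "C. difficile negative", "C. difficile positive")
  ) +
  theme_classic()

p
```

A long label can be split across two lines by embedding "\n" in the string, for example "Diarrheal,\nC. difficile negative", at the cost of a taller legend entry.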
So, you know, maybe we could do "C. difficile neg" and "C. difficile pos". That tidies it up, but at the same time it obfuscates what we're really trying to get at. So I think I'll go ahead and leave in positive and negative, give that a run, and get it right back to the way it looked.

If we compare it to what we published previously, one of the things I notice is that the range on axis 1 is the same as the range on axis 2. I can get that look by adding coord_fixed. This will make the spatial variation on the y-axis and the x-axis the same. Let me show you what that means. I have the same distance between, say, minus 0.6 and 0.8 here, and that variation is the same as the variation here. So it gives us a square, and what we see is that we have a nice round circle.

One final thing I'll do is ggsave, and we will output this as schubert_nmds.tiff. So there you go. There's my Schubert NMDS from 2021, and here's my Schubert ordination from 2014 or so. I think you can hopefully see the similarities. One of the other things I did was reorder the diagnosis groups. I like having healthy first, then C. diff negative, and then C. diff positive, because that shows the progression of the disease. Maybe we could even do healthy, diarrhea, C. diff positive. So if we put in "Diarrhea" and are sure to run that ggsave, we again see healthy, diarrhea, C. diff positive. Yeah, that works pretty well. Something that we'll come back to and talk about is how we can italicize C. difficile in "C. difficile positive".

One other thing I might think about is how I can shrink the space between these three legend items.
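A rough sketch of the coord_fixed and ggsave steps follows. The data are again a random stand-in with assumed column names. The narration gives the output name schubert_nmds.tiff but no dimensions, so the width and height here are placeholders, and the file is written to a temporary directory only to keep the example tidy.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  # One unit on the x-axis is drawn the same length as one unit on the y-axis,
  # so distances between points are not stretched in either direction.
  coord_fixed() +
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  scale_color_manual(
    name = NULL,
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    values = c("gray", "blue", "red"),
    labels = c("Healthy", "Diarrhea", "C. difficile positive")
  ) +
  theme_classic()

# Width and height (in inches) are assumed values, not taken from the episode.
out_file <- file.path(tempdir(), "schubert_nmds.tiff")
ggsave(out_file, plot = p, width = 6, height = 5)
```

Checking the saved file rather than the preview pane matters here, because legend spacing and proportions depend on the output dimensions.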
They seem pretty wide apart. If I do ?theme, my help over here brings up a whole bunch of options I can look at, and what I'm looking for is the legend. There's the legend spacing. I've been through this, and I'm sure it's actually legend.key.height that we want to look at. Let's scroll down to where it describes the arguments for legend.key.height. Here we are: key height, the size of legend keys, in a unit. Great, so we give it a unit using the unit function. Let's add that up here. We'll do a theme, and in here we'll do legend.key.height, and let's do unit, one, "cm". One centimeter. Let's see what it looks like. Good, it's a starting point. Let's come back to our code and do 0.2 centimeters. That looks more compact, maybe a little bit too compact. I feel like this is kind of squished together. I think it's probably squished together up here because we didn't want to get in the way of that point. So let's do 0.25. That looks like it's got a little bit more breathing room.

Now, one other thing we could do is see if we can move this legend to the bottom right corner. Another thing we can set in the theme function is the legend position. That's probably down here further: legend.position. We can give it a vector of two elements, so we can do legend.position and then c(). Over on the x let's do 0.9, and for the y let's do 0.1. Give that all a run. That doesn't look great, but let's see what things look like in the TIFF, because things always change when you render it as a TIFF. This actually shows me that the legend background is white instead of transparent. That means we need to modify legend.background.
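The two theme settings explored above might be combined as in this sketch, using the same random stand-in data and assumed column names. One caveat worth flagging: the numeric form of legend.position follows the narration, and to my understanding ggplot2 3.5.0 and later deprecate it in favor of legend.position = "inside" together with legend.position.inside, so newer versions may print a deprecation warning.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  coord_fixed() +
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  scale_color_manual(
    name = NULL,
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    values = c("gray", "blue", "red"),
    labels = c("Healthy", "Diarrhea", "C. difficile positive")
  ) +
  theme_classic() +
  theme(
    # Shorter keys pull the three legend entries closer together.
    legend.key.height = unit(0.25, "cm"),
    # Fractions of the plotting area: 0.9 across, 0.1 up (bottom right).
    legend.position = c(0.9, 0.1)
  )

p
```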
Then we can do element_rect with color = NA, and that should make the background clear. But it didn't. So let me look back here at the theme elements: legend.background, element_rect. Let's make this element_blank. That's good enough. I think that gets what we wanted.

We see that the challenge here is that we've got overlap, and we don't know if this point is part of the overall ordination. So let's move the legend to the right a little bit, and then we can put a rectangle around it and see if that makes it a little bit better. So, to the right a little bit: let's do 0.95 and give that a run. Yeah, that bumped it to the right pretty nicely.

Let's look back at the help for the legend. Maybe we do want that element_rect after all. I think I know what I screwed up before. With element_rect I used color, and color is the color of the border. I think what I wanted was fill = NA. Let's look at that. So that worked, but we then need to look at the border. Again in element_rect, we could do linetype, or let's do color = "black", because I suspect there's already a line there. Run that, and I think we're going to have a pretty hideous black box around it.

The margin is probably too big, so let's see if we can shrink that down a little bit. There's a legend.margin here, so let's try to modify that. We'll do legend.margin, and that takes margin(). We give it top, right, bottom, left, so let's do t = 0, r = 0, b = 0, l = 0. That looks good-ish. There's still a bit of a gap up at the top, and it's too short on the bottom. I think I would feel okay if I had a little bit more on the bottom and a little bit more on the right. So let's do bottom, left, and right as 1 and see what that looks like. It does give us a little bit more spacing.
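A sketch of the legend box as it stands at this point in the narration: a transparent fill, a black border, and a tightened margin. In element_rect, fill controls the interior and color controls the outline, which is the mix-up described above. The data are a random stand-in with assumed column names, and the numeric legend.position carries the same deprecation caveat in newer ggplot2 versions.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  coord_fixed() +
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  scale_color_manual(
    name = NULL,
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    values = c("gray", "blue", "red"),
    labels = c("Healthy", "Diarrhea", "C. difficile positive")
  ) +
  theme_classic() +
  theme(
    legend.key.height = unit(0.25, "cm"),
    legend.position = c(0.95, 0.1),
    # fill = NA makes the interior transparent; color draws the border.
    legend.background = element_rect(fill = NA, color = "black"),
    # Margins are top, right, bottom, left, in points by default.
    legend.margin = margin(t = 0, r = 1, b = 1, l = 1)
  )

p
```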
Let's try bottom 2, right 2. That looks like decent spacing, though we still have a fair amount of space up at the top. I'm noticing a few things I still don't like. The gap between the point and the text is pretty wide, and there's also a little bit of unevenness above here. So I think what I actually want to do, instead of the key height, is to set legend.key.size so the key is square. I want it to be a quarter of a centimeter tall and a quarter of a centimeter wide. Let's run that and see if it pulls it together a little bit. It looks like that might have helped, but I have to ggsave it. So that seems to pull it together a little bit.

Then let's work on this margin. Let's try 2 all the way around. That looks okay, but it's a little bit cramped, so let's put it up to 3. That looks pretty good. There's still a little bit of extra header space above where the title would be that I don't like. Let me see if I can put in minus 1. Let's do minus 2 and see if we can get a negative margin. That looks good. I like that.

One thing I might rather do, actually, is put the legend up at the top, in the upper right corner. Again, at this point you're just fiddling with things. I kind of like having it in the upper right because when I see the figure for the first time, my eyes naturally read in a Z pattern. If my eye starts at the top, looks to the right, and sees the legend, then when it comes down across the image I can better interpret what's going on. So let's put it up there. Let's do 0.9, 0.95. And that looks really decent. I'm fairly happy with that. I could keep picking at it, but that's why we're going to come back and look at other ways to play with this figure in future episodes of Code Club.
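Pulling the adjustments together, the end state of the legend described above could look roughly like this. The data are a random stand-in with assumed column names. The narration does not say whether the black border was kept, so this sketch assumes it was, and reads "minus two" as the top margin. The numeric legend.position follows the narration and may trigger a deprecation warning in ggplot2 3.5.0 and later.

```r
library(ggplot2)

# Hypothetical stand-in for metadata_nmds (assumed column names, random values).
set.seed(1)
metadata_nmds <- data.frame(
  axis1 = rnorm(90, sd = 0.3),
  axis2 = rnorm(90, sd = 0.3),
  disease_stat = rep(c("NonDiarrhealControl", "DiarrhealControl", "Case"),
                     each = 30)
)

p <- ggplot(metadata_nmds, aes(x = axis1, y = axis2, color = disease_stat)) +
  geom_point() +
  coord_fixed() +
  labs(x = "NMDS Axis 1", y = "NMDS Axis 2") +
  scale_color_manual(
    name = NULL,
    breaks = c("NonDiarrhealControl", "DiarrhealControl", "Case"),
    values = c("gray", "blue", "red"),
    labels = c("Healthy", "Diarrhea", "C. difficile positive")
  ) +
  theme_classic() +
  theme(
    # Square keys: a quarter of a centimeter tall and wide.
    legend.key.size = unit(0.25, "cm"),
    # Upper right corner of the plotting area.
    legend.position = c(0.9, 0.95),
    # Assumption: the border from the earlier step is retained.
    legend.background = element_rect(fill = NA, color = "black"),
    # A negative top margin trims the space left for the (absent) legend title.
    legend.margin = margin(t = -2, r = 3, b = 3, l = 3)
  )

p
```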
All right. Again, I feel like we've made some progress in taking this figure from my paper from many years ago and modifying it to be a little bit more attractive. Perhaps we haven't made that big of a gain, but we have reproduced it, we've added a little bit more information on the axes, and the size of the fonts is a little bit more proportional. We've got something now that we can work with going forward.

Thinking about that rubric that I laid out at the beginning of today's episode, I think we've already made some improvements by being more descriptive in the axis titles and giving more clarity about what we're looking at. Instead of saying case, we have C. difficile positive. Instead of non-diarrheal control, we have healthy. Instead of diarrheal control, we have diarrhea. It's not perfect, but again, we've got constraints. We're always going to have constraints when we're working with data and how we display them to the community. The constraint that I'm working with here is trying to make a figure that would look attractive within a scientific publication.

As I talked about before, if we look back at the legend for this figure in the paper, it's pretty blah. It's descriptive and generic and doesn't say anything. If I were to write this again, it would say something like, "The microbial community structure of healthy individuals is significantly different from the community structure of people with diarrhea and of people with C. difficile infections." But we don't know that for sure until we do a statistical test, so we'll have to figure out how we can do a statistical test within R to build that up.

All right, so I hope you found this useful.
We're going to use this as a launching pad for future episodes. I'm going to be sure to put all the code that I've written today in the notes that are linked below this episode here on YouTube. As I said at the beginning, if you've got figures that you'd like someone else's take on, what I might do to improve them, or what I like or don't like about them, by all means let me know. Perhaps I can have guests on here, and we can share together what we like and don't like about each other's graphs, all with the hope of making them better and making it easier to communicate our science to others.

All right. Keep practicing, playing with these different techniques, and thinking critically about how we present our data, and we'll see you next time for another episode of Code Club.