One of the more popular approaches to visualizing data is with what's called a line plot. In today's episode of Code Club, I'll show you how we can use ggplot2 in R to make line plots look more attractive. Hey, folks, I'm Pat Schloss and this is Code Club.
In the last episode, we spent a fair amount of time going through building functions and running code to generate data for a receiver operator characteristic curve, also called an ROC curve. An ROC curve is a way of looking at the ability of a marker or a diagnostic tool to classify individuals into different disease statuses. We were using the inverse Simpson value, so if you have a higher inverse Simpson value, that indicates that you are more likely to be healthy. Anyway, we generated these curves. These curves are line plots. We spent a lot of time building functions and talking about DRY code. And the plot at the end is functional: it works, it shows us the results. But it's not really attractive. So in today's episode, we're going to spend a fair amount of time going through different options that we have to make line plots look better, using the geom_line function from ggplot2 in R.
Before we get going here in RStudio, I want to remind you that down below, there's a link to a blog post that includes the code that I am starting with here today. Also along the top here is a link to a video that I made about installing and getting everything set up with R, RStudio, and the tidyverse, and getting the data that I'm working with. Also, if you really like these tutorials but you still feel like I'm covering a lot of new territory and you'd like more of the fundamental aspects of R and the tidyverse, be sure to check the links down below, where you'll go to the Riffomonas website. There I have an online tutorial for free. I also have links there where you can sign up to take a workshop that I teach.
I teach three-day workshops about learning R through the tidyverse using both microbiome data like this, as well as data sets that have nothing to do with the microbiome.
There's a lot of code here about building the ROC curve that I don't really want to go through. We talked last time about building functions and running those functions. That then results in this data frame called roc_data, which we ultimately use ggplot to plot. So I'm going to go ahead and run everything to make sure it all works. Again, here are our ROC curves that came out of the code that we generated in the last episode. You'll see the various comparisons here. Again, diarrheal controls are people that have diarrhea but are negative for C. difficile infection. Cases are people that have diarrhea and have C. difficile. We also have non-diarrheal controls, and these are people that we might consider healthy. We could clean this up a little bit, but for this episode, what I really want to focus on are the lines, and maybe we'll come back to those comparisons and making the legend look a little bit nicer towards the end of the episode.
So one thing about this plot that doesn't look so great, if I were to, say, project this figure for a presentation, is that the lines are a little bit thin. So I'd like to thicken the lines to make them pop a little bit better. One of the things we can give geom_line as an argument is the size argument. The size for geom_line is going to affect the thickness of that line. The size argument, you'll remember, with geom_point affected the size of the plotting symbol. But again, with geom_line, size is going to be the thickness of the line. If I do size equals two, we get a pretty thick line. I think that's maybe a little bit too thick. It's almost cartoonish. It's like someone drew with a magic marker. Let's go ahead and knock that down to one. And so yeah, that doesn't look so cartoonish.
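As a rough sketch of the thickness setting just described: the data frame `roc_toy` and its column names below are made-up stand-ins, not the episode's actual data. The episode sets `size = 1`; in ggplot2 3.4.0 and later the same setting for lines is called `linewidth`, which is what this sketch assumes.

```r
library(ggplot2)

# Toy stand-in for the episode's ROC data frame; the column names and
# values are assumptions made for illustration only.
roc_toy <- data.frame(
  specificity = c(1, 0.8, 0.5, 0, 1, 0.6, 0.3, 0),
  sensitivity = c(0, 0.6, 0.9, 1, 0, 0.5, 0.8, 1),
  comparison  = rep(c("NonDiarrhealControl_Case", "DiarrhealControl_Case"), each = 4)
)

# A constant passed to geom_line() outside aes() applies to every line.
# The episode writes size = 1; ggplot2 >= 3.4.0 names this linewidth.
p_thick <- ggplot(roc_toy, aes(x = 1 - specificity, y = sensitivity, color = comparison)) +
  geom_line(linewidth = 1)

p_thick
```

Swapping the 1 for a 2 should give the heavier, marker-like line mentioned above.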
And it's still a pretty bold and pronounced color, without being so big, and not so thin that it's hard to see the line.
The other thing people like to change for a line plot is the type of line being plotted. We've been plotting a solid line. We could also add the argument linetype to get different types of lines. So one is solid. I could also do two, a dashed line. I could also do three to get a dotted line and four to get a dashed and dotted line. This looks a little bit too pixelated for me, so I'm not such a fan.
One of the things I want to call your attention to is that for both size and linetype, when I define those within geom_line without using aes, all of the lines in the plot then get that characteristic. They're all size one, and they're all linetype four in this case. If I wanted to change the line type for my different comparisons, well, I could then pull this out, and up here in my aesthetic for ggplot, I could then do linetype equals comparison. And now you see that I have a solid line for the diarrheal control-case comparison, a dashed line for the non-diarrheal control-case comparison, and I think some other type of dashing for this other comparison. I'm not so sure that this linetype variation works really well for these data. I think I'd much rather have a solid line.
To be honest, I think the linetype aesthetic perhaps had more appeal back in the days when color publishing wasn't so common and was considerably more expensive. Back in the bad old days when I was a grad student, we had to pay extra for color images in our papers. So you could do all sorts of tricks like changing the plotting symbol, as well as changing the line type. Anyway, I think we'll go back to a solid line before we're all done, but I want to show you one other thing. The linetype equals comparison, I'll go ahead and pull that back out and put linetype back down here.
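To make the contrast between a constant and a mapped line type concrete, here is a minimal sketch. It assumes a small invented data frame (`roc_toy`, with hypothetical column names), not the data from the episode.

```r
library(ggplot2)

# Toy stand-in for the ROC data; names and values are assumptions.
roc_toy <- data.frame(
  specificity = c(1, 0.8, 0.5, 0, 1, 0.6, 0.3, 0),
  sensitivity = c(0, 0.6, 0.9, 1, 0, 0.5, 0.8, 1),
  comparison  = rep(c("NonDiarrhealControl_Case", "DiarrhealControl_Case"), each = 4)
)

# Constant: linetype sits outside aes(), so every line is dot-dashed (type 4).
p_constant <- ggplot(roc_toy, aes(x = 1 - specificity, y = sensitivity, color = comparison)) +
  geom_line(linetype = 4)

# Mapped: linetype sits inside aes(), so each comparison gets its own type.
p_mapped <- ggplot(roc_toy, aes(x = 1 - specificity, y = sensitivity,
                                color = comparison, linetype = comparison)) +
  geom_line()
```

Printing `p_constant` and `p_mapped` one after the other shows the difference: one shared line type versus one line type per comparison.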
One of the cool things about linetype, which makes me wish I used it more, is that instead of giving it a number between, say, one and six, you can give it a name. So we could say linetype "solid", which is the same as linetype one, and we will get out a solid line. We could also do "dashed". This will give us a dashed line, which is the same as linetype two. Then we could do "dotted", which is the same as linetype three. And then we could do "dotdash" to get a dotted and dashed line. That would be the same as number four. Also, number five is "longdash", and number six is "twodash". Anyway, there are all these different things you can do. I often can't keep track of the names because there is a little bit of jargon to them once you start combining dots and dashes. I'll do one through six, figure out which line type I like the best, and go with that. Again, I'm going to stick with the default, which is the solid line type.
So as I was showing you with linetype, if I put linetype down here in the parentheses for geom_line, without using the aes function, then all of the lines will be of the same type. However, if I put an aesthetic up here in the aes function, then what we're going to do is map the different values from the column of the data frame that we tell it to the aesthetic we're interested in. So when we did linetype equals comparison, each of our three comparisons got mapped to a different line type.
One thing I'd like to do, instead of mapping color to the comparison, is see if we can map it to the disease status. This raises another important point about geom_line, which is that geom_line needs information about what points to connect. By default, it will use either the color aesthetic or the group aesthetic, and so here it's using the color aesthetic to do that. If we look at roc_data, we see that disease_stat is this first column of TRUEs and FALSEs.
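For anyone who, like the speaker, has trouble keeping the line type names straight, a small lookup chart can help. This is only a sketch: the `demo` data frame is hypothetical, and the chart simply draws the six names from this part of the episode next to their numbers.

```r
library(ggplot2)

# The six named line types, in the order they pair with the numbers 1 to 6.
linetype_names <- c("solid", "dashed", "dotted", "dotdash", "longdash", "twodash")

# Hypothetical demo data: one horizontal segment per line type.
demo <- data.frame(
  number = 1:6,
  name   = linetype_names,
  stringsAsFactors = FALSE
)

p_types <- ggplot(demo) +
  geom_segment(aes(x = 0, xend = 1, y = number, yend = number, linetype = name)) +
  scale_linetype_identity() +  # use the names in the data as the line types
  scale_y_reverse(breaks = 1:6, labels = paste(1:6, linetype_names, sep = " = ")) +
  labs(x = NULL, y = NULL)

p_types
```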
TRUE refers to the positive group, like diarrheal control or case, whereas FALSE is the negative group, like the non-diarrheal control, or the diarrheal control when comparing to the case. And so what it's doing in this figure is connecting all of the TRUE values with the teal line and all of the FALSE values with the salmon line. What I'd rather have it do is connect by the comparison. So what I can add then is group equals comparison to my aesthetic line. Pretty cool, right? We see the teal segments for TRUE indicating the positives, and the red or salmon segments indicating FALSE, the negative cases.
One idea I had would be to recode FALSE and TRUE as our three different disease status groups. That way, if this line was, say, blue and red for the two diarrheal samples with and without C. difficile, then I could use the coloring scheme that I've been using with my strip charts for those three different diagnosis groups. Otherwise, I'm going to have a different coloring scheme for these comparisons. So I think I'm going to go ahead and give it a shot, because I think it'll allow us to see some other things within R and R's power.
To do this, what I need to do is modify that disease_stat column to be the actual disease status. Let's go ahead. I'm going to take the comparison column and actually split it in two. We can do this with a function called separate. So we can say separate comparison, and I will do into and then a vector of what I want the columns it's separating into to be called. We'll do negative and positive. And I'm going to add the argument remove equals FALSE; remove equals TRUE would remove the comparison column, but I want to leave it in there because, again, I need that to link all the points together. I'll pipe that there. Actually, for now, I'm going to leave this off so I can see what things look like. And so now I've got my negative and my positive and my disease status.
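Here is a minimal sketch of that separate step, assuming the comparison values are two group names joined by an underscore. The data frame and the exact spellings of the group names are invented for illustration.

```r
library(tidyr)

# Toy stand-in; the column names and comparison spellings are assumptions.
roc_toy <- data.frame(
  disease_stat = c(TRUE, FALSE, TRUE, FALSE),
  comparison   = c("NonDiarrhealControl_Case", "NonDiarrhealControl_Case",
                   "DiarrhealControl_Case", "DiarrhealControl_Case"),
  stringsAsFactors = FALSE
)

# Split comparison at the underscore into two new columns.
# remove = FALSE keeps the original comparison column for grouping later.
roc_split <- separate(roc_toy, comparison,
                      into = c("negative", "positive"),
                      remove = FALSE)

roc_split
```

By default, `separate` splits on any run of non-alphanumeric characters, which is why the underscore does not need to be named here.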
So I can then pipe this into a mutate of disease_stat. And I'll do an if_else on disease_stat. If the value is TRUE, then if_else will use the value in the second slot, I guess it is, and that will be the value from the positive column. If it's FALSE, we'll use the value from the negative column. And so if we look at that, we now see that we've got our disease statuses labeled here. That way, we can again have our three different colors for our three different disease statuses. We can then pipe that to ggplot. And we'll do group equals comparison, because that'll still be there, and color by disease_stat. Let's give that a run. And here we go, right? We've got our case in red, our diarrheal control in green, and our non-diarrheal control in blue.
You know, I'm not totally sold that I like this, but it's kind of cool. It's a different way of presenting the data so that I can again use the color scheme that I've been using over the course of these various episodes. But I'm not totally sold on this, and I don't know that I want to spend a lot of time fixing up the colors and fixing up the disease status labels. I think we'll probably go back to what we had before with solid colored lines.
One other thing I wanted to point out to you is that if you look at the joins of the lines, you'll notice that they're no longer smooth like they were before when we had all one color for the line. This is what's called the line end. So we can change the way those lines come together with lineend. In quotes, I can put "round", and sure enough, we see that, and that looks very nice. An alternative would be "butt". Butt seems to be what we had before. And then "square" squares off where they come together. I don't know that I really like that. Let's go back to "round", and that looks a little bit more appealing.
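Putting the recoding and the plot together might look something like the sketch below. Everything about the data is assumed (a made-up `roc_toy` with hypothetical column names), and `linewidth` is the ggplot2 3.4.0+ name for what the episode sets with `size`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Toy stand-in for the ROC data; names and values are assumptions.
roc_toy <- data.frame(
  specificity  = c(1, 0.8, 0.5, 0, 1, 0.6, 0.3, 0),
  sensitivity  = c(0, 0.6, 0.9, 1, 0, 0.5, 0.8, 1),
  disease_stat = c(FALSE, TRUE, FALSE, TRUE, FALSE, FALSE, TRUE, TRUE),
  comparison   = rep(c("NonDiarrhealControl_Case", "DiarrhealControl_Case"), each = 4),
  stringsAsFactors = FALSE
)

# Replace TRUE/FALSE with the actual group name for that comparison.
roc_status <- roc_toy %>%
  separate(comparison, into = c("negative", "positive"), remove = FALSE) %>%
  mutate(disease_stat = if_else(disease_stat, positive, negative))

# group keeps each comparison as one connected line, color varies along it,
# and lineend = "round" smooths where differently colored segments meet.
p_status <- ggplot(roc_status, aes(x = 1 - specificity, y = sensitivity,
                                   group = comparison, color = disease_stat)) +
  geom_line(linewidth = 1, lineend = "round")

p_status
```

Changing `lineend` to `"butt"` or `"square"` in the last call is a quick way to compare the three options.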
Again, if I were to run with this and fix up the colors and the legend labels, I would stick with this lineend of "round". Know that lineend is the end of the line and what shape the ends of the line have. It's treating each of these little segments as a different line, which is why we need to set that lineend type.
There also is an argument, linejoin, which affects how those lines come together. Perhaps we can see that by going back. Let's go ahead and drop these lines, and we'll come back to color equals comparison. I don't need that group equals comparison anymore. Then again, we could do linejoin of "round", which I believe is the default. Yeah, it looks like the default. We could also do "mitre", and the mitre is going to be more of a square step line. And then we could do "bevel". With the bevel, if you kind of zoom in on this, we see that where it joins, it's not round and it's not square. It's beveled, I guess is the way to say it. I'm going to stick with the default, which was round.
So we're going to work with this figure for a few more episodes. I'd like to make it look a little bit nicer. Let's start by changing the labels on the x and y axes. Again, we can do that with labs. We can do x equals "1 - Specificity", and I'm going to capitalize Specificity, and y equals "Sensitivity", again with Sensitivity capitalized.
The next thing I'm going to change are my comparisons. I'm going to change the order of them by modifying the comparison column to make it a factor. So we'll do mutate comparison equals factor on comparison. And then we need to give it a vector with our three different comparisons. So we had non-diarrheal control underscore diarrheal control, then non-diarrheal control underscore case, and then we'll also do diarrheal control underscore case. And we need levels to indicate the order, right? We'll use these same names. Pop that in there.
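A sketch of the axis labels and the factor reordering could look like this. The comparison spellings, the `roc_toy` data frame, and the name `comparison_order` are assumptions for illustration, and `linewidth` again stands in for the episode's `size`.

```r
library(dplyr)
library(ggplot2)

# Desired legend order; these spellings are assumed, not taken from the episode.
comparison_order <- c("NonDiarrhealControl_DiarrhealControl",
                      "NonDiarrhealControl_Case",
                      "DiarrhealControl_Case")

# Toy stand-in for the ROC data, deliberately listed in a different order.
roc_toy <- data.frame(
  specificity = rep(c(1, 0.5, 0), times = 3),
  sensitivity = c(0, 0.6, 1, 0, 0.9, 1, 0, 0.8, 1),
  comparison  = rep(rev(comparison_order), each = 3),
  stringsAsFactors = FALSE
)

# levels = sets the order that the legend will follow.
roc_ordered <- roc_toy %>%
  mutate(comparison = factor(comparison, levels = comparison_order))

p_ordered <- ggplot(roc_ordered, aes(x = 1 - specificity, y = sensitivity, color = comparison)) +
  geom_line(linewidth = 1, linejoin = "round") +
  labs(x = "1 - Specificity", y = "Sensitivity")

p_ordered
```

Any value of `comparison` that is missing from `levels` would turn into `NA`, which is one reason copying the names rather than retyping them is the safer route.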
And then we can pipe this to ggplot. And so then we get the order that we want, right? Again, the comparisons of the non-diarrheal control against those two groups of samples with diarrhea, regardless of C. difficile status, perform very well when just looking at inverse Simpson diversity, whereas this comparison of the two diarrheal groups with and without C. difficile performs pretty poorly for differentiating using the inverse Simpson value.
As we've seen before, we could then also add a scale_color_manual. We'll do name equals NULL so that we don't have a name for our legend. We'll do breaks, and I'm going to copy this vector because those names are just painful to type in, and probably painful for you to listen to and watch me type. And then we'll do values, which are going to be the colors of these three comparisons. I'm going to go ahead and do orange, purple, black. These aren't great colors, but I don't want to spend a lot of time thinking about the colors. And so again, we've got three different colors that are different from the colors that we had for our three disease status groups. If we were spending more time on this, we'd think about what colors we might use, but for now, we're cool. This is good.
Okay, and now we need to change our labels. I will do labels equals, and it's going to be a vector, right? So we will do healthy versus diarrheal, and I'm going to use ggtext to format this. So we'll do C. difficile negative, and I need another star after that. And let's go ahead and put in a break there. We're going to keep going with this. So we will do healthy versus diarrheal, C. difficile positive. And then we'll do the third one, which will be C. difficile negative versus C. difficile positive. All right, and then let's put in a break here. And that should be good. Let's give this a run. Oop, I notice I misspelled labels, and I've got an extra comma there.
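The manual color scale described here can be sketched as follows, under a few assumptions: the data frame and the comparison spellings are invented, and the labels are plain text rather than the markdown the episode writes for ggtext, so this sketch does not depend on that package.

```r
library(ggplot2)

# Assumed spellings of the three comparisons, in legend order.
comparison_order <- c("NonDiarrhealControl_DiarrhealControl",
                      "NonDiarrhealControl_Case",
                      "DiarrhealControl_Case")

# Toy stand-in for the ROC data; names and values are assumptions.
roc_toy <- data.frame(
  specificity = rep(c(1, 0.5, 0), times = 3),
  sensitivity = c(0, 0.9, 1, 0, 0.8, 1, 0, 0.6, 1),
  comparison  = factor(rep(comparison_order, each = 3), levels = comparison_order)
)

p_colors <- ggplot(roc_toy, aes(x = 1 - specificity, y = sensitivity, color = comparison)) +
  geom_line(linewidth = 1) +  # size = 1 in the episode
  scale_color_manual(
    name   = NULL,                             # no legend title
    breaks = comparison_order,                 # which values, in which order
    values = c("orange", "purple", "black"),   # one color per break
    labels = c("Healthy vs. diarrheal, C. difficile negative",
               "Healthy vs. diarrheal, C. difficile positive",
               "C. difficile negative vs. C. difficile positive")
  )

p_colors
```

The `breaks`, `values`, and `labels` vectors are read position by position, so all three need to list the comparisons in the same order.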
And so that looks good, except I forgot to turn on my ggtext stuff with the theme. I'm going to move my theme_classic to the bottom here. I always like to put the theme stuff at the very end, just so that I know where all my bells and whistles are. So I'll do theme, and then we'll do legend.text equals element_markdown. So that looks okay.
The spacing between the different elements in the legend is a little bit tight. All right, we'll go ahead and add legend.key.height equals unit. I'll go ahead and put 20, and then I need to include the unit; I'll do "pt" for points. And so that gives us a little bit more separation. What happens if we did, like, 25? That gets us more separation. I think that's good enough. I just want to illustrate that we can change the spacing between the different elements in our legend.
Now let's go ahead and move this into the interior of the plotting window. We can do that with legend.position, and we give it coordinates as a vector, so x and y. It's relative to the size of the window. So let's do, like, 0.6, 0.2 as a starting place. And so yeah, everything gets kind of moved around. Let's go ahead and move that to the right a little bit. I think that the y position is fine. So let's do eight. And then let's go ahead and make this four for the width. Everything keeps changing. So let's do a width of five. And I meant to do 0.8 there, didn't I?
So I think this is pretty attractive. It looks pretty decent. I might play around with the colors a little bit more. Again, I like to have a consistent color scheme across my figures. This is something that's going to gnaw at me, I know. Anyway, I think this looks pretty good as it is.
So again, what we covered in today's episode is thinking about how we can modify, say, the size of the line to make it thicker, to make it a little bit more bold. Alternatively, you could make it skinnier to make it more fine.
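Going back to the legend adjustments from a moment ago, here is a rough sketch of the spacing and position settings. The data frame is a made-up stand-in, and the `element_markdown` step is left out so the sketch does not depend on ggtext.

```r
library(ggplot2)

# Toy stand-in for the ROC data; names and values are assumptions.
roc_toy <- data.frame(
  specificity = c(1, 0.8, 0.5, 0, 1, 0.6, 0.3, 0),
  sensitivity = c(0, 0.6, 0.9, 1, 0, 0.5, 0.8, 1),
  comparison  = rep(c("NonDiarrhealControl_Case", "DiarrhealControl_Case"), each = 4)
)

p_legend <- ggplot(roc_toy, aes(x = 1 - specificity, y = sensitivity, color = comparison)) +
  geom_line(linewidth = 1) +  # size = 1 in the episode
  theme_classic() +
  theme(
    # Taller keys put more space between legend entries.
    legend.key.height = unit(25, "pt"),
    # x and y as fractions of the plotting area, as in the episode.
    # ggplot2 >= 3.5.0 warns here and prefers legend.position = "inside"
    # together with legend.position.inside = c(0.8, 0.2).
    legend.position = c(0.8, 0.2)
  )

p_legend
```

Because the position is relative to the plotting area, resizing the window or changing the saved width shifts where the legend lands, which matches the "everything keeps changing" moment above.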
We could change the colors. We could change the color between the intervals. We saw how we could get this checkered look, indicating the color for each of the different disease status groups. Ultimately, I didn't think that worked very well. Again, you've got to experiment with things to know whether you like them or not. We played around with how the line segments within a line come together, as well as how the line segments from different lines come together, with the lineend and linejoin arguments for geom_line. And we also talked a bit more about the difference between putting an aesthetic within the parentheses of, say, geom_line, versus putting it in the aes function for the mapping within ggplot2. Again, there are lots of powerful things that we can do as we play around with building these plots and trying to make them attractive.
Something I'll leave you with is to see if you can play around with the colors to maybe make them look a little bit more attractive. Alternatively, see if you can go back to what we had before, where we had that checkered look with different colors corresponding to the different disease status groups. See if you can make that look any better. I don't know. Give it a shot. Let me know what you find out down below in the comments. Experimentation is really valuable. I encourage you to do it. We do it in science all the time, so why not also do it with our visuals? I think if you're not producing plots that look hideous, then you're really not trying. So anyway, give this a shot. Let me know what you think of this. Let me know what you think of that idea of having that checkered color look to indicate what disease statuses are being compared by that line.
Please keep practicing with this material. Again, as I say frequently, you've got to practice this stuff to make it stick. It's not enough to just watch me make these videos and watch me code. You've got to do it yourself, right?
And so you've got to engage with the material, and I've been so heartened to hear that people are actually doing that and then applying it to their own data. That is awesome. You are doing amazing things. I really appreciate you. I really appreciate the time you're spending watching these videos, giving me feedback, and asking me questions to move us forward. Anyway, we've got more cool stuff on the way. Please tell your friends what we're doing. Subscribe if you haven't already. Welcome to all the new people that have recently subscribed, and we'll see you next time for another episode of Code Club.