Hi, I'm John Little, and you're watching the Introduction to ggplot video, part of the Introduction to R learning series. This learning series is sponsored by the Center for Data and Visualization Sciences, part of the Duke University Libraries. In part two of this two-part section we're going to focus on ggplot and visualization. We'll learn about the grammar of graphics and the ggplot code template for making a visualization, and then we'll learn about some of the various features, arguments, and functions that we can use to visualize data. So let's get started.

Starting off at our workshop series page, this is part two of the Quick Start with R, an introduction to visualization with ggplot. You can find this guide from either one of these, but also at GitHub. And specifically I'm going to use this code right here, ggplot_quick.Rmd. I'm actually going to start with the HTML rendered report because it's a little easier to do the explanation, but you can get the code from this link right here and try it yourself on your machine.

All right, we're going to load the tidyverse, a meta-package of eight tidyverse packages that includes ggplot. It also includes forcats and dplyr. Those are useful. More information about those there.

So the gg in ggplot stands for grammar of graphics. This is a concept that was championed and developed by Hadley Wickham. There's a basic template that you can use to generate any ggplot plot. It looks like this. ggplot is a function: you say data equals and give it a data frame, and then you map aesthetics. So mapping equals aes, for aesthetics, and inside the aes function you usually map x to some variable in the data frame and y to some variable in the data frame. Then you have a conjunction, the plus sign, and then you visualize a layer using one of the many geometric function layers. All of that we're going to talk about in detail. This is the formal template.
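As a minimal sketch of the formal template just described (an illustration, not the workshop's own code), the following assumes the ggplot2 and dplyr packages are installed and borrows the starwars data frame that the video uses later. The object name p_formal is hypothetical.

```r
library(ggplot2)
library(dplyr)   # provides the starwars data frame

# Formal template:
#   ggplot(data = <DATA>, mapping = aes(<MAPPINGS>)) + <GEOM_FUNCTION>()
p_formal <- ggplot(data = starwars, mapping = aes(x = height, y = mass)) +
  geom_point()   # the layer added after the "+" conjunction

p_formal         # printing the object draws the plot
```

Every argument is spelled out here; rows with missing height or mass are dropped with a warning when the plot is drawn.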
In practice, since we're doing tidyverse, we tend to use a template that looks like this. We take the data frame and then put in a pipe, which we can think of as saying "and then." So data frame, and then ggplot, and then aesthetics. We don't have to use all of the formal argument names. We can just identify the x variable and the y variable in the aesthetics and then give it a geometric function.

So for this workshop, let's set a goal: we're going to try to create a scatter plot of mass over height, that is, the relationship between mass and height for characters in the starwars data set, which is an onboard data set, part of the dplyr package. The other part of our goal is simply to draw a regression line, as you can see. The starwars data set looks like this. You can scroll right and left, you can scroll up and down, but basically it's characters in Star Wars films, and the data variables that we're interested in include mass and height. We're going to visualize that.

Now, to create or initiate a ggplot plot, we start by identifying the data frame, starwars, and then send that to ggplot, and what we get is a gray box. So it kind of feels like we didn't do much, but that's the initial step. No one ever just stops here, of course, so let's keep on going.

All right, in our case I'm actually going to filter out the heaviest character, which is Jabba the Hutt, who weighs more than a thousand kilograms. Let's just filter him out because it will make it a little bit easier to look at the results. So we're going to use standard dplyr data manipulation: starwars, and then filter where mass is less than 500 kilograms, and then pipe all of that to ggplot and assign the x-axis to height and the y-axis to mass. All right, when we do all that, ggplot will draw the basic grid for us.
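The first two steps described above can be sketched as follows. This is a hedged illustration assuming the dplyr and ggplot2 packages are installed; the object name p_grid is hypothetical.

```r
library(dplyr)
library(ggplot2)

# Step 1: the data frame alone produces an empty gray panel
starwars %>%
  ggplot()

# Step 2: drop the heaviest character, then map height and mass.
# Without a geom layer, only the grid lines and axes are drawn.
p_grid <- starwars %>%
  filter(mass < 500) %>%
  ggplot(aes(x = height, y = mass))

p_grid
```

Note that filter() also drops rows where mass is missing, because a comparison with NA is not TRUE.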
ggplot puts in some grid lines and the tick marks that relate to the data variables we have, mass and height, but we still haven't visualized the data, because we need to do one more step, which is to visualize a layer. Visualizing layers is usually done with geometric functions. I'll talk about those in more detail, but in this case the geometric function is geom_point, which makes a scatter plot. So up until this point everything is the same, and then geom_point, and that gives us this basic scatter plot.

All right, now I'd like to point out some of the argument characteristics that we have with ggplot. One is mapping aesthetics. Typically, as up here, you'll map aesthetics globally; that's done up in the ggplot function. But you can also assign local aesthetics and arguments in each layer. If you assign them globally, they carry through and are accessible to all of the layers. If you assign or set them in a geometric function, they're available only to that function. So in this case, what's available to all of the future ggplot layers is a data frame, one that's been limited to characters with mass less than 500 kilograms. And then we get the exact same plot as above by mapping the aesthetics inside the geom_point layer as opposed to inside the ggplot function. It looks the same. But the ability to move those mappings around comes in handy.

All right, so what are mappings? You've got a sense of that already. We map x and y, but you can map other visual characteristics, like color, or fill, which is another color argument. Line type, if you're doing lines: do you want a dashed line or a solid line? How wide do you want the line to be? Opacity, which is set with an argument called alpha. Shape: if you're doing scatter plots, do you want circles? Do you want circles that can be filled in? Do you want squares or plus marks or x marks?
There are actually many different kinds of aesthetic arguments that you can set, and they're usually specific to each geometric function. So you can always go back and check the help documentation for which arguments you can use.

In this case, we're going to start with color. Up until this point, it's the same as everything we've seen before. I added in a comment to reflect the more formal argument assignment, but this is the actual argument, and we're adding in color equals gender. Gender is another variable in the data frame. If we scroll up real quickly and scroll to the right, you can see we have both a sex variable and a gender variable. In this case, gender is listed as masculine or feminine or not listed. And so we assigned color to gender, and ggplot then automatically colored the points, automatically drew a legend, and automatically labeled the legend. You can see the results here. The feminine characters are shown in a kind of red pink and the masculine characters are shown in a teal blue.

So far we've been mapping aesthetics, that is, assigning some visual characteristic to the data values based on a data variable. If the value of the variable was feminine, ggplot took care of choosing a color, but it consistently applied that color to all the feminine points on the plot. The same is true for masculine.

Rather than mapping these characteristics, you can also set them manually to override the default. Mapping takes place inside the aes function. Setting takes place outside of it. So in this case, outside of the aes function, I use the same argument, color, and I assign it specifically to a chosen color, a color option that works in ggplot. There are many color options. We'll talk about those in a minute. In this case, I chose the goldenrod option. You'll see here that all the points have been changed to this gold-yellow color, goldenrod, whereas the default was to have them represented as black.
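To make the distinction concrete, here is a small sketch contrasting a mapped aesthetic with a set one. It assumes dplyr and ggplot2 are installed; the names sw, p_mapped, and p_set are hypothetical and not taken from the workshop code.

```r
library(dplyr)
library(ggplot2)

sw <- starwars %>% filter(mass < 500)

# Mapped: color sits inside aes(), so it varies with the gender variable
# and ggplot draws and labels a legend automatically
p_mapped <- ggplot(sw, aes(x = height, y = mass)) +
  geom_point(aes(color = gender))

# Set: color sits outside aes(), so every point gets the same fixed color
# and no legend is drawn
p_set <- ggplot(sw, aes(x = height, y = mass)) +
  geom_point(color = "goldenrod")

p_mapped
p_set
```

A common slip is writing aes(color = "goldenrod"), which maps a constant string as if it were data and does not produce goldenrod points.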
All right, so that was all done in the basic template: mapping or setting aesthetics in a particular geometric function called geom_point. There are lots of geom functions. Some of the more common ones are now listed on your screen. If you want to do a bar graph, you could use either geom_bar or geom_col. The difference is that geom_bar will calculate the row totals for you, whereas you typically use geom_col after you have calculated the totals yourself. You can check the documentation. geom_histogram is useful for showing frequency distributions. For scatter plots, we already learned one, geom_point. There's another one called geom_jitter, which is used to deal with overplotting. You can make a line graph with geom_line. You can make a boxplot with geom_boxplot.

If you click on this link right here, you'll get to the ggplot documentation, specifically to the geom section. You can see a listing of all of the various geoms. And if you drill down on any one of those, for example, if we click on geom_point, we get the documentation for geom_point. And here's a listing of the aesthetics that can be mapped or set.

Okay, so just as we could take a transformed data frame and visualize it with geom_point, we can do something similar and visualize it with geom_boxplot. In this case, I'm doing something special in this line to limit the number of boxplots I see, but I've got boxplots for the various species in the starwars data set. The box represents the interquartile range, or the middle 50%. The line in the box represents the median, so half of the characters are above that line and half are below. The line leading out of the top of the box is the top quartile, and the line leading out of the bottom of the box is the first quartile of the data. And then the dots are the outliers.

A line graph can be drawn similarly with geom_line, okay? You could combine geom_line and geom_point. We'll get to that in a second.
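A boxplot along the lines described above might look like the following sketch. The video does not say how the species were limited or which variable is on the y-axis, so the filter on three species and the choice of height are assumptions made purely for illustration, as is the object name p_box.

```r
library(dplyr)
library(ggplot2)

# Assumption: keep three species so the plot stays readable;
# the workshop code limits the species in its own (unspecified) way
p_box <- starwars %>%
  filter(species %in% c("Human", "Droid", "Gungan")) %>%
  ggplot(aes(x = species, y = height)) +
  geom_boxplot()   # box = interquartile range, inner line = median

p_box
```

Swapping geom_boxplot() for another geom function is all it takes to change the kind of plot, provided the geom accepts the same aesthetics.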
Let's talk real briefly about overplotting. Overplotting, in the case of a scatter plot, is when you have multiple points sitting on top of each other and you can't distinguish how many points are sitting in that one spot. There are two ways to handle that. You can either change the opacity or you can use something called geom_jitter.

Changing the opacity is simply an argument, alpha. In this case, I'm setting alpha to 30%. The more points in the data that sit in the same spot, the darker that spot becomes, meaning there are multiple points right here, several points overlapping right there, and points overlapping right there. So that's one way to deal with overplotting.

Another way to visualize overplotting is to simply use the geom_jitter function. That pushes the data points away from sitting on top of each other, which gives you a sense of a cluster, that there are multiple points there. It won't alter the data itself in the data frame, so you can keep your data pristine. You're just visualizing and representing that overplotting in a special way.

All right, let's talk about layers. We've alluded to layers so far. All of the plots we've done so far, I believe, have a single layer, where we use the ggplot conjunction and then one layer, in this case geom_jitter. But we could use two layers. Now, this plot is a little more complicated, but let's just draw out the highlights. There are two layers here: one is a geom_line and one is a geom_point. In the global aesthetics, I'm mapping the x variable to year, so it's a time series plot, and I'm mapping the y variable to prop, the prop variable. This is the proportion of names, and somewhere else I actually filtered the names down to just two, John and Elizabeth. Then in geom_line, I'm mapping color to sex. So I've got male names and female names.
So this first part of the plot gives me the blue line and the pink line, and then I add another layer, a point layer, where I set the shape of the points to crosses. That's where these little x marks show up. And I set the opacity, the alpha argument, to 40%, which is why they appear gray rather than black: they're partially see-through. Now, if you want to see the whole code for that, it's right there, and you can run it on your machine.

But remember, now that we know all those basic steps, how to map aesthetics, how to set aesthetics, how to pipe in your data, how to add multiple layers, and how to use the global or the local settings, our goal was to create a scatter plot that showed the relationship of mass to height, which we've got here, and that also drew a regression line. That's just one more layer. The layer called geom_smooth draws this blue line. It uses the formula that's identified by x and y up here, right? It's doing a linear regression, or a linear model; that's the method. The linear model wants to know what the response variable is and what the predictor variable is. The response variable would typically be the y variable and the predictor variable would typically be the x variable, and since we've already identified those here in ggplot, it simply picks up on that. The formula doesn't have to be written out again because it knows what to do with that x and y. In this case, we're identifying the linear regression model and we're setting the confidence interval, the standard error, to false, so that part isn't drawn on the screen. And as a result of geom_smooth, we get this blue line showing the basic prediction of mass from height.

This would be a good time to stop. You could do some practice sessions, and this would be a great place to go to, for example, the RStudio Primers and practice some of the visualization exercises.

So let's talk about what we can do with geom_bar, yet another geometric function.
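Before moving on to bar charts, the workshop goal described above (a scatter plot of mass over height with a regression line) can be sketched as follows. This assumes dplyr and ggplot2 are installed; the object name p_goal is hypothetical, and the exact workshop code may differ.

```r
library(dplyr)
library(ggplot2)

p_goal <- starwars %>%
  filter(mass < 500) %>%                  # drop the heaviest character
  ggplot(aes(x = height, y = mass)) +     # global mappings, shared by both layers
  geom_point() +                          # layer 1: scatter plot
  geom_smooth(method = "lm", se = FALSE)  # layer 2: linear model, no error band

p_goal
```

Because x and y are mapped globally, geom_smooth fits mass as the response and height as the predictor without the formula being restated.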
With geom_bar, we're taking a new data set, part of the ggplot package, msleep, which records the sleep of various animals, whether carnivores, herbivores, insectivores, or omnivores. What we're doing is piping that data set to ggplot and mapping the variable vore, that's carni, herbi, insecti, and omni, to the x variable. In the case of geom_bar, we only need that x variable, because the y variable is going to be a count of how many of each of those categories there are. That's what geom_bar does: it counts the rows and then gives you a bar total. So we know that herbivores are the most common, at slightly over 30, and we know that insectivores are the least common, at roughly five, it looks like.

Now, this plot is good enough, but what we want to do is learn about arranging. In order to learn about arranging, we're going to bring in another tidyverse package, a package called forcats. What forcats does is enable you to manipulate your character variables as if they were factors, as if they were categorical. And the reason we want to do that is that it makes it simpler to arrange these bars in order. We want to arrange them in descending order because that makes it easier for people to interpret the results, right?

So we're using the forcats library to invoke a function called fct_infreq, as you see right here. What that does is enable us to easily order the bars by their frequency. And then it becomes easier to tell that there's a difference between omnivores and carnivores. Nothing about this particular plot changed except that I used the fct_infreq, in-frequency, function inside of the aes function to transform the vore variable to a categorical factor and then present it in frequency order. That's what fct_infreq does. There are several forcats functions. They all do different things.
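The frequency-ordered bar chart described above can be sketched like this, assuming the ggplot2 and forcats packages are installed. The object names p_bar and p_bar_ordered are hypothetical.

```r
library(ggplot2)   # provides the msleep data frame
library(forcats)

# Default order: bars follow the alphabetical order of vore
p_bar <- ggplot(msleep, aes(x = vore)) +
  geom_bar()       # geom_bar counts the rows in each category

# Frequency order: fct_infreq() turns vore into a factor whose levels
# run from most to least frequent
p_bar_ordered <- ggplot(msleep, aes(x = fct_infreq(vore))) +
  geom_bar()

p_bar_ordered
```

Only the x mapping differs between the two plots; the data frame itself is not modified.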
The forcats functions allow you to manipulate categorical, or factor, data.

We've now got several pieces of the grammar that we can understand, and we can develop a slightly more sophisticated graph. In this case, it's a two-layer graph. We're going to map eye color in frequency order, and you'll notice I pulled out another forcats function here, fct_rev, which reverses the order of the bars. I'm doing that because the last thing I did, down here, is a coord_flip. I flipped the x-axis to the y-axis. I think actually now, with ggplot2 version 3.3, I could have just said y equals eye_color. But in any case it's good to know about coord_flip, because sometimes you want to flip the axes and that's a nice easy way to do it. But since I flipped the x-axis, I had to reverse the sorted order of the factor variable.

So after I generated a reverse-sorted factor of the eye color variable, I'm going to visualize all of it in gray, and you'll see that all the bars are gray except for this one. But there's actually a layer below this orange, a gray bar that fits perfectly underneath it. And then I'm going to do another layer, also a geom_bar function, and I'm going to identify a new data frame. In this data frame I'm going to use some dplyr functions to limit my data set to only that part of the data frame where the eye color variable has a value of orange. Okay, if you need to know more about dplyr, you can watch part one of the quick start video or you can watch the dplyr video. So I've subset the starwars data, the same starwars data, but in this case just to where eye color equals orange, and that's going to be roughly, I think, eight rows. And I've set the fill to a color called dark orange.

One thing to note here is that we've now used color in some cases and fill in other cases. Fill is usually the interior color, and color is typically the border around objects that accept a fill argument.
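The two-layer highlighted bar chart described above might be sketched as follows. This is an approximation under stated assumptions, not the workshop's exact code: it assumes dplyr, ggplot2, and forcats are installed, the gray shade "grey80" is a guess, and the object names are hypothetical.

```r
library(dplyr)
library(ggplot2)
library(forcats)

# Second data frame: only the rows where eye color is orange
orange_eyes <- starwars %>%
  filter(eye_color == "orange")

p_eyes <- starwars %>%
  ggplot(aes(x = fct_rev(fct_infreq(eye_color)))) +   # reversed frequency order
  geom_bar(fill = "grey80") +                         # layer 1: every bar in gray
  geom_bar(data = orange_eyes, fill = "darkorange") + # layer 2: orange bar on top
  coord_flip()                                        # bars run horizontally

p_eyes
```

The second geom_bar gets its own data argument, so it draws only the orange bar, directly over the gray bar of the same height.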
In the scatter plot, the shape that we used only accepted a color argument. There are other shapes, other circles, that have both a fill and a color. The default shapes we used didn't accept both of those arguments, but just know that there's a difference between fill and color. If you play with it, you'll see how they're different.

Another advanced thing that we can do is faceting, with two really nice functions called facet_wrap and facet_grid. In the case of facet_wrap, we're doing a standard scatter plot like the ones we've done before, but then we're faceting by a different variable called class. So two-seaters, compacts, midsize cars, subcompact cars, pickups, minivans, and SUVs. What it does is generate a subplot for each one of those categories, and that makes it a little easier to see the trend within each subcategory.

All right, let's talk about scales. Scales are used to affect the visual qualities of the data. I'm going to introduce color first. It's, I think, the easiest way to understand scales, but you can use scales for other reasons, and I have a second example. Let's say, for example, we have a variable in our msleep data frame called conservation. Conservation is the conservation status of each one of the animals. In this case, conservation has animals that are endangered, animals that are of least concern, animals that are domesticated, those kinds of things. I'll fill that out later. I can color subparts of each bar by filling in, using geom_col, with the conservation variable.

Now, the point I want to make about scales is that I filled these colors in with the fill argument, mapping the fill aesthetic to conservation. If I don't do anything at all, ggplot will choose colors for me, but there are other ways to manipulate that color, and we manipulate that color with scales. So in this case, we're going to find the appropriate scale fill function.
Since we used fill here, we're going to use scale_fill here. If we had used color up here, we would use scale_color there, all right? We'll use fill. So scale_fill_ and then we're going to use the viridis palette, and in this case the function is scale_fill_viridis_d, d for discrete. I think if you click on this link right here, you can see all of the various ways that we can use scale, and you'll see that viridis has d for discrete, c for continuous, and b for binned. We're using d for discrete because it's a categorical variable, and we're setting a particular value, gray, for the NA category. But other than that, viridis chose a vibrant set of colors that are easy to perceive and that can be printed on black-and-white printers and still be easy to perceive. Viridis has other palettes, but this is the default palette and it works quite well.

If I want to use a different color palette, I can. I have color palette options, and again it's all about using scale. I'm associating scale_fill with this fill argument that's mapped in this geometric function. So for example, other than viridis, I could use the ColorBrewer palettes. It's the same fill mapping that takes place here, but in this case I'm going to use scale_fill_brewer, and because I know that this is qualitative data, I'm setting the type to qual, and I'm still setting the gray. What happened was that brewer chose the default palette, and I get a different range of colors here to represent the differences in conservation type. I could also use an argument such as palette equals Dark2. I'll come back to that in just a minute.

Yet another way to manage color is to manage it manually. Okay, so if you want to pick your own colors, you can. Just as we were using scale_fill_viridis_d or scale_fill_brewer, we can use scale_fill_manual.
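The viridis and brewer fill scales discussed above can be sketched as follows, assuming ggplot2 is installed. The video mentions geom_col; geom_bar is used here as a stand-in so that the counts do not need to be computed first, and "grey50" is an assumed gray for the NA category. All object names are hypothetical.

```r
library(ggplot2)

# Base plot: fill is mapped to conservation, so scale_fill_*() applies
p_cons <- ggplot(msleep, aes(x = vore)) +
  geom_bar(aes(fill = conservation))

# Viridis, discrete variant, with gray for missing conservation values
p_viridis <- p_cons + scale_fill_viridis_d(na.value = "grey50")

# ColorBrewer, default qualitative palette
p_brewer <- p_cons + scale_fill_brewer(type = "qual", na.value = "grey50")

# ColorBrewer, a named palette
p_dark2 <- p_cons + scale_fill_brewer(palette = "Dark2", na.value = "grey50")

p_viridis
```

Had the aesthetic been color instead of fill, the matching functions would be scale_color_viridis_d and scale_color_brewer.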
So here I have set a character vector of colors, and the first argument in scale_fill_manual is values, and values needs to take color names or hex numbers. So for example, if I search hex color for dark orange, there's the hex value or the RGB value. You can use those as well, but sometimes the names are easiest. So values equals my color vector, and I'm still setting my grays for NA. Now I have yet another customized set of fill colors.

All right, I spoke up here about the fact that I could use different ColorBrewer palettes, and that one way I could do this is by using an argument, palette equals Dark2. But you might want to know, well, what are the colors that I can use? A simple way to find out is to go to Google, or whatever search engine you want, and just put in the phrase "R color names." Usually the first two or three results that come back are all useful in showing you the range of options and giving you color names. A specific way to do this with ColorBrewer, once you load the RColorBrewer library, is to call the function display.brewer.all, and it will draw this little palette chart for you. You can see the qualitative color palettes, the diverging color palettes, and the sequential color palettes. You can use all of these codes as you like, such as palette equals Dark2, or palette equals Set1, or palette equals BuPu for blue-purple.

Another way to use scales is to change the scales that you see on the outside of the graph, the axes. It's very common in data science to use a logarithmic scale to tease out part of the data story that is harder to see on a standard scale, right? So if we use the ChickWeight data, we can draw a line plot, a time series, with weight as a response variable to time, based on the diets that are being fed to the various animals, right?
But if you look at this data in a data frame, and you can use geom_histogram to view this data, you'll see that the data are right-skewed, and because they're right-skewed, the data story is a little bit harder to present in a visualization. So it's very common, in the case of skewed data, to present the data on a logarithmic scale. One of the scale options is scale_y_log10, and that simply changes the y-axis to a logarithmic scale. If you look at these two plots together, what you see is a slightly different data story. Indeed, it's clear here that there's a leveling off of the diets after a certain amount of time, versus here, where it looks like the diet gets better almost exponentially over time, which is not actually true. So that's another scale option.

So scales allow you to manipulate color, and they allow you to manipulate how the data are presented along the axes. They're ways to visually present the data that you have in your data frame.

Now this is perhaps another good place to stop. There's a good set of exercises that you can look at in the ggplot workshop that I have. If you go back to the guide, there's a good set of exercises; a former colleague of mine created a whole series of ggplot exercises that are nice to work through, and they all have answers. Or you can continue to work your way through the R Primer series. I'll just say that this is a good place to stop before I go on to labels.

Okay, this last section I'm not going to narrate quite as much, but let's just run through it. There are other things that you may want to manipulate about your graph. For example, the labels: you might want to change your x-axis label or your y-axis label, you might want to give the plot a title or a subtitle or a caption, or you might want to change your legend title.
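As a hedged sketch tying the last two ideas together, the following draws the ChickWeight data with a logarithmic y-axis and custom labels. It assumes ggplot2 is installed; ChickWeight ships with base R. Drawing one line per chick is an assumption about how the workshop plot is built, and the label text and the object name p_chick are invented for illustration.

```r
library(ggplot2)

p_chick <- ggplot(ChickWeight, aes(x = Time, y = weight, color = Diet)) +
  geom_line(aes(group = Chick)) +  # assumption: one line per chick
  scale_y_log10() +                # y-axis on a base-10 logarithmic scale
  labs(
    title    = "Chick weight over time",   # illustrative label text
    subtitle = "Shown on a log scale",
    x        = "Days since hatching",
    y        = "Weight (grams)",
    color    = "Diet",                     # legend title for the color aesthetic
    caption  = "Source: ChickWeight data set"
  )

p_chick
```

Removing the scale_y_log10() line gives the same plot on a standard linear axis, which makes it easy to compare the two versions.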
You can do all of that with the labs function, which you can see right here, and it's pretty self-explanatory. Labels are a specialized part of the scales function.

Themes, on the other hand, are typically not data related. They have more to do with the style of the plot that you're developing, and with the fact that you can quickly change a theme from one style to another. So you see here, built into ggplot2, several themes that you can choose just by using theme_classic versus theme_dark. Let's take a look at a couple of those. Here is our plot, just plotted with theme_dark. You'll notice that we started with the phrase plot_sleep. That's because you can assign an entire ggplot call to an object name, and the result will be accessible as this object, plot_sleep. So then if I want to change plot_sleep by theme, I can apply all kinds of ggplot functions to plot_sleep, not just themes. But here's a way that I can quickly change the default plot to theme_dark, or quickly change it to theme_classic.

I also really like the hrbrthemes themes, and one that I like in particular is theme_ipsum, which chooses this color palette. A couple of things are worth noting. It comes with some of its own scale options, so I'm using scale_fill_ipsum to also change the values of the variables in the legend. You can see that happens right there. And then one of the things that's new with ggplot2 version 3.3 is that I have an argument called plot.title.position, which is part of theme, and that allows me to move the title all the way over, left justified to the whole plot as opposed to justified to the panel. Other things that you can change with theme are things like the grid lines.

Okay, some other really cool things. There's a package called patchwork which allows you to combine plots together.
It's dead simple, and it's by a really interesting developer, Thomas Lin Pedersen. After you load the library, you can use the slash or the pipe, depending on whether you want the plot objects to be one over the other or side by side. In the case of the slash, it's one over the other, and that puts both of these objects, one over the other, into a single image.

If you want to do interactive plots, a really nice way to do that is to take your ggplot object and use the ggplotly function from the plotly library. That will take this bar graph, which looks similar to what we've already worked with, but whereas graphs like this don't have any interactivity, this graph does. So I can get fly-out windows on this bar graph. I can click and drag to zoom in to a particular part of the bar graph if I want to see really closely, like, what is that green thing that I couldn't see before? If you double-click, it'll zoom back out. By default it gives you this toolbar. You can turn it off if you want to.

And then some people really like animations. Animations are something that I would caution you about. It's great power, but you should use the power judiciously. Once you decide that you are going to animate something, the gganimate package, which will take some time to get used to, will generate moving aspects in your visualization.

So that's everything I wanted to tell you about ggplot. Thank you for listening. Feel free to have a look at these further resources. See you at the next part of my presentation.