All right, so welcome. Today we are going to be learning about data visualization using an R package called ggplot2. Our goals for the day: we're going to produce some scatterplots, boxplots, density plots, and time series plots using ggplot. We will set some local and universal plot settings. We'll describe what aesthetics are and how they're used. We'll learn about faceting and how to modify the aesthetics of your ggplot once you've made it, so how to make it look different, and then how to export publication-ready graphics using ggsave. I think we should get to all of that. And we're also using, gosh, my computer's super slow. I apologize. We're using this online tutorial. Does everyone have the link to the online tutorial? It should have been sent in an email that went out yesterday at 3 PM. And if you don't have it, you can see it at the top. It's rconnect.math.montana.edu slash capital-D Data underscore this underscore tutorial. And Greta probably put the link into the chat. So Greta and I and Mark Greenwood created this tutorial using a Shiny app, which is an R-based web app. All of the computation goes on behind the scenes, and you can run code through the browser so that you don't have to worry about using RStudio and struggling sometimes with technical issues. So this is all about the code. You'll learn how to write the code, and you can run it in the browser. Hopefully it'll all go smoothly and you'll learn some new code today. So it's just for playing around. We have these sandboxes that I'll show you where you can type the code, run the code, and then clear them. The tutorial also has a Check Your Work option, which sometimes works, but not always, because there are often many different ways to arrive at a right answer. So if you click Check Your Work and it doesn't work, you can also look at the solution for many of the challenges in this tutorial and check to see what it says there.
And if you run the code and you get what you're looking for, obviously you're doing it right. Okay. I think it's up to you. I'm gonna use my computer for the rest of the year, so. I'll just use mine. Okay. All right. I am going to increase the font size here a little. Oh, thank you. Just for my sake. One more. Okay, so we're going to be talking a little bit about packages and functions, and we're gonna work towards using a very particular function that will help us create plots. Hopefully you were able to attend the introduction workshop, so that you have an idea of what a function is, but if not, we'll go through it step by step. So, to create ggplots, we're going to be using some packages that come in a suite, or an encompassing package, called the tidyverse. The tidyverse is like a series of encyclopedias, and there are different editions that have different features in them, including tidyr and dplyr, which is sort of like SQL or, I can't think of it, database queries, and then different ways of importing data and some other functions. We're not going to be using a lot of these. We'll get into more of the other packages in the data wrangling workshop that will run in the spring. And if you don't want to wait until the spring, we have those online. So briefly, the tidyverse tries to address a lot of common issues that arise when doing data analysis, and it tries to solve complex problems by combining many simple pieces. One quote is, no matter how complex and polished the individual operations are, it is often the quality of the glue that most directly determines the power of the system. Hal Abelson said that. And the code is also written for people to read. If you can't read it yet, don't worry. Sometimes it takes a while to learn a new language, but it's supposed to be a language that doesn't take as long to comprehend and understand.
Hadley Wickham said computer efficiency is a secondary concern because the bottleneck in most data analysis is thinking time, not computing time. And I think even in the ChatGPT world, figuring out your prompt probably takes the most time, rather than waiting for the answer from ChatGPT. So I think that this is still relevant. And I still haven't used ChatGPT, so I'm talking about something that I don't know anything about. But moving on. If you were doing this workshop in RStudio, which I hope you're not, I hope you're using the web app, the first thing that you would need to do is load the tidyverse library, and you would do that using the function library with tidyverse in the parentheses, so library(tidyverse). Again, you don't have to do that today. That will load ggplot2 as well as anything else that you might need. The data that we're gonna be working with today is data that we did not collect. Somebody else collected it, but it's publicly available, and we got it from Data Carpentry. So if you have any particular questions about how this data was collected or what the observations are, I can only sort of answer that, but we'll just think of this as an example data set that we can play around with. In short, they had different exclosures and enclosures, and they were capturing rodents and taking measurements on them. They captured the month of the observation, the day and the year, an identifier for the plots that they used, a species identifier, the sex of the animal, the hind foot length in millimeters, and the weight in grams. We'll use all of that throughout today. This was over many years, so we have a subset of the data, hopefully so that we don't have any computing issues on the server that we're using here at MSU.
It is possible that we could run into some slowdowns, and refreshing the page generally works, but we'll hopefully not encounter that. You don't need to run this line of code to get the data. But again, if you later wanted to take what we did today and re-implement it in RStudio, then you could run this line of code, which downloads the data from GitHub and saves it in an object called surveys. So if we run this, nothing is going to happen. You don't need to run it to have it work, okay? And after we run it, if we wanted to look at the data, we could just type its name in this, what we call a sandbox, and run that particular line of code with Control-Enter or Command-Enter, or you can just click Run Code, and we'll get a preview of the data. So we didn't see "parsed with column specification" this time. Sometimes we'll get messages and sometimes we won't. They've changed the Shiny app interface a little bit since we first did this a couple of years ago. But if we wanna look at the structure of the data, this was something from last time: str stands for structure, so we can just run str(surveys) and it'll tell us about all the variables that we have in the data set, how many observations there are, and a preview of each variable. How many observations we see in the preview depends on how long each value is. So for plot type, each string is long, so we only see the first four observations. Okay, so that'll be important later on. And we can preview the data. If you were in RStudio, you would do capital-V View(surveys), but we can just type in surveys to get a preview in the sandbox. All right, so now that we've got our data, we're gonna jump straight into the good stuff. But before we do that, a lot of what we're doing today is going to look similar to what we will talk about in more detail in the data wrangling workshop. That's going to be some time from now, so you'll probably need a refresher when we do that workshop.
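A minimal sketch of those inspection steps, for anyone following along outside the web app. The real data comes from the download line in the tutorial; here a tiny made-up data frame stands in for surveys, and the underscore column names (hindfoot_length, species_id, and so on) are an assumption about how the variables described above are spelled.

```r
# In RStudio you would first load the packages with: library(tidyverse)

# Hypothetical miniature stand-in for `surveys` (values are invented)
surveys <- data.frame(
  record_id       = 1:6,
  month           = c(7, 7, 8, 8, 9, 9),
  day             = c(16, 16, 19, 19, 13, 13),
  year            = 1977,
  plot_id         = c(2, 3, 2, 7, 3, 1),
  species_id      = c("NL", "NL", "DM", "DM", "PF", "PE"),
  sex             = c("M", "M", "F", "M", "F", "M"),
  hindfoot_length = c(32, 33, 37, 36, 14, 19),   # millimeters
  weight          = c(160, 155, 41, 43, 7, 20),  # grams
  plot_type       = c("Control", "Long-term Krat Exclosure", "Control",
                      "Rodent Exclosure", "Long-term Krat Exclosure",
                      "Spectab exclosure")
)

str(surveys)    # structure: one line per variable, its type, first values
names(surveys)  # just the variable names
surveys         # typing the name prints a preview of the rows
```

In RStudio, View(surveys) would open the same data in a spreadsheet-style viewer instead.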
But keep in mind that we're going to be doing some technical things, so if you have any questions and need me to slow down, just let me know. So ggplot is the function name that we'll use to create a plot. If we call ggplot, it's going to set up a template for a plot, and then we have to give it more instructions. The way that it does this is like an Adobe Illustrator document where you have layers upon layers. If you put everything on one layer, it's hard to isolate different features and make adjustments to them. So ggplot likes to have each feature on a different layer to make those adjustments more streamlined and intuitive, and then we can move the layers around to get different effects. So think about it as building layers for a particular visualization. The first thing that we can do is just set up the frame. So we do ggplot on our data: data is a parameter and our data's name is surveys, so data = surveys. If we run this, let's see what happens. Make sure you don't have anything highlighted when you go to run code, or it will only run what's highlighted. All right, do you see anything? No. Good, you shouldn't see anything. It's just telling R to expect something coming up, okay? There's actually a blank plot there, but we can't really see anything. So now we need to tell it what we wanna plot. We're just going to continue adding different parameters into this ggplot function. In addition to data = surveys, we're going to tell it what variables to plot using a mapping, and it's going to use a function called aes, which is short for aesthetic. That's going to specify what variables to plot and on which axes. So mapping = aes, parentheses, and on our x-axis we're gonna put weight and on our y-axis we're gonna put hind foot length. We don't need any quotes or anything like that. It knows those are variable names. And so let's see what happens when we run that line of code.
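As a rough sketch of those first two steps, assuming the column names are weight and hindfoot_length and using a small made-up data frame in place of the real surveys data:

```r
library(ggplot2)  # library(tidyverse) would attach this as well

# Hypothetical toy data standing in for `surveys`
surveys <- data.frame(
  weight          = c(160, 155, 41, 43, 7, 20),
  hindfoot_length = c(32, 33, 37, 36, 14, 19)
)

# Step 1: only the data. This draws an empty frame.
p_blank <- ggplot(data = surveys)

# Step 2: add the mapping. Now the axes appear, but still no data,
# because no geometry has been added yet.
p_axes <- ggplot(data = surveys,
                 mapping = aes(x = weight, y = hindfoot_length))

p_blank
p_axes
```

Note that the variable names inside aes() are unquoted; ggplot looks them up as columns of the data.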
All right, we can see something. It's still not very exciting, but now we've got axes, right? On our x-axis, weight ranges from zero to 200, maybe a little bit beyond, but we can only see 200. And on our y-axis we can see hind foot length, which ranges from zero to above 60. There's something going on there, but we can't see any data. That's because we haven't told it how we wanna plot the data. So next we need to say how we want to visualize the data, and we're gonna do that with geoms, short for geometries. There's lots of different kinds of geoms. There's geom_point to create scatter plots or dot plots, geom_boxplot for box plots, geom_bar for bar charts. You can also do geom_col for column charts. We'll talk about that later. geom_line for trend lines, time series, distributions. So if we wanna plot weight and hind foot length, both of those are quantitative variables, and with two quantitative variables, a good way to visualize that is with a scatter plot. So we can add in geom_point, which should give us a point for each observation in the data. All right, now we've got something going on. So to our blank frame, we added our axes. Then we added the points. And how did we add that? What's the symbol that we use to add? Plus. So after the ggplot function, which sets up the structure, we added a layer using the plus symbol to get the dots, the points. All right, we need to make sure that the plus sign is at the end of the previous line or in the middle of a line. We don't technically need to have this carriage return here, this new line. We could have it all on the same line and it would work, but it could make the code a little bit harder to read the more things we add on together. The only thing that you don't wanna do is have the plus sign at the start of a new line, okay? RStudio, or R, is really nice now in that it will actually detect that and give you an informative message. Error messages didn't always used to be as informative.
They're getting a lot better. And so it says "cannot use plus with a single argument. Did you accidentally put plus on a new line?" Yes, the plus is actually at the beginning of a line instead of at the end of a line. So if we move it back, now it should run and that error message should go away. Okay. All right, so we've already built the plot iteratively by adding two layers, the axes and then the points. Let's continue adding more information or changing how we're visualizing things. Some of it's going to be on the same layer. When we're building ggplots with ggplot2, it's an iterative process. We started by defining the dataset, laying out the axes, and then choosing a geometry. So let's go back and run this again, even though we've run the same plot several times now. Let's start modifying it. All right, what's one thing that we can do to help clear this up? Well, there's two big clusters of data points and we don't really know how dense those are. So maybe if we had a little bit of transparency, we would be able to see how many points are actually in those really super dark clusters. Adding transparency is called adding alpha to the points, and that will help with overplotting. So we're going to start by putting alpha inside the geom_point parentheses that we have, and we're going to start with an alpha of 0.2. Let's see what happens with that. It's still really dark, but now that's giving us more information: there's just a lot of points in these clusters here. We could reduce the alpha even more. And so now there's hardly any points. We can really only see the really dense clusters, so maybe that's not super informative. What would you do if you wanted to have no transparency? So completely opaque, what value would we use? One, yes. So one should look exactly like it would if we didn't have an alpha specified. On the other hand, if for some reason we wanted to completely hide all of these points, what would we use? Zero.
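Here is a small sketch of adding the point layer with transparency. The data is simulated to mimic two dense clusters (it is not the real surveys data), and the column names are assumed.

```r
library(ggplot2)

# Simulated stand-in for `surveys`: two dense clusters of points
set.seed(42)
surveys <- data.frame(
  weight          = c(rnorm(200, 40, 5), rnorm(200, 150, 25)),
  hindfoot_length = c(rnorm(200, 35, 2), rnorm(200, 32, 2))
)

# The + goes at the END of the line, never at the start of the next one
p <- ggplot(data = surveys,
            mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.2)  # 1 = fully opaque, 0 = invisible

p
```

Trying alpha = 0.05 or alpha = 1 in the same spot shows the trade-off described above between seeing density and seeing individual points.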
And believe it or not, there are reasons that you would want to set the alpha to zero for certain things. Let's put it back to 0.2. All right, black is nice, but a lot of journals will allow you to have color plots now, and when you're submitting homework assignments, you don't necessarily have to print them out anymore. In reports, color is always nice to make things pop. So let's change the color of these points from black to blue. We can do that by adding another parameter in the geom_point function. We'll just say color = "blue". If you were to do this in RStudio, RStudio actually colors the colors. So if you're in RStudio and you don't get a rectangle around blue, that means you're probably using a color that doesn't exist in its named colors. Unfortunately, in the Shiny app we can't see the nice blue box, but that's a nice little helper when you're in RStudio. All right, so this is really nice. You can play around with different colors. A lot of times I think that something should be a color, but it's not actually a named color. There's lots of different cheat sheets out there, and I think we'll come to a link for a cheat sheet that will help you get to lots of the named colors. Other things that we can change are the size of the point and the shape. The size of the point is its width in millimeters, and the shape of a point has five different options for plotting. We can give it an integer that represents different shape values, the same as in base R. You probably don't know those, but they're on cheat sheets. We can give it the name of the shape, such as circle or diamond. We can give it a single character, so we can plot letters like A, B, C instead of a point. If you do a period or a decimal, that will draw the smallest point that's visible, so it's about one pixel. And we can put an NA if we don't wanna draw anything and we don't want to set alpha to zero. So circle open means that you have a circle with nothing in the middle.
So just the outline of a circle. Diamond filled is actually a diamond that we can fill with a different color than the outline. So when it's diamond filled, that doesn't mean a solid diamond. It means that you can specify a different color for the inside. All right. And here's a reference for the different shapes and integers and characters. Oh yeah, that was the other note that I added, Sarah. Oh no. Okay, so coming to our first challenge. We wanna copy and paste the code from the previous code chunk and modify it to assign one of these aesthetics to geom_point. So change the shape, change the color, change the character, try having it be open or filled. So I'm gonna go up here. This is the first one where we actually have a solution, but it's just one way of answering this question. If we look at the solution, you can see that we went with shape = "diamond". But you don't have to do that. If we were to submit our answer here, it'll say, I did not expect your call to include color blue. So it's not super helpful, but it will give you some feedback. And again, this is just for playing around. We're not actually grading anything, so just try changing things. So I am going to change my shape to the letter G for Greta, and I'm gonna change the size to five. Let's see what happens with that. Because my alpha is still set to 0.2, the transparency applies even to the letters, and so you can just see a bunch of Gs. And let's try a filled diamond. Because we specified a color but not a fill, it's empty. We'll talk about fill here in a little bit. If we wanted solid diamonds, we would just say diamond. All right. Well, it's nice to be able to specify the data as a parameter in ggplot, but that kind of takes up a lot of space. So now we're gonna introduce a concept that's a little more sophisticated.
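The point-styling options from that first challenge can be sketched like this. The data is a made-up stand-in for surveys, and the "tomato" fill in the last plot is just an illustrative choice, not something from the tutorial's solution.

```r
library(ggplot2)

# Hypothetical toy data standing in for `surveys`
set.seed(1)
surveys <- data.frame(
  weight          = c(rnorm(100, 40, 5), rnorm(100, 150, 25)),
  hindfoot_length = c(rnorm(100, 35, 2), rnorm(100, 32, 2))
)

base <- ggplot(data = surveys,
               mapping = aes(x = weight, y = hindfoot_length))

# Fixed color for every point
p_blue <- base + geom_point(alpha = 0.2, color = "blue")

# A single character as the plotting symbol, drawn larger
p_letters <- base + geom_point(alpha = 0.2, color = "blue",
                               shape = "G", size = 5)

# A named shape: "diamond" is solid
p_diamond <- base + geom_point(color = "blue", shape = "diamond")

# "diamond filled" takes a separate fill color for its interior;
# without `fill` the inside is left empty
p_filled <- base + geom_point(color = "blue", fill = "tomato",
                              shape = "diamond filled")

p_letters
p_filled
```

Because these settings sit outside aes(), they are constants applied to every point rather than mappings from a variable.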
So this is something that we'll talk about in more detail in the data wrangling workshop, but we need it here because it really helps clean things up. We're gonna talk about piping data in. So we've got our data set, right? From our data set, we wanna create a plot, and then on that plot, we wanna create layers. Instead of having to write data equals blah all the time, we can just send the data into ggplot. We're gonna do this with a combination of symbols: a percent symbol, a greater-than symbol, and then a percent symbol again. Together, those three symbols are what we call a pipe operator. There is also a newer pipe operator in base R that's a vertical bar and the greater-than symbol. It works similarly to the percent-greater-than-percent pipe operator, but it's not part of the tidyverse. We're just gonna keep the tidyverse way of doing it because there are some things that this pipe does differently. We don't wanna talk about that right now. All right, so how this works is we start with the name of our data set and then we give it the pipe symbols. The percent-greater-than-percent sends the data in as the first parameter of the ggplot function, and so you'll notice that we don't need to have data equals there anymore. The next parameter is the aesthetics. This should get us exactly what we had before, but now we have this idea of a flow, starting with our data and flowing through to a visualization. And we technically don't need mapping equals. It will automatically know that aes is the mapping. All right, before we start adding more variables to our aesthetics, are there any questions? Anything online? Okay, okay. All right. Now let's look at coloring each species. We know we've got other information about these rodents. Let's think about coloring each species differently, because maybe some of those clusters are a particular species whose members are more similar in their weight and their hind foot length.
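A short sketch of the same plot written with and without the pipe, on made-up data. It assumes the magrittr package is installed to supply %>% (loading the tidyverse makes the same operator available).

```r
library(ggplot2)
library(magrittr)  # provides %>%; library(tidyverse) makes it available too

# Hypothetical toy data standing in for `surveys`
surveys <- data.frame(
  weight          = c(160, 155, 41, 43, 7, 20),
  hindfoot_length = c(32, 33, 37, 36, 14, 19)
)

# Without the pipe: data and mapping are named arguments
p_plain <- ggplot(data = surveys,
                  mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.2, color = "blue")

# With the pipe: `surveys` becomes the first argument of ggplot(),
# and `mapping =` can be dropped because aes() is the next argument
p_pipe <- surveys %>%
  ggplot(aes(x = weight, y = hindfoot_length)) +
  geom_point(alpha = 0.2, color = "blue")

# The base R pipe (R 4.1 or later) reads the same way:
#   surveys |> ggplot(aes(x = weight, y = hindfoot_length)) + ...

p_pipe
```

Both objects describe the same plot; the pipe only changes how the data reaches ggplot().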
And so what we're going to do is start the same way. We're going to take surveys and pipe it into ggplot. We're going to map weight to the x-axis and hind foot length to the y-axis. We're going to do a scatter plot with geom_point where alpha is still 0.2, but now, instead of just color equals blue, we're going to have another aes function, and in its parentheses we're going to have color equal to a variable, which happens to be species ID in this particular case. Let's see what happens when we run this. All right, now we get a different color for each species. Unfortunately, ggplot uses the same aesthetics in the legend as it does in the plot, so it uses the same alpha, which is probably not exactly what we want. We can override this with a complicated-looking function. Let me add it in here and we'll talk about it. So, plus, to add another layer. guides is for how we want the legend to look. We're going to change the color guide. Then we have to say guide_legend, because we just want it for the legend, and override.aes, that's override, period, aes, to override the aesthetics. And then a list, because we could have multiple things that we're overriding. We're overriding alpha, and in the legend we want alpha equal to one. Again, this is a complicated bit of code, but it's one that you just want to make a note of and come back to when you need it, when you're creating one of these plots. All right, so here we were setting the color in geom_point, but we could also set the color when we first specify the aesthetics. So we could just move that color equals species ID up to the first time we specified the aesthetics, and we should get the same thing that we got before, except that the alpha is still applied in the legend. When we specify it initially, it's going to be seen by any layer that we put in the plot. And sometimes we want that and sometimes we don't.
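A sketch of coloring by species and then fixing the legend, using simulated data with an assumed species_id column. For simplicity it passes the data directly to ggplot() rather than piping, which is equivalent.

```r
library(ggplot2)

# Simulated stand-in for `surveys` with three species codes
set.seed(1)
surveys <- data.frame(
  species_id      = rep(c("DM", "NL", "PF"), each = 60),
  weight          = c(rnorm(60, 43, 4), rnorm(60, 160, 30), rnorm(60, 8, 1)),
  hindfoot_length = c(rnorm(60, 36, 1.5), rnorm(60, 32, 2), rnorm(60, 16, 1))
)

p <- ggplot(surveys, aes(x = weight, y = hindfoot_length)) +
  # color is now inside aes(), so it is driven by a variable
  geom_point(aes(color = species_id), alpha = 0.2) +
  # draw the legend keys fully opaque even though the points are not
  guides(color = guide_legend(override.aes = list(alpha = 1)))

p
```

The list() in override.aes can hold several settings at once, for example list(alpha = 1, size = 3), if more than one legend property needs overriding.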
When it is in the initial ggplot aesthetics, we call that a global specification. And when it's in a particular geometry, a geom, we call that a local specification. Sometimes we want our aesthetics to be local, just for a particular geometry, and sometimes we want them to be global. We're gonna see this by adding another layer called a smoothing layer, so we're going to smooth out all of the points. We're gonna do one more thing. We're gonna change geom_point to geom_jitter, which adds a little bit of noise to all the points to help with the overplotting. And let's see what happens here when color equals species ID is specified globally. All right, when color is specified globally, so in the first aesthetics, we have one line for each cluster of points. Some of the points are really hard to see. There is actually a cluster of pink points here. There's just not very many of them. And some of the lines we can't really see very well just because of the screen, but you might be able to see them better on your own computers. But we have one smoothing line for each species. That's because the color is specified globally. If we were to specify the color just for the jitter, so just the dots, let's see what happens. So this is color specified locally for just the jitter geom. Now we're smoothing over all the points, just doing it once. The smoother doesn't know about the coloring because it came from a local specification. That's probably not what we wanted, right? So we wanna be careful in thinking about whether we want a global or a local specification. All right, challenge two. We won't take time for you to do this on your own, but you could go to the geom_point help file by clicking on this link or running ?geom_point to see what other aesthetics it takes. The idea is to map a new variable from the data set to another aesthetic in our plot. We'll do this together to keep it moving, but what I wanna do is go up here.
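To make the global versus local distinction concrete, here is a sketch on simulated data (column names assumed). The only difference between the two plots is where color = species_id is written.

```r
library(ggplot2)

# Simulated stand-in for `surveys` with three species codes
set.seed(1)
surveys <- data.frame(
  species_id      = rep(c("DM", "NL", "PF"), each = 60),
  weight          = c(rnorm(60, 43, 4), rnorm(60, 160, 30), rnorm(60, 8, 1)),
  hindfoot_length = c(rnorm(60, 36, 1.5), rnorm(60, 32, 2), rnorm(60, 16, 1))
)

# Global: every layer sees color = species_id,
# so geom_smooth() fits one curve per species
p_global <- ggplot(surveys, aes(x = weight, y = hindfoot_length,
                                color = species_id)) +
  geom_jitter(alpha = 0.2) +
  geom_smooth()

# Local: only the jittered points know about species_id,
# so geom_smooth() fits a single curve through all the points
p_local <- ggplot(surveys, aes(x = weight, y = hindfoot_length)) +
  geom_jitter(aes(color = species_id), alpha = 0.2) +
  geom_smooth()

p_global
p_local
```

geom_smooth() prints a message about the smoothing method it picked; that is informational, not an error.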
We've played with color, color equals species ID. What is another aesthetic that we have for points? Shape, okay. And let's see here. Let's go back and remember our variables. I'm just gonna do names(surveys). So we've got record ID, month, day, year, plot ID, species ID, sex, hind foot length, weight, date, day of week. Okay, so let's maybe do, let me say shape or size. Oh, let's do size. Okay, let's do different sizes for sex. So comma, size equals sex. Males and females should get different point sizes. All right, so we've got big dots and small dots, and we now get a new thing in our legend, which says that females get the small dots and males get the big dots. Any idea why females got the small dots and males got the big dots? Who got the bigger one? Males. They're coded just as one, two. Females got one, males got two, and that's because it's alphanumeric, right? R loves to alphabetize things. So the first alphabetically gets a point size of one, and the second alphabetically gets a point size of two. We could change the size based on species. Let's do plot ID. There's a lot more of those, so there should be considerably more different point sizes that we get. It's flowing off the space that we were allotted for this plot, but you can see that as the plot IDs get larger, the points get larger. All right, so using what you did, and I want you to actually do this on your own, remember, you can peek at the solution. We'll give you two minutes to try to get a scatter plot of weight over plot ID, with data from different plot types being shown in different colors. And is this a good way to show this type of data? Think about that question after you get your plot. So I guess we'll wait until 3:42. And raise your hand if you need help. We have so many helpers here. Getting set? All right, so here's the structure of how this should look.
We take our surveys data. Sorry, I've got to answer a question real quick. Yeah, yeah. It just recalls the data, you don't need to see it there. Yes, we had a question about why, again, we use the pipe. Is it just to call the data one time so you don't have to put it in again? Yes, and it also saves a little bit of space in the ggplot specification. And it's more consistent with the idea that we start with the data frame and then do something with it, which in this particular case is to get a plot, but we could also summarize it and get a table of values. It makes the code easier or more intuitive to read. Okay. Any other questions? Yeah. Okay, so what do we put on our x-axis? Plot ID. And what are we going to put on our y-axis? Great. And what are we coloring by? Plot ID? Again, we have plot type in here, and just plot ID. Sorry? Doesn't this have to be the variable for plot type? It is. Let's see here. Plot ID. Oh, yeah. Plot type. Perfect. Okay. So plot underscore type is what we want. We shouldn't have lost all of our. Instead of plot ID, we want plot underscore type. I've got too many close parentheses. There we go. Okay. So we have lots of different colors. Again, there's the Control plot type, Long-term Krat Exclosure, Rodent Exclosure, Short-term Krat Exclosure. Krat is kangaroo rat, I'm remembering now. And then Spectab exclosure. And we've got different plot IDs, and we can see that there's only a single plot type for each plot ID. Is this the best way to visualize this kind of data? That's a leading question. How about visualizing this with a box plot? We've got a lot of different plot IDs, so we're going to simplify it by just looking at species ID instead of plot ID. But we could do this with plot ID. All right. Everybody has hopefully seen a box plot at one point in their life, but just in case it's been a while, or maybe you haven't seen one:
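The two mappings from this stretch, size by sex and color by plot type, might look like the sketch below. The data is simulated and the column names plot_id, plot_type, and sex are assumed spellings.

```r
library(ggplot2)

# Simulated stand-in for `surveys`: four plots, each with one plot type
set.seed(2)
surveys <- data.frame(
  plot_id         = rep(1:4, each = 25),
  plot_type       = rep(c("Control", "Rodent Exclosure",
                          "Control", "Spectab exclosure"), each = 25),
  sex             = sample(c("F", "M"), 100, replace = TRUE),
  weight          = round(rnorm(100, 50, 15)),
  hindfoot_length = round(rnorm(100, 30, 5))
)

# Challenge two: map another variable to another aesthetic.
# "F" sorts before "M", so F gets the smaller point size.
p_size <- ggplot(surveys, aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(size = sex), alpha = 0.2)

# The timed challenge: weight over plot_id, colored by plot_type
p_plot <- ggplot(surveys, aes(x = plot_id, y = weight,
                              color = plot_type)) +
  geom_point()

p_size  # ggplot2 warns that size is not advised for a discrete variable
p_plot
```

In the second plot every plot_id column of points has a single color, which is the observation above that each plot has only one plot type.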
A box plot is a particular type of plot that was created back in the days when computing power didn't exist, or was in the human brain. It captures different kinds of information. The bottom of the box is the 25th percentile of the data. The top of the box is the 75th percentile of the data. The line in the middle of the box is the 50th percentile, so 50% of the data is below it and 50% is above. The height of the box is called the IQR, the interquartile range. If there are outliers, they're represented by dots, and an outlier is a point that's more than 1.5 times the height of the box away from the edge of the box. If there are no dots, the ends of the lines are the minimum value and the maximum value. So you don't need to know statistics to be able to read this. It's a nice summary of the data. We don't really like box plots anymore, because we have computing power that can visualize all of the data and not just these summary statistics, but we can make a box plot better by adding the actual data to the box plot. That will give us a better idea of the number of measurements in each box plot and their distribution, so that we can really get an idea of what's going on. So we're going to add another layer. We're going to start with the box plot that has species ID on the x-axis and weight on the y-axis. We're going to add another layer that's the points. Instead of doing geom_point, again, we're going to add a little bit of jitter with geom_jitter. We're going to color the points tomato, because tomato is a fun color to say. And let's see what happens here. So now we can see that for some of these species, the weights are really dense. There's not a lot of variability in the weights. But for a couple of species, in particular this NL, the weights are really spread out.
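A sketch of the box plot with the raw data layered on top, plus the 1.5 times IQR arithmetic worked out for one species. The data is simulated and the column names are assumed.

```r
library(ggplot2)

# Simulated stand-in for `surveys` with three species codes
set.seed(1)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PF"), each = 60),
  weight     = c(rnorm(60, 43, 4), rnorm(60, 160, 30), rnorm(60, 8, 1))
)

# Box plot first, jittered points as a second layer on top
p <- ggplot(surveys, aes(x = species_id, y = weight)) +
  geom_boxplot() +
  geom_jitter(color = "tomato")

p

# The outlier rule by hand for one species:
w      <- surveys$weight[surveys$species_id == "NL"]
q      <- quantile(w, c(0.25, 0.75))  # bottom and top of the box
iqr    <- IQR(w)                      # height of the box
fences <- c(q[1] - 1.5 * iqr, q[2] + 1.5 * iqr)

# Points outside the fences are the ones drawn as dots
w[w < fences[1] | w > fences[2]]
```

Note that geom_boxplot() draws its own outlier dots, and geom_jitter() then draws every observation again, so an outlier shows up twice: once in black from the box plot and once in tomato from the jitter.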
So there's a lot of variability in that particular species, whose name I don't remember off the top of my head. And even if I did, I probably wouldn't know how to pronounce it. So we're just going to call it NL. Not a biologist. Okay, challenge three. I'm going to help you with this one to keep it moving along, and I'll let you have some time to work coming up here. So box plots are useful summaries, but they hide details of the shape of the distribution. A better, or more modern, version of a box plot is called the violin plot, because it takes into account the density of the points and reflects it across an axis. So we get a distribution that could look like a violin, which is why it's called geom_violin. We're going to start with what we had before and just change the box plot to geom_violin. So I'm just going to highlight the box plot part. There's actually a code block down here that you could use if you wanted to submit your answer instead of just modifying it directly in that one spot. So some of these distributions are really spread out, so it just looks like a line. You can't really see any shape to it. For some of the other ones we can see a little bit of shape, so it sort of looks like an instrument for OL, but it's hard to see, again, because the weights are just so different between these species. Any questions about that? Some of the ones that are really dense, you can't see the line coming through on them. Yeah. So let's fix that. All right, we'll fix that here in a little bit. Instead of fixing it with the violin plot, we're going to fix it with the box plot. So what we want to do is look at a new plot. We want to look at the distribution of another variable within each species. So we're going to keep species ID the same, but we're going to change weight to a different variable, your choice, and instead of geom_violin, we're going to go back to geom_boxplot.
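Swapping the box plot for a violin is a one-word change, sketched here on simulated data with assumed column names:

```r
library(ggplot2)

# Simulated stand-in for `surveys` with three species codes
set.seed(1)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PF"), each = 60),
  weight     = c(rnorm(60, 43, 4), rnorm(60, 160, 30), rnorm(60, 8, 1))
)

# Same plot as the box plot version, with geom_violin() in its place
p <- ggplot(surveys, aes(x = species_id, y = weight)) +
  geom_violin() +
  geom_jitter(color = "tomato")

p
```

Because all species share one y-axis, a species with a narrow weight range collapses to something close to a line, which is the effect described above.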
The other part of this is that, this time, I'm going to tell you to overlay the box plot layer on top of the jitter layer. So think about how you would change this to make sure that the box plots are on top of the points. I'll give you just a minute to do that. All right. What should I change weight to? Yep. Okay. So we'll get to that. But we're going to do a new plot with the distribution of a different variable. So what variable should I use? Okay. If you used something else, that's fine. And then we're going to do what you said. I am just highlighting and clicking and dragging so that I have the jitter first and then the box plot. And let's see what happens when we run that. Now we can actually see the boxes over the points, and it makes a little bit more sense. You could change those to violins as well. And for hind foot length, the violins shouldn't be as super crazy and stretched out as they were for weight. For the next challenge, let's add color to the data points. Instead of just having everything be tomato, let's add color according to the plot from which the sample was taken. So plot ID. I'm just going to copy this code down to challenge three, and we're going to think about how to change it so that the points are colored based on plot ID. I'm going to help you through this. Should we do local or global aesthetics? Local. We just want the points to be colored based on the plot ID. So I'm going to add the aes function for aesthetics. We want to specify the color, and plot ID is the variable. Is there a problem with this plot? The answer could be that there's no problem. Maybe you'd do something else to make it more visible. Yeah. So right now it's on a gradient, right? On a scale from really dark blue to light blue. Why is it on a gradient? It sees the plot IDs as numbers, so it thinks it's a continuous variable. And so we really need to tell it that it's not a continuous variable.
So this is where, if you were part of the workshop last time, we talked about factor variables, and why we need factor variables and when we don't. This is one case where we do: a factor variable tells R that a number is actually representing a category. So if we put the factor function around plot ID, now it treats the plot IDs as distinct values. And again, if we changed the guide, we would be able to see the colors a little bit more clearly. But there are just lots of different colors because a lot of the species were found in every plot. So why violin plots? The five-number summary that's visualized in a box plot can really hide different shapes of data. The same box plot exists for all five of these different plots, but you can see that the data is distributed very differently. With the violin plots, the shapes change. And you can see that a violin plot helps you visualize or pick up on structure in your data that can be hidden by a box plot. Also, plotting the points helps pick up on some of the different ways that the data is distributed. So you can look at that code a little bit more. The code is shown just so that you know how we generated these plots. But the moral of the story is that we really like violin plots. All right. So we talked about scatter plots, with two quantitative variables. We did a quantitative variable and a categorical variable. Let's talk about just a single variable. We just have a quantitative variable. How would we plot that? Well, if we just have a single variable, weight, maybe we want a density plot. So if we say geom_density, let's see what happens there. So I just had to run that twice. So what it does is it looks at all the weights, calculates a density of how common the observations are, and creates a smooth curve. This is half of a violin plot. A violin plot has a density reflected over an axis. But if we only have a single variable, we can just do a geom_density.
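A short sketch of the two ideas in this passage: wrapping a numeric ID in `factor()` so it gets distinct colors, and plotting a single quantitative variable with `geom_density()`. The data frame and its column names are assumptions made for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
set.seed(42)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), each = 60),
  weight     = c(rnorm(60, 43, 5), rnorm(60, 160, 40), rnorm(60, 17, 3)),
  plot_id    = rep(1:6, times = 30)
)

# factor() tells ggplot2 to treat the numeric IDs as categories,
# so each plot gets its own color instead of a point on a gradient
p_factor <- ggplot(surveys, aes(x = species_id, y = weight)) +
  geom_jitter(aes(color = factor(plot_id)))

# A single quantitative variable: a smoothed density curve
p_density <- ggplot(surveys, aes(x = weight)) +
  geom_density()
```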
If we want to color below it to get an idea of the area below the curve, then we can fill the density. Here we're going to use the color sky blue. And we get a very pretty plot that gives a sense of the area under that curve. Another way of visualizing a single quantitative variable is a histogram. And the name of the geom is just geom_histogram. We get a message after we do a geom_histogram. It says stat_bin using bins equals 30, pick a better value with binwidth. The author of the tidyverse package does not like unintuitive default values. And he really stuck his flag in the ground with histograms always printing out this message. Really what this is doing is making sure that you look at this plot and figure out if you like the number of bins that are there. So what we're going to do is play around with the number of bins in this histogram. So we're going to copy this code and go to the next challenge. And the parameter that we're going to play around with is binwidth or bins. You can do either; you don't want to do both. So bins is the number of bins. Binwidth is the width of each bin. So play around with the different values, go big, go small, and see what happens. So while you're playing, I'm going to play as well. Maybe they're all talking. All right. So that brings us to the end of quantitative variables. So let's go back to the time series data. Should we take five minutes to get water, potty break? Yeah. Okay. And then we'll come back and we'll start with bar charts. I'm at 4:05. All right, we're going to get started here in just a few seconds when it hits 4:05. Oh, and it just hit 4:05. Okay. So we're going to start with the categorical variable. So again, we're working with a single variable. If we have a single quantitative variable, we can get a histogram or a density plot. For a single categorical variable, we would visualize it with a bar chart. So the geometry is just geom_bar.
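The filled density and the two histogram parameters can be sketched like this, on made-up data with an assumed `weight` column. The specific values 15 and 5 are arbitrary picks for experimenting.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column name assumed)
set.seed(42)
surveys <- data.frame(weight = c(rnorm(100, 43, 5), rnorm(100, 160, 40)))

# Fill the area under the density curve with a fixed color
p_density <- ggplot(surveys, aes(x = weight)) +
  geom_density(fill = "skyblue")

# Set EITHER the number of bins ...
p_bins <- ggplot(surveys, aes(x = weight)) +
  geom_histogram(bins = 15)

# ... OR the width of each bin (in the units of weight), not both
p_width <- ggplot(surveys, aes(x = weight)) +
  geom_histogram(binwidth = 5)
```

Setting either parameter explicitly also silences the default-bins message mentioned here.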
And we're going to work with species ID just because there are more than two levels. With sex, there's just males and females. And it's not as many as plot ID. So let's see what this looks like. And let's actually take a little bit of time to think about what this is plotting. So we have the different species on the x axis. And on the y axis is count. We didn't really talk about this with the density. But what it's doing here is creating a new variable that didn't exist in our data set. We didn't have a count variable in surveys. So on the back end, it's doing some statistics to count the number of observations so that it can plot it. We can exploit that and change it. We'll get into that more in data wrangling. We'll just touch on it a little bit now. Box plots also do some statistics, right? They're finding the median, the maximum, the minimum, the quartiles. And you could extract that information out. Smoothers create some sort of smooth curve, generally using a GAM, a generalized additive model. So I guess the point is that ggplot is doing a lot of powerful things that you don't necessarily know about, but that you are specifying when you specify the geometry that you want to visualize. So here, we could specify the statistic directly. And before we do that, let's think about what a bar chart is doing. The geom_bar first looks at the entire data frame, then it transforms the data using the count statistic. So it's counting the number of observations for each level of the categorical variable. Then it returns a summarized data frame with the number of observations, the rows, associated with each species. And then the geom_bar uses that summary data frame to plot the levels of species on the x-axis and the counts on the y-axis. So we could do the same thing, but instead of saying the geometry, geom_bar, we could say stat_count.
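A small sketch of the point that `geom_bar()` and `stat_count()` produce the same summarized data. The data frame is invented, and `layer_data()` is used here only to peek at the counts ggplot2 computes behind the scenes.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column name assumed)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), times = c(50, 20, 30))
)

# Geometry-first: geom_bar() uses the count statistic by default
p_geom <- ggplot(surveys, aes(x = species_id)) + geom_bar()

# Statistic-first: stat_count() draws bars by default
p_stat <- ggplot(surveys, aes(x = species_id)) + stat_count()

# The summarized data behind each plot: one row per species with a count
counts_geom <- layer_data(p_geom, 1)$count
counts_stat <- layer_data(p_stat, 1)$count
```

Both vectors hold the same per-species counts that `table(surveys$species_id)` would give.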
And that way we're specifying the statistic, how we want the data summarized, but it's still in the ggplot context, so it's not giving us a table, it's giving us a bar chart. So this should look exactly the same as geom_bar. Generally, we like to think in terms of geometries, but it's good to know that every geometry has a corresponding statistic. Now, in a bar chart, often instead of looking at the counts, what do we sometimes want? Proportion. Okay, so it might yell at us for this, let's see, we're going to change the y-axis. So it's an aesthetic. The y-axis is now going to come from the stat. We're going to use the statistic's computed value, and we're going to say prop for proportion. And we're also going to have this mysterious group equals one, and we'll talk about that here in just a second. If we run that: stat(prop) was deprecated in ggplot2 3.4.0, please use after_stat(prop) instead. It still will run, even though it gives you this nice message that says update your code. So I need to remember to update our code. And now what it does is, instead of counts, it does proportions out of the total. So the next thing we're going to explore is why we need this group equals one parameter in this code. This is a thinking exercise. So we're actually just going to run this code and then think about what's going on. Okay, so I changed stat to after_stat. And that made the deprecation message go away. But this obviously looks different than what we had before. What is the group equals one doing, or what's wrong with this plot? So if we don't specify group equals one, it's doing a proportion within each group. Each species is its own group, so each bar is a proportion of one. So we have to say group equals one so that the proportion is out of the total. And I was just checking to make sure that it's inside the aesthetic, which is where we want to specify it.
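The effect of `group = 1` can be sketched on a tiny invented data set. The column name `species_id` is an assumption, and the counts are chosen so the proportions are easy to check by hand.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column name assumed)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), times = c(50, 20, 30))
)

# With group = 1 all rows form one group, so prop is a share of the total
p_total <- ggplot(surveys, aes(x = species_id, y = after_stat(prop), group = 1)) +
  geom_bar()

# Without it, each species is its own group, so every bar has prop = 1
p_each <- ggplot(surveys, aes(x = species_id, y = after_stat(prop))) +
  geom_bar()

props_total <- layer_data(p_total, 1)$y   # bar heights: shares of the total
props_each  <- layer_data(p_each, 1)$y    # bar heights: all equal to 1
```

With 50, 20, and 30 rows out of 100, the grouped version gives bars of height 0.5, 0.2, and 0.3.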
So we should have two closing parentheses at the end. We put it in the right place. And this is what we really want. We want the proportion out of the total, not a proportion out of itself. All right. We like colors, right? These bar charts are nice, but they're not very colorful. They're kind of boring. Maybe they make us go to sleep. So let's explore adding colors. Let's start out by just changing the color of these bars. All right. Our naive guess is that we just specify the color of the bar. And let's color based on the species ID, which is the same variable that we're using for the x axis. Let's see what happens here when we do our best guess of how to color the bars. That's not exactly what we wanted. What happened? What did we color? The outline. Yes. We just colored the outline. So how do we color the inside of the bars? What is a synonym for inside? Color inside. Fill. Yes. So let's see what happens if we change color to fill. And again, I'm putting it in the wrong spot. That's okay. You just can't check your answer. Ah, that's what we want. If we wanted, we could do both. Let's say we want a nice black outline, but we want the inside to be colored. So the black color doesn't want to be in the aesthetic. It wants to be here, outside of it. Just like with the filled diamond shape, we have both an outside and an inside. So now we have nice bars that are colorful and pop out and really communicate the number of observations in each species. Yes. And then it'll just be local. If we put it here, though, we do need to put it in the aesthetics. So it is local. And then if we added another layer, it would have just a black color. Or we could change the color to something else. Great question. Any other questions? All right. What about stacked bars? This is on the single variables page, but let's add another variable. Let's have a stacked bar chart.
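As a hedged illustration of outline versus interior: `color` controls the bar outline and `fill` the inside, and a fixed color goes outside `aes()` while a data-driven one goes inside it. The data frame below is invented.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column name assumed)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), times = c(50, 20, 30))
)

# The naive guess: mapping color only changes the outline of each bar
p_outline <- ggplot(surveys, aes(x = species_id)) +
  geom_bar(aes(color = species_id))

# fill is mapped to data (inside aes); the black outline is a fixed
# setting for this layer only (outside aes)
p_filled <- ggplot(surveys, aes(x = species_id)) +
  geom_bar(aes(fill = species_id), color = "black")
```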
So let's look at the counts of each species. But we also want to separate them into male and female, so how many males and females of each species we have. So we want a bar chart with two categorical variables. We could have the bars stacked within each species, or we can have side-by-side bars. So what we're going to do is, instead of specifying the same variable for fill, we're going to specify a different variable. And the difference here is that the top one doesn't have any parameters in the geom_bar. The second one sets the position to dodge. Let's see what happens when we do this. Again, the top one is stacked bars. So we can see within a particular species how many males and females there are. These are actual counts and not proportions. And in the second one, they're dodged. So they're side by side. So this is one of those things where you just have to know that dodge is the terminology for side by side. Yes, you could definitely do different colors instead of red and blue. We'll talk about customization here. Yeah. That's a little bit more complicated. So we'll do that on the customization page. The other way we could do it, instead of dodge, is fill, so let's see what happens if we say fill instead of dodge. What changed here? Instead of dodging or stacking, we filled. And so we're looking at the proportion out of the total for each species. Because it makes sense to have males out of the total for each species, we don't need to say group equals one. The other thing to note is that the y axis label is wrong. The values are correct, zero to one, but the label is count instead of proportion. It should automatically change, but it doesn't when you specify the position to be fill. We'll change labels on the customization page. Any other questions about single variables?
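The three bar positions can be compared side by side in a sketch like this one. The data and the column names `species_id` and `sex` are assumptions for illustration.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), times = c(50, 20, 30)),
  sex        = rep(c("F", "M"), times = 50)
)

base <- ggplot(surveys, aes(x = species_id, fill = sex))

p_stack <- base + geom_bar()                    # default: stacked counts
p_dodge <- base + geom_bar(position = "dodge")  # side-by-side counts
p_fill  <- base + geom_bar(position = "fill")   # within-species proportions

# With position = "fill" every stack is rescaled to a total height of 1
stack_tops <- tapply(layer_data(p_fill, 1)$ymax, layer_data(p_fill, 1)$x, max)
```

The y axis of `p_fill` still reads "count" by default, as noted here; adding something like `labs(y = "Proportion")` would relabel it.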
Time series data sounds very scary and complicated. We're going to talk about it because there are some weird artifacts that happen if we don't specify the right type of geometry in the right way. So we're going to find some problems. We're going to manipulate some data and do some summarization. And then we'll be able to talk about the solutions to the problems that we see. In data wrangling, we'll go into this in more detail. But to get a time series, we could just look at the data over years, say the weights. But let's say that we want to actually get yearly counts of each genus. So we're going to take the surveys data and pipe it into the count function. We're going to give it two variables, year and genus, and then we're going to save it, this is the assignment arrow, into yearly counts, and then we'll visualize yearly counts. First I'm just printing it out so that we can see a summary. So in 1996 in this genus, there were 328. In 1996, Neotoma only had six observations. We can go to the last page. In 2002, Sigmodon had nine observations. And now we're going to plot this. So hopefully, without going into too many details about how to count things, we can just work with those counts. Let's start again with our best guess. We want to plot the number of observations over the years. So that sounds like a line. So let's look at a geom_line. On the x axis, we want year, since it's a time series. And on the y axis, we want n, which is the variable name that was assigned to the count when we used the count function. Let's see what happens here. So this is probably not what you were expecting. This is a zigzag line, right? This is because we have multiple genera in each year. In this table, we have this genus column that we're not accounting for in our plot. So it's plotting multiple values of n in each year and then trying to connect all those dots together. And it's having problems.
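A minimal sketch of the counting step and the naive line plot, assuming dplyr is available. The `surveys` rows are randomly generated stand-ins, with assumed column names `year` and `genus`.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
set.seed(7)
surveys <- data.frame(
  year  = sample(1996:2002, 300, replace = TRUE),
  genus = sample(c("Dipodomys", "Neotoma", "Sigmodon"), 300, replace = TRUE)
)

# One row per year-genus combination; count() names the new column n
yearly_counts <- surveys %>%
  count(year, genus)

# Naive line plot: genus is ignored, so several n values per year get
# joined into a single zigzag line
p_zigzag <- ggplot(yearly_counts, aes(x = year, y = n)) +
  geom_line()
```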
Yes. Under year, it says dbl. What does that say? Double. So it's a number. It's coded as a number. And yes, it could be an integer, but it just defaults to double. What does double mean? It's a number that can also take on decimal values. It's a data storage, computer science thing: it uses more bits in order to store that information. It's not about how many digits the year has. It's been too long since I had computer science to talk about why it goes there. But you notice that the count is an integer. I think that's because it's creating the count, so it knows that it has to be an integer. Whereas year it just reads in, and when it imports year, it sees a numeric value and stores it as a double. And then chr is character. Okay. Good question. All right. So we're going to add another aesthetic. This time we're going to specify that we want it to group by something. We're going to group by a column in that summary table, and that column is genus. So let's see what happens when we do this. Oops. And I highlighted. Okay. I went a really long time before hitting the highlight bug: if you highlight something and try to run it, it just runs what you have highlighted. All right. So this is good, right? Now we have one line for each genus. But there's a line that has a sharp corner there. It goes down and then has a sharp corner and goes up. Or does it continue down and then go up? So what do we need to do? Add color. The color parameter is actually grouping and then coloring. So if you specify group, you get the same effect, but it's all black. No color. So we're just going to change group to color. And now we can see that this genus here actually continued on down and then went back up. And then this other one was the one that has the zigzag pattern.
And now we can actually tell what's going on a little bit better. We don't need to specify both color and group because, again, color is first grouping and then picking colors, in alphabetical order. Any questions about that? All right. Well, what if we want to look at another dimension? Or what if, instead of having all of these genera on the same plot, we want them each to have their own plot? That's called faceting. So we have plots that have the same axes, but have different data in each of them. And there are two ways to do faceting. We can facet_wrap, and it'll pick how many rows and how many columns of plots we have based on how many items there are in what we're faceting by. With facet_grid, we can specifically say we want this variable to be in rows and this variable to be in columns. So facet_wrap is good if you have one variable and you don't really care how it wraps around, and facet_grid is good when you have two variables. There are a couple of different ways to specify faceting. I am going to use the tilde notation. We haven't used tilde yet today. Some people say squiggle. And those of you that are online, I am really sorry. I'm holding up a keyboard and you won't be able to see that. Tilde is shift plus the key next to the one. So that's the squiggly line. And we're going to use that for a formula notation. You would use that if you were to write a linear model, or have any other kind of relationship. And so we're going to use that here, and we'll read it as the word by. So we're going to facet by our facet variable. If you do facet_grid, we'll have a row variable by a column variable. So let's facet by genus. So we're just going to add another layer.
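Going back to the zigzag fix for a moment, the difference between `group` and `color` can be sketched as follows. The summary table is generated directly here as a hypothetical stand-in for `yearly_counts`.

```r
library(ggplot2)

# Hypothetical stand-in for the yearly_counts summary (year, genus, n)
set.seed(7)
yearly_counts <- expand.grid(
  year  = 1996:2002,
  genus = c("Dipodomys", "Neotoma", "Sigmodon")
)
yearly_counts$n <- rpois(nrow(yearly_counts), lambda = 50)

# group: one line per genus, but every line is drawn in the same color
p_group <- ggplot(yearly_counts, aes(x = year, y = n, group = genus)) +
  geom_line()

# color: groups by genus AND gives each genus its own color
p_color <- ggplot(yearly_counts, aes(x = year, y = n, color = genus)) +
  geom_line()
```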
We're not coloring by genus anymore. So we took that out. And I'm going to actually highlight. This time I intentionally highlighted and ran. It's back to that zigzag, which we don't like. And instead of coloring, we're going to have one plot for each genus. And so facet_wrap, facets equals tilde genus. And they're all going to be black, but now they're going to be in separate plots. Notice facets fix the axes to be the same. So all of the x axes go from 1996 to 2002. All of the y axes go from zero to 1200. And the x axes are only plotted on the bottommost plots. The y axis is only on the leftmost plots. So that condenses the information. And you just know that it's shared for the inner plots. What if we also wanted to group by another variable, let's say sex? Okay, so we're going to go back to our summarization table. And we're going to add in a third variable. So we're going to count by year, species ID this time instead of genus, and sex. And then we'll look at that table. So now we have four columns: year, species ID, sex, and n. And we can see that in 1996 there were 188 DM females and 296 DM males. And then we can go to the last page. And we can see that in 2002 SH females had two and SH males had seven. And so there's some ordering going on there. It's ordering based on year, then species, then sex, and then the counts. We'll talk about that in the data wrangling workshop. All right, but let's visualize this. So this time let's color based on sex and only facet based on species ID. So we're just going to do a facet_wrap. So now we can directly compare, in the same plot, the counts of the males and the females for each species. And we can see that some of the species have more variability between males and females, but for a lot of them the male and female counts are pretty similar in these plots.
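A sketch of the faceted version with sex as the color, assuming dplyr for the counting step. The data are invented, and `species_id` and `sex` are assumed column names.

```r
library(dplyr)
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
set.seed(7)
surveys <- data.frame(
  year       = sample(1996:2002, 600, replace = TRUE),
  species_id = sample(c("DM", "NL", "SH"), 600, replace = TRUE),
  sex        = sample(c("F", "M"), 600, replace = TRUE)
)

# Count by three variables: one row per year-species-sex combination
yearly_sex_counts <- surveys %>%
  count(year, species_id, sex)

# One panel per species, with the two sexes as colored lines inside each panel
p_facets <- ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
  geom_line() +
  facet_wrap(facets = ~ species_id)
```

By default the panels share the same x and y ranges, which is what makes the direct comparison across species possible.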
But if we wanted to pull them out so that there were separate plots for males and females, well, we're going to work towards that. Okay, so we're still going to keep coloring by males and females within the plot. But let's say that we want to specify that the species are all going to be in a single column. So in facet_wrap, we can use the parameter ncol for number of columns. We can say that's equal to one. This should give us a very tall grid of plots that are very squished, because it needs to try to get them all in one space. But we could then compare all the species, if maybe we need more space along the x axis. We can go the other way around. We could say we want all of the plots in a single row. So nrow is the number of rows. And with this one, we can see some differences in the species because we have more height. But it's pretty hard to read because there are so many columns. We're setting this up so that we can think about how to use facet_grid. So now what we're going to do is we're going to have a grid of plots. We're going to have sex specify the rows and species ID specify the columns. We're still going to change the color by sex even though they're going to be in separate plots. So all the plots on the top are going to be in red and all the plots on the bottom are going to be in blue. We can still compare males and females, but maybe the differences between them are less obvious. Any questions about that? Here's a challenge. I want you to work on this. Use what you learned to create a plot that depicts how the average weight of each species changes through the years. Play around with which variable you facet by versus plot by. We're introducing a new summarization, so we're not going to make you do that yourself. So here we're grouping by year and species, and then we're summarizing to get the average weight for each species. So you don't have to worry about doing that. It's created for you.
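The layout options mentioned here, `ncol`, `nrow`, and `facet_grid`, can be sketched on a made-up summary table. The table and its column names are assumptions standing in for the tutorial's counts by year, species, and sex.

```r
library(ggplot2)

# Hypothetical stand-in for the counts by year, species_id, and sex
set.seed(7)
yearly_sex_counts <- expand.grid(
  year       = 1996:2002,
  species_id = c("DM", "NL", "SH"),
  sex        = c("F", "M")
)
yearly_sex_counts$n <- rpois(nrow(yearly_sex_counts), lambda = 40)

base <- ggplot(yearly_sex_counts, aes(x = year, y = n, color = sex)) +
  geom_line()

p_one_col <- base + facet_wrap(~ species_id, ncol = 1)  # tall: one column
p_one_row <- base + facet_wrap(~ species_id, nrow = 1)  # wide: one row

# Row variable ~ column variable: sex in rows, species in columns
p_grid <- base + facet_grid(sex ~ species_id)
```

With two sexes and three species, `p_grid` has six panels, one line in each.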
But you want to change this code down here so that you're plotting the average weight of each species over the years. And while you're working on that, Sarah is going to come up and talk about how to make your pretty plots even prettier. Yeah, let's give you a couple of minutes just to try challenge seven. Yep. I suppose I should help you out with the answer. What time did you say we were going to go to? To five, right? Yeah. But maybe like 4:15 and then we'll do the. No, no, no. How long did you say that they could have for challenge seven? Oh, just a couple of minutes. Okay. Okay. So just run. Yeah. So there's something wrong with that. Oh, okay. I don't have access to it. Yeah. Oh, okay. It's over this. Oh, yeah. Sorry about that. So, let's go ahead and answer this. So, I have the data summarized down here. We have a start to the plot, but we have to fix it. So, x equals year. Year is a variable in our summarized data, so that's fine. It says y equals n, but we don't have an n variable. Instead, we have average weight. And we want to color by species ID. Then we can play around with what we facet by. So, here we're faceting by species ID. We could facet by year. Would we want to even try to facet by weight? Probably not, since it's a quantitative variable and not categorical. Let's see what our solution here was. So, here we're not faceting in the first plot, and then in the second plot, we're faceting by species. Any questions about challenge seven? All right, Sarah. Cool. So, what you were asking about, changing the colors for male and female on that bar plot that we had, this is kind of a way to start doing it. So, there are these themes that you can load into ggplot. Usually it just prints like a black color with a white background.
That's like the standard, but you can change the themes. So, actually, black with a gray background is the standard, but we can change it to black and white. So, on yours, you'd probably be able to see that it has cleared out the background. It doesn't have that gray anymore. And so, this is just yet another layer on your ggplot, where you say theme_bw, for black and white. So, you can try that. And then there's a bunch of other themes. We have links to, I don't know this computer so well. We have links to cheat sheets. So, if you go to this link here, you can see there's a bunch of different types of themes that you can access. You can try the minimal theme. Let's try that one. You can try the light theme. There we go. There's the minimal. So, that takes many of the different lines away. So, you can play around with that. Actually, we're going to ask you to do so. You can take that plot that you made in challenge seven and add different plotting backgrounds using the themes, if you like. You can figure that out, I'm sure. And then we'll do customization. So, click to the next section. Here's another cheat sheet. I'm going to open that in a new tab and hopefully it doesn't take too long to load. Those of you online, you can click in and see it yourself. It's this link at the top. But basically, this gives you a bunch of different ideas for ggplot. So, lots of the different pieces of code that we've gone through are going to be available there. But let's talk about plot labels. So, if you want to change the names of your axes to something more informative than just the term you were using in your plot, like year and n, you can use this labs code. So, here's the labs function. So, what we've done here is we've used labs, and then in parentheses, you can add a title in quotes, and you could add an axis label. So, here we've written year of observation.
You can have the label for the y axis, and you could even create a label for your key over here. So, this just makes it more human readable as you make your plots. And then also, you can use this backslash n inside your title if you want to make a line break. So, if you have a really long title, maybe you want to create a little line break in there: backslash n. There we go. So, if you need to wrap text, you can use that backslash n. And then you can also change fonts. So, here's another example. Well, here you may have to install this extra font package, which would allow you to do that. You can do that in RStudio, but there are also some fonts that are just included. So, here in this code we're changing the text size by putting theme and then text, with the text size set to 16. So, then we can change it. So, it looks bigger, but these are just ways to adapt your plot so that it looks good. And then try again. You can see here that, because we made the text bigger, we're having more overlap on these numbers at the bottom. And so, you could swap the orientation so that, let's see, try this so that we can see them better. So, here we go. We put the year on the y-axis and the number of rodents on the x-axis so that you can see these labels better. But also you can specify the text size independently. So, when we said text size 16, it caused all of this to get bigger. So, we can instead, see here in the theme function, specify specifically that the x-axis text is 10, and the y-axis text you could make a little bigger. So, everything is super customizable in ggplot. Oops. This is where you can really have some fun with your plot. And then our next one is about the legend position. So, right here we have it on the right, but maybe you want it to be somewhere different. So, you can add this legend.position to the theme, and in this case, we're putting it at the top. So, let's try that. And let's try left. There you go.
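The label, text-size, and legend settings walked through here can be combined in one sketch. The summary table and all label wording are invented for illustration; only `labs()`, `theme_bw()`, `theme()`, and `element_text()` are the ggplot2 pieces being shown.

```r
library(ggplot2)

# Hypothetical stand-in for the yearly_counts summary (year, genus, n)
set.seed(7)
yearly_counts <- expand.grid(
  year  = 1996:2002,
  genus = c("Dipodomys", "Neotoma", "Sigmodon")
)
yearly_counts$n <- rpois(nrow(yearly_counts), lambda = 50)

p_custom <- ggplot(yearly_counts, aes(x = year, y = n, color = genus)) +
  geom_line() +
  labs(
    title = "Observed genera through time,\nwrapped onto a second line",  # \n = line break
    x     = "Year of observation",
    y     = "Number of individuals",
    color = "Genus"                       # title of the legend (the key)
  ) +
  theme_bw() +                            # black-and-white built-in theme
  theme(
    text            = element_text(size = 16),  # base size for all text
    axis.text.x     = element_text(size = 10),  # override just the x tick labels
    legend.position = "top"                     # or "left", "bottom", "right"
  )
```

The order matters in one respect: `theme_bw()` comes before `theme()`, because a complete theme added afterwards would reset the individual tweaks.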
Many of these things will be on that cheat sheet, if you're trying to do this on your own. And then, on my screen, you can't see any grid lines, but maybe on yours you can. So, we're going to try to remove them. So, by default, ggplot contains both major and minor grid lines, but that might look busy to you and you might want to remove those. So, you can do that here by using this axis.line and the grid options, like panel.grid.minor and panel.grid.major. So, oops, you can see here how this is working. So, we've got the major grid lines and everything else much lighter. So, you can play around with how you like your grid to look. And here's a breakdown of all of these. So, the axis.line option, that says what color the x and y axes are. So, you can change it to a different color if you like. There we go. Changed it to red. We've got panel.grid.major, which removes the major grid, and minor removes the minor grid. So, those are the lines between the x and y axis ticks. The border, the background, all of that. Then, you can actually create your own color scheme. So, this is where we would change those default colors that we were seeing in the male and female plots earlier. So, here we have created a couple of different palettes for you. So, you can use hex codes, which is what we've done here, and there are tons of these palettes available online as well that you can just add in yourself. But if you run this code, you can see here that we've used this color-deficiency-friendly palette. There's one with gray, there's one with black. And so, by adding that in and using this color scale, that scale_color_manual, and adding these values, you can see here we've created a box plot that has custom colors. So, you can play around with this on your own, and there are also some packages that will automatically bring more colors into your ggplot.
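A hedged sketch of the grid-line and palette customization. The hex codes are the widely circulated colorblind-friendly palette with gray as its first entry, which may or may not be the exact palette used in the tutorial; the data and the name `cb_palette` are assumptions.

```r
library(ggplot2)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
set.seed(42)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), each = 50),
  weight     = c(rnorm(50, 43, 5), rnorm(50, 160, 40), rnorm(50, 17, 3))
)

# Colorblind-friendly hex codes (gray-first variant); assumed, not from the source
cb_palette <- c("#999999", "#E69F00", "#56B4E9", "#009E73",
                "#F0E442", "#0072B2", "#D55E00", "#CC79A7")

p_palette <- ggplot(surveys, aes(x = species_id, y = weight, color = species_id)) +
  geom_boxplot() +
  scale_color_manual(values = cb_palette) +   # colors are used in palette order
  theme(
    axis.line        = element_line(color = "red"),  # color of the x and y axis lines
    panel.grid.major = element_blank(),              # remove major grid lines
    panel.grid.minor = element_blank()               # remove minor grid lines
  )
```

If the plot maps `fill` rather than `color`, the matching function would be `scale_fill_manual()`.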
You've got RColorBrewer, and there it is, and ggsci. So, you can try installing those packages too and playing around with colors. So, let's do challenge nine. Let's take another five minutes, three to five minutes, to either improve one of the plots that you made before or create a new plot and try some of those different colors. So, this is your play time at the end of the workshop. And we also put a link to that cheat sheet again if you want to look for inspiration. I'm getting like an error for the cheat sheet. We didn't check our links before. At least the top one was working. So, if you scroll all the way to the top, that one's working. Sorry about that. Yeah. Don't click the one under challenge nine. Click the one at the very top of this page under customization. I know. So, you can go through the tutorial on your own too, and we have a video posted online of our last workshop. Maybe you could see the beginning as well. Yeah. So, try adding that as a layer. Like, you could copy that. Yes. Right. You can do that. You can specify the specific color, right? We had a question about the colors. Like, if you wanted to change the female to blue and the male to orange, you could either go by order or you can add a color variable in your data frame that specifies the color. So, if you want, you can use this whole color palette and sort of get what you get. But if you want a specific color, you'd have to add a color variable to the plot. All right. Let's keep going. And we'll talk about arranging plots. So, you can use faceting, which we used before, to split one plot into multiple plots. But you may also want to produce a single figure that has multiple plots within it using different variables or different data frames. So, you can use this gridExtra package, which, if you were using RStudio, you would install.
And that allows us to combine separate ggplots into a single figure using this grid.arrange function. So, scroll all the way down to see all the code here. So, here we're using grid.arrange. We've created a count plot and we've created a weight box plot. I think those are the only two. So, we're going to try to arrange those together using this grid.arrange function. So, there we go. This is like if you wanted to publish this in your paper and say, what are the weights, and how many of each of these species are here? This is a way to print two different plots next to each other. So, give that a try. And then our last step is going to be exporting the plots. So, once you've made your beautiful plot and you want to export it, you can save it in any file format you would like. You can use the export tab on the plot in RStudio, but that saves your plot at a lower resolution, and you want your plots to look as good as they do on the screen. So, here we've written the code that you would use in RStudio. It doesn't work in our tutorial here, but you can copy and paste this into RStudio, using ggsave, and save it with the file name that you like and the file type. So, give that a try when you're doing this at home. And we also have a whole tab with some other fun interactive graphics stuff that you can try at home too. We're not going to do this here, but you can create these pop-ups. So, if you're publishing online, you can create these pop-up windows that give more information about your data. And I have a couple more things. So, that's the end of our workshop part. So, I hope that you learned some interesting new functions that you can use in your work. And then, I also wanted to tell you about this page. This is probably where you signed up, montana.edu slash data science slash training. So, you can register for a workshop there. And then, you can also see we have old workshops, recordings that are here.
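A sketch of combining two plots and exporting the result, assuming the gridExtra package is installed. The data, plot names, output file name, and dimensions are all illustrative. `arrangeGrob()` is used here instead of `grid.arrange()` because it builds the combined figure without drawing it, which is convenient when the next step is saving.

```r
library(ggplot2)
library(gridExtra)

# Hypothetical stand-in for the workshop's `surveys` data (column names assumed)
set.seed(42)
surveys <- data.frame(
  species_id = rep(c("DM", "NL", "PP"), each = 50),
  weight     = c(rnorm(50, 43, 5), rnorm(50, 160, 40), rnorm(50, 17, 3))
)

count_plot  <- ggplot(surveys, aes(x = species_id)) + geom_bar()
weight_plot <- ggplot(surveys, aes(x = species_id, y = weight)) + geom_boxplot()

# Two plots side by side; grid.arrange(count_plot, weight_plot, ncol = 2)
# would draw the same layout straight to the screen
combined <- arrangeGrob(count_plot, weight_plot, ncol = 2)

# The file extension picks the format; width and height are in inches.
# dpi matters for raster formats such as .png or .tiff.
out_file <- file.path(tempdir(), "combined_plots.pdf")
ggsave(filename = out_file, plot = combined, width = 10, height = 5, dpi = 300)
```

In a real project the file would go somewhere more permanent than `tempdir()`, for example a figures folder next to the analysis script.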
So, if there's something you missed and you want to go back and check on it, or if you're not able to make a future workshop, you can check on these. And we try to keep those updated with the most recent workshop.