Hey, folks. One of the really powerful things about working in the tidyverse is that it's so straightforward to use tools like dplyr and ggplot2 to explore your data. The data exploration tools are just so powerful. And while I geek out on making pretty figures and trying to make things look publication quality, the really good thing about ggplot2 and dplyr is the ability to quickly iterate through a number of different types of analyses to better understand what's going on in your data. What I want to show you today is a few strategies that I would use to see if my data are correlated with each other.
Over the past several episodes, I have been looking at local weather data from outside of Ann Arbor, Michigan, near where I live, to try to get a sense of global climate change as it affects my local area. Some of the data that I've been working with are precipitation data as well as temperature data. And so what I want to look at today is what the annual trends have been like in precipitation and maximum temperatures.
Over here in our studio, we're going to get going with a new R script that I'm calling tmax_prcp_correlation. tmax and prcp are the variable names, the columns in the data frame I'm working with. When I run this source function, it runs an R script called local_weather.R in my code directory that goes up to the NOAA website and pulls down the data for my local NOAA weather station. If you want to get a copy of your own data, then I encourage you to go back and look at the previous episodes in this series, and also to go ahead and get a copy of this project. For instructions on doing that, you can go down below in the description to a link to a blog post that will get you everything you need to get up and running. And of course you can go into local_weather.R and insert your latitude and longitude to get the data for where you are.
All right, so this has loaded and run. Of course, it also loads the tidyverse, lubridate, and glue packages. And so if I look at local_weather, I see that I get a data frame with the date, the tmax, precipitation, and snow. Some of you, I know, don't know what snow is, but trust me, here in Michigan we get a lot of that.
All right, so I'm going to go ahead up here and take local_weather. What I want to get is the average annual temperature and the total precipitation by year. We've seen this a number of times already, but we can do a group_by. I want to group my data by the year, but to group it by the year, I need to extract the year from the date column, right? And so nicely, we can use the mutate function to create year by running the year function from the lubridate package on the date column. Great, that gives us our year. I can then pipe this to a group_by on year. And I can do a summarize with tmax for the average tmax, so mean of tmax, and then prcp as the sum of precipitation over the year.
And so what naturally happens is we get a bunch of NA values. That is because when we looked at the previous version of the table, before we aggregated by year, we had NA values. If you calculate the mean of a column that has NA values, or the sum of a column that has NA values, the functions don't know what to do with that, right? So what you could do is go ahead and remove the NA values before you calculate the sum or the mean. You could do that with something like na.rm = TRUE. So let's go ahead and do that approach. Alternatively, what I could have done would have been a drop_na on tmax and prcp, right? I'll go ahead and leave that in there. What this would have done is remove those rows where tmax or prcp were NA values. I'm not going to do this drop_na approach.
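To make the difference between the two approaches concrete, here is a minimal sketch. The toy_weather data frame is made up for illustration (the real local_weather comes from NOAA), and the object names by_year and by_year_dropped are hypothetical.

```r
library(dplyr)
library(tidyr)
library(lubridate)

# Made-up stand-in for local_weather: a few days with some missing values
toy_weather <- tibble(
  date = as.Date(c("2020-01-01", "2020-01-02", "2020-01-03",
                   "2021-01-01", "2021-01-02")),
  tmax = c(1, NA, 3, 4, 6),
  prcp = c(10, 5, NA, 0, 2)
)

# Approach used in the episode: ignore NAs column by column inside summarize()
by_year <- toy_weather %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarize(tmax = mean(tmax, na.rm = TRUE),
            prcp = sum(prcp, na.rm = TRUE))

# Alternative: drop_na() discards the whole row if either value is missing
by_year_dropped <- toy_weather %>%
  drop_na(tmax, prcp) %>%
  mutate(year = year(date)) %>%
  group_by(year) %>%
  summarize(tmax = mean(tmax),
            prcp = sum(prcp))
```

In this toy data, 2020 keeps 15 mm of precipitation with na.rm = TRUE but only 10 mm after drop_na(), because the day with a missing tmax is thrown away along with its precipitation.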
The reason I'm not going to do this drop_na approach is that it removes the entire row if either tmax or prcp is an NA value. We might not have precipitation data for a day, but we might have temperature data for that day. What we do here with mean(tmax, na.rm = TRUE) is remove the NA values before we calculate the mean. So let me show you a quick example. If we define a vector x as c(NA, 2, 3, 4) and I do sum on x, that gets me NA, right? That's what we'd expect. But if I do sum(x, na.rm = TRUE), and you'll see there in the pop-up that the default is FALSE, then it removes the NA values before calculating the sum. And there we get nine. It does the same type of thing for the mean. Running that, we now get the year, the tmax, and the precipitation.
Now what I'd like to do is look at a couple different ways of visualizing these data to get a sense of whether or not the average temperature and total precipitation for the year are correlated with each other. So I will go ahead and call this tmax_prcp, and we will use that data frame now for a variety of different analyses.
The first thing I'd like to do is make a facet where we give them a common x axis by year. So we can take tmax_prcp. Again, what we have in here are three columns: the year, the tmax, and the prcp. To make a facet where we have one window for temperature and one for precipitation, I need the numbers for tmax and prcp to be in one column, and I need the names tmax and prcp to be in another column. We'll do pivot_longer on everything but the year. Now this gives me a year column, a name column, and a value column. We've seen in other episodes how you can specify the name and the value, but let's not worry about that so much right now. We can pipe this to ggplot with aes, x = year, y = value. Then we can add to this geom_line. And we can then add facet_wrap.
And I'm going to go ahead and facet by name. This gives me two side-by-side panels, prcp and tmax. That's not exactly what I want. I want them on top of each other, because if I have a common x axis by year, then I can more easily see the change in these two variables over time. So to facet_wrap, I could add ncol = 1. Now we get one column with precipitation on the top and tmax on the bottom.
Of course, we see that both panels are getting the same y axis scale. And that doesn't make sense for tmax, because we would never have 1000 degrees Celsius, right? Not here on Earth, hopefully not for a very long time. So what I can do is add to facet_wrap scales equals, in quotes, free_y. This will free up the y axis scale to vary by the observed data. Now we see that tmax has a much more plausible y axis scale, and we can see how that changes over time relative to the precipitation.
One thing I notice is that I have my 1891 and 2022 data on here, which I would probably like to remove because we don't have full years' worth of data for those. The total precipitation and the average temperature for those years are going to be a little bit screwy. So I'll come back up here and do a filter. I'll say year not equal to 1891, and year not equal to year(today()). So year(today()) should give me this year, right, 2022. Now when I remake tmax_prcp and regenerate my plot, I no longer have those funky values at the ends.
And so what I see, and I don't know that we've looked at the precipitation data this way, is that it's relatively flat and then increases in total precipitation over time. Maybe we could see this a little bit more easily if we add geom_smooth. I'm going to go ahead and remove the standard error clouds, so I'll do se = FALSE.
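Putting the faceting steps above together, the pipeline might look something like the following sketch. The toy_tmax_prcp data frame is simulated purely so the code runs on its own; it is not the NOAA data, and the names long_data and facet_plot are hypothetical.

```r
library(dplyr)
library(tidyr)
library(ggplot2)
library(lubridate)

set.seed(1)

# Simulated stand-in for the yearly summary (year, tmax, prcp)
years <- 1891:year(today())
toy_tmax_prcp <- tibble(
  year = years,
  tmax = rnorm(length(years), mean = 15, sd = 1),
  prcp = rnorm(length(years), mean = 800, sd = 120)
)

# Drop the two partial years, then stack tmax and prcp into name/value columns
long_data <- toy_tmax_prcp %>%
  filter(year != 1891 & year != year(today())) %>%
  pivot_longer(-year)

# One panel per variable, stacked in a single column, each with its own y scale
facet_plot <- long_data %>%
  ggplot(aes(x = year, y = value)) +
  geom_line() +
  geom_smooth(se = FALSE) +
  facet_wrap(~name, ncol = 1, scales = "free_y")

facet_plot
```

Leaving out scales = "free_y" reproduces the problem described above, where both panels share the precipitation-sized y axis.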
And so again, we can see that in general, the tmax has been increasing over the past 130 years. And the precipitation has really been going up over the past 60 years as well.
This plot gets me thinking about something that's generally considered a no-no in data visualization, which is having a dual y axis. We could put tmax on the left and precipitation on the right. Let's try this and see how good or bad it might look. So we'll come down here and take tmax_prcp. I'm going to start by making the temperature plot, right? So ggplot with aes, and on the x axis we'll put year, and on the y we'll put tmax. I will then do geom_line, and I'm going to make this color blue. Very good. There's my temperature on the y axis. I will also call this tmax_plot. Now I want to take tmax_plot and add to it data for the precipitation. So we'll do geom_line with aes y = prcp, and I will then do color equals red.
And so of course, as we saw before when we made the facets, the data are at very different scales, right? Our precipitation is, you know, maybe 50-fold higher than what we saw for our temperatures. So what I could do is perhaps take prcp here and divide it by 50. That gets the precipitation and the temperature data to overlap, and their averages are about the same. Of course, the variations are quite a bit different.
I think what I'd like to do is scale them between zero and one. To do that, let's come back up here to tmax_prcp. So again, we'll take tmax_prcp, and I want to get the min and the max. What I'm going to do is a mutate on tmax to subtract out the smallest value of tmax, right? So then we'll do min on tmax. Maybe I'll call this tmax_tr. That gives us a transformed version of tmax, right? That's the bottom, so we'll have a value at zero. And then we want to get the maximum to be one, right?
And so then we will take all of this, and I'll put that in parentheses. I'll scale this so that the maximum of this is one, right? So now I can divide this by the max of tmax minus the min of tmax, and I need that in parentheses, right? Now if I do a summarize with min as the min of tmax_tr and max as the max of tmax_tr, yeah, it goes from zero to one. Cool. So this is my transformed tmax, and it's between zero and one.
I would like to keep track, though, of the max and min values. This is going to get a little bit messy, I know. But I will then do tmax_min as the min of tmax, and tmax_max as the max of tmax, right? This then gives us our min and max values. Great. Now we want to repeat it for the precipitation, right? So I'm going to go ahead and copy this. If I was thinking ahead, we'd probably make this into a function. But hey, you know, we're not. So we'll go ahead and change all these tmaxes to prcp's and get that loaded. Great. So now we have all that. I can now feed in my tmax_prcp, and I'm going to call the result scaled_tmax_prcp, like that.
Now I can take scaled_tmax_prcp and feed that into the rest of the plot, which again is going from x equals year to y equals tmax. But that needs to be tmax_tr, right? So underscore tr, like that. Now I see I go from zero up to one, and that's great. I can now do the same thing we did before with prcp, except we will use the underscore tr version. And so now we can see that our red and blue lines are snugly on top of each other.
So what I will do is come back to my tmax_plot, and I'm going to add a y axis to this, right? We've seen this before. In the last episode I did something like this, where I converted the millimeters to centimeters within scale_y_continuous. So we'll do scale_y_continuous. I want my labels to go from 10 to 30 by fives, right?
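Before moving on to the axis labels, here is one way the zero-to-one rescaling could be wrapped in a function, as hinted at above. This is only a sketch: rescale_01 is a hypothetical helper name, and the toy values are made up rather than taken from the weather data.

```r
library(dplyr)

# Hypothetical helper: rescale a numeric vector so it runs from 0 to 1
rescale_01 <- function(x) {
  (x - min(x, na.rm = TRUE)) / (max(x, na.rm = TRUE) - min(x, na.rm = TRUE))
}

# Made-up yearly values standing in for tmax_prcp
toy_tmax_prcp <- tibble(
  year = 2001:2005,
  tmax = c(12, 14, 13, 16, 15),
  prcp = c(700, 900, 650, 1000, 800)
)

# Transformed columns plus the min and max needed later to label the axes
scaled_toy <- toy_tmax_prcp %>%
  mutate(tmax_tr = rescale_01(tmax),
         tmax_min = min(tmax),
         tmax_max = max(tmax),
         prcp_tr = rescale_01(prcp),
         prcp_min = min(prcp),
         prcp_max = max(prcp))

# Quick check that the transformed column spans zero to one
scaled_toy %>%
  summarize(min = min(tmax_tr), max = max(tmax_tr))
```

Writing the transformation once avoids the copy-and-paste step of changing every tmax to prcp by hand.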
And so then I need to figure out the breaks that I want to put those at. So I do breaks equals seq from 10 to 30 by five, and I'm going to basically take this and run it back through the transformation that I did before, right? So I'm going to subtract out 11.7, so we'll do minus 11.7, and then we're going to divide this by 17.5. That looks pretty good. Maybe what I'd like to do is change my limits to go from like 10 up to 32. Again, I can do the same type of thing. I can do limits equals 10 to 32, and again I'm going to subtract 11.7 and divide by 17.5. I need to open parentheses up here, and I think I need to close parentheses to close out the scale. And so now if I look at tmax_plot, I see, yeah, I've got a pretty good range from 10 up to 30 for my temperature.
Now I want to do the same type of thing for adding on the line for the precipitation. That line now is red. But what I'd like to do is get the same type of scaling for the right side. To get the right side, I'm going to do the same type of thing that I did here for the tmax_plot. I'm going to do scale_y_continuous, and I will then do sec.axis equals sec_axis, and then I'll do trans equals tilde period. That basically means I'm not going to transform the data. I'm going to do the transforming in the labels and the breaks. So I will go ahead and copy labels and breaks down here. And let's just make sure that we've got all our parentheses in the right place. It gets a little bit confusing sometimes. I'm going to want to go from 300 to 1300, and I'm going to go in 200 millimeter increments, right? We'll do the same thing down below, right? Add that in. And then we need to give our transforming values. So again, if we look at scaled_tmax_prcp, we have 420 and 1298. So I'll do 420 and 1298. And it's telling me that a scale for y is already present, and adding another scale for y will replace the existing scale.
So it doesn't like that I have two scale_y_continuous calls. That's cool. I'll go ahead and grab that first one, and we will merge it with the second one, right? We'll go ahead and grab that and just make sec.axis another argument, right? I need to reload tmax_plot without that first one. Reload that, and now the warning messages go away. And we see that they overlap on each other fairly well. I am noticing that this should probably go up to 1800. So let's go up to 1800, and we'll go by 300 increments. Yeah, that looks pretty respectable.
We can then add to this a name. So I'll come to the end and do name equals total precipitation in millimeters. That gives us total precipitation in millimeters on the right. We could also add name here to be average annual temperature. And so now we have both labels. It might be confusing which is which, right? So what we can add to this would be a theme. We'll do theme with axis.title.y.left equals element_text, color equals blue. I think that's what we used up here, yeah, blue for the left. Then we'll do the same thing for the right, right? And this will be red. So now we have the blue label for the blue line and the red label for the red line, and we have our data right on top of each other.
And so the question then is, do you prefer this? Or do you prefer them faceted on top of each other? In general, people prefer them faceted on top of each other, because with this it gets really hard to decipher what goes with the left and what goes with the right. For data like this, where they're right on top of each other, it's really hard to decipher where the lines are, and it just gets to be a bit of a jumbled mess. Also, there are funny things you can do if you're compressing your data too much for one variable on one side of the y axis versus the other. So again, that's why I would personally prefer to do things with the facet that we saw up here.
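For reference, the dual y axis plot built above could be sketched roughly as follows. The data are simulated so the example stands alone, and to_unit and ticks_within are hypothetical helpers that are not in the episode. The breaks are computed from the data's own min and max rather than the hard-coded values used in the video, and the one-sided formula is passed to sec_axis by position rather than by the argument name.

```r
library(dplyr)
library(ggplot2)

set.seed(2)

# Simulated stand-in for tmax_prcp
toy <- tibble(
  year = 1950:2020,
  tmax = rnorm(71, mean = 15, sd = 1),
  prcp = rnorm(71, mean = 800, sd = 120)
)

# Hypothetical helper: map values in original units onto the 0-1 plotting scale
to_unit <- function(value, lo, hi) (value - lo) / (hi - lo)

# Hypothetical helper: round tick values that fall inside the observed range
ticks_within <- function(x) {
  ticks <- pretty(x)
  ticks[ticks >= min(x) & ticks <= max(x)]
}

t_lo <- min(toy$tmax); t_hi <- max(toy$tmax)
p_lo <- min(toy$prcp); p_hi <- max(toy$prcp)
t_ticks <- ticks_within(toy$tmax)
p_ticks <- ticks_within(toy$prcp)

dual_plot <- toy %>%
  mutate(tmax_tr = to_unit(tmax, t_lo, t_hi),
         prcp_tr = to_unit(prcp, p_lo, p_hi)) %>%
  ggplot(aes(x = year)) +
  geom_line(aes(y = tmax_tr), color = "blue") +
  geom_line(aes(y = prcp_tr), color = "red") +
  # A single scale_y_continuous() holds both axes; the data stay on the 0-1
  # scale and only the break positions and labels are converted
  scale_y_continuous(
    name = "Average annual temperature",
    breaks = to_unit(t_ticks, t_lo, t_hi),
    labels = t_ticks,
    sec.axis = sec_axis(~ .,
                        name = "Total precipitation (mm)",
                        breaks = to_unit(p_ticks, p_lo, p_hi),
                        labels = p_ticks)
  ) +
  theme(axis.title.y.left = element_text(color = "blue"),
        axis.title.y.right = element_text(color = "red"))

dual_plot
```

Keeping everything in one scale_y_continuous() call avoids the message about an existing y scale being replaced.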
I think this is easier to look at than this. There are people that really like to have multiple y axes. In general, I think the data visualization field is kind of sour on double y axes. But know that you can do it in ggplot2. It's not totally straightforward to do. And if it's not totally straightforward to do in ggplot2, kind of like making pie charts, that's generally the developers telling you this is a bad idea, right? So again, the alternative is to stack them as separate facets. This would be my preference, because then it's more clear what is what, and you're keeping the values on the y axes for the two different charts separate. The double axis just gets to be a bit of a jumbled mess. It gets confusing to have to look left versus right and remember which color goes with which side. And so I think that's why you'll find a lot of people in the data visualization community are a bit sour on this approach.
All right, so we've looked at faceting, and we've looked at the double y axis. Let's go ahead and put each variable on a separate axis. So we'll take tmax_prcp, which we've been working with, right? Again, we've got the year, the tmax, and the prcp. Let's do ggplot with aes, and on the x let's put tmax and on the y prcp. Then I'm going to color it by the year, and let's do a geom_point. Very good. So we see that the lighter points are for more recent years, and those tend to be in the upper right quadrant of the plot, I feel. And so there is a bit of a temporal progression in the data, with things getting warmer and things getting wetter, right? But for prcp and tmax themselves, without looking at the year, if I go ahead and turn off the color equals year, we find that it looks like a big cloud, right? This is the more direct test of a correlation. We could add to this geom_smooth, and we see that it's like a straight line through it. If we wanted to force a straight line through it, we could do method equals, in quotes, lm.
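The scatter plot versions described above might be written along these lines. The data here are simulated and uncorrelated by construction, purely so the sketch runs; the names colored_scatter and lm_scatter are hypothetical.

```r
library(ggplot2)

set.seed(3)

# Simulated stand-in for tmax_prcp
toy <- data.frame(
  year = 1950:2020,
  tmax = rnorm(71, mean = 15, sd = 1),
  prcp = rnorm(71, mean = 800, sd = 120)
)

# One variable per axis, with points colored by year to show any drift over time
colored_scatter <- ggplot(toy, aes(x = tmax, y = prcp, color = year)) +
  geom_point()

# Without the color, plus a straight-line fit through the cloud of points
lm_scatter <- ggplot(toy, aes(x = tmax, y = prcp)) +
  geom_point() +
  geom_smooth(method = "lm")

colored_scatter
lm_scatter
```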
And that's like bang on flat, right? So to do the statistical test, we could do cor.test. We could then do tmax_prcp, dollar sign, tmax, comma, tmax_prcp, dollar sign, prcp. Again, that dollar sign is another way to get a column out of tmax_prcp. This, and I'll go ahead and open this up, gives us Pearson's product-moment correlation. We see that we get a p value of 0.76 and a correlation of 0.02. Basically, no correlation. That's the Pearson. If you wanted the Spearman, which is the nonparametric version, you could then do method equals spearman. This then again gives us a rho, a correlation coefficient, of 0.018 and a p value of 0.83. Again, you would only do one of these for an actual study. If you wanted to get rid of this warning message, you could add exact = FALSE to the second one. That gets rid of the warning message and doesn't calculate the exact p values. And it doesn't really matter. It's 0.838. It's not significant, right? And so average temperature and total precipitation by year are not correlated with each other.
Again, these are a variety of visual ways to look to see if different data are correlated with each other. The first thing we did was to facet them: take each variable, give them a common axis like the year on the x axis, and then put them on top of each other and visually inspect whether or not there's a correlation. Similarly, we also looked at how we could go about making a double y axis to plot the data on top of each other to see if there's any obvious correlation. There didn't appear to be. But we also talked about the downsides of having that double y axis. Finally, we took one variable and put it on the y axis, put another variable on the x axis, and plotted the data as a scatter plot. We didn't really see much of a correlation.
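The correlation tests described above can be tried on any pair of numeric columns. Here is a small sketch using base R's cor.test on simulated, independent values, so the numbers it returns will not match the ones quoted from the real weather data.

```r
set.seed(4)

# Simulated stand-in for the tmax and prcp columns of tmax_prcp
toy <- data.frame(
  tmax = rnorm(100, mean = 15, sd = 1),
  prcp = rnorm(100, mean = 800, sd = 120)
)

# Pearson's product-moment correlation (the default method)
pearson <- cor.test(toy$tmax, toy$prcp)

# Spearman's rank correlation; exact = FALSE skips the exact p-value
# calculation, which cannot be done when there are tied values
spearman <- cor.test(toy$tmax, toy$prcp, method = "spearman", exact = FALSE)

# The pieces quoted in the episode: the coefficient and the p value
pearson$estimate
pearson$p.value
spearman$estimate
spearman$p.value
```

As noted above, for an actual study you would pick one of the two methods ahead of time rather than reporting both.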
We could use geom_smooth to fit a line through that, and we could use cor.test to do a statistical test of the correlation using the Pearson or Spearman method. Again, in this case, there did not appear to be a correlation.
I hope you appreciate the flexibility and power of these tools to do some very quick exploratory data analyses, as well as some of the hiccups you might run into along the way when it comes to thinking about how to visualize these data and whether or not they appear to be correlated with each other. I'd encourage you to play around with this for your data. If you do this with data from where you're from, let me know if you get a different result or if you see some type of correlation. It certainly is interesting that the trends in general show that as we've moved through time, southeastern Michigan has gotten warmer and wetter. Even if there isn't a one-to-one correspondence in those values, the general trends hold, but the year-to-year correlation doesn't seem to be very robust. Like I said, let me know what you find, and I'll talk to you next time for another episode of Code Club.