 Hey folks, if you've watched any of my episodes in the past, you would be forgiven for not realizing that R is first and foremost a statistical programming language. I mainly use R to make pretty visuals and to summarize data and do all sorts of data manipulations. But it's also very powerful for doing data analysis with statistics. And don't get me wrong, I definitely use it for its statistics functions. I just haven't kind of displayed them here in these episodes. Well, in today's episode, thanks to a loyal viewer of the show, I am going to build a regression model relating the amount of snowfall to the amount of precipitation, right? And so my understanding is that they take snow and they then allow it to melt and then they record how much precipitation there was for the day. And that's how we're able to get precipitation data that's on kind of an equal footing all over the year, okay? And so there's a rule of thumb out there that I've seen on various websites that it's maybe 10 inches to one inch, right? So if you have an inch of rain that's basically 10 inches of snow and vice versa, I think the viewer said it was maybe 12 to 1 or something like that to where he lives. Doing a little bit of digging, you see it's also dependent on temperature that depending on the temperature, the snow will either pack more or less, right? So that's what I want to look at today. What is the relationship between the amount of snow and the amount of precipitation? And can we model that also taking into account the temperature? As always, I'm going to do that in our studio. I've got my RStudio session fired up and ready to go. We are going to build off of the code that is in code local weather.r. This goes out to the NOAA website and pulls down data for a NOAA weather station that's close to where I live that has 130 years worth of data. If you want to get that script so that you can plug in your longitude and latitude for where you live and you can kind of work along with me, I strongly encourage you to do that. So go down below in the description. There's a link to a blog post that gets you everything you need to get up and running. I'm going to go ahead and run this script to get the data and also load tidyverse meta package, as well as lubricate and the glue packages. So the data frame it brings in is called local weather. So that brings in our table that's got the date, the maximum temperature for each day, the amount of precipitation and the amount of snow. I'm going to go ahead and remove all of the NA values. So we'll do drop NA. And if you don't give it any arguments, it looks across all the rows, all the columns in each row and removes the row that has an NA in any column, we then see that we've gotten rid of those NAs. Also, I'm only interested in those days where there was snow, right? I'm not going to calculate much snow in the middle of July when even in Michigan, it doesn't snow. So we'll go ahead and do a filter to do snow greater than zero. And so now we get those 4,830 days over the last 130 years where there has been snow in and around Ann Arbor, Michigan. So I'm ultimately might look at the day level. But for now, what I'm going to go ahead and do is look at the annual aggregate to see if there appears to be any kind of association at all. We talked a lot about correlations and ways to view those a few episodes back. So to aggregate by the year, I need to go ahead and extract the year. So we'll go ahead and do mutate year equals the year function on the date column, giving us this column year, right? And then I can do a group by year, summarize, and I will then do prcp equals sum on prcp, that'll be the total precipitation for the year. And then snow will be the sum on snow, right? And so that's a little bit weird because the snow starts in the preceding year, right? So snow might start October, November, and go into March, right? So you kind of span the change in year, but whatever, we won't worry about it so much. We're kind of looking at a pretty broad view here. This gives us the year, the precipitation, and the snow. What we've seen in the past, we can look at this a variety of different ways. So I'll call this prcp underscore snow. And then I'll take prcp snow, and let's make a facet out of it. And so to do that, we need to get prcp and snow in the same column. So we'll do pivot longer, minus year, that gets us the three column data frame, ggplot, aes, x equals year, a y equals a value. And then we'll do geom, a line. And facet wrap, tilde name, that's the default name for the column names, the column that holds the column names, right? And then we'll do n call equals one. And of course, we need to do scale equals free y to free up that top y axis. And in here, we've got that 1891 data, that's the very first time point, which is basically down at zero for the amount of snow. So I'll come back up here. And we'll have to do this after we created the year. So I'll do filter year not equal to 1891. And I probably also don't want 2022, a year not equal to 2022. All right, so that looks better. And so here, I can see something going on, right? So we can kind of see, you know, it's a little bit flat to descending up to 1960 for both snow and precipitation. And then after 1960, they both go upward. So another thing we've seen, we could do is go ahead and plot it as a scatter plot, ggplot, aes, x equals, let's do prcp, y equals snow, geome point. And there's a pretty obvious correlation, right? Let's go ahead and add in the year as a color. Yeah, and so certainly, as we get these lighter shades of blue, we're moving into the present period. And we can do core dot test, prcp, snow dollar sign, prcp, comma, prcp, snow dollar sign, snow, my fingers feel fat, I'm not typing very well today, I don't know why. And then we'll go ahead and do the Pearson correlation. And so if I pop this open, we see that the correlation is about 0.83. The p value is significant. And so precipitation and the snow is obviously correlated. Now I want to go about modeling the data, right? And so I'd rather do this at the daily level. And so maybe what we can do is I'll go ahead and call this annual. And we'll go ahead and replace that in here just so that we can have all of the code, all the data available to come back to in the future. And so I'm going to go ahead and bring all this stuff down to do it on a daily level, right? And so to change it to the daily level, I'm going to go ahead and remove the group by and summarize. And now what we can see is we've got our date, the tmax, prcp, snow and the year. I don't really need the year going forward. But that's okay, or the date really, we'll go ahead and leave it in there for now. Maybe what we can do is go ahead and repeat this scatter plot view, right? And so we could do a gg plot as x equals prcp, y equals snow, and then color equals year. Although Geo point there does appear to be a relationship. There's also some places where we have a lot of snow, but not much precipitation. And other places where there's a lot of precipitation, but not much snow. And so again, one of the things I remembered hearing reading as I was doing some research on this is that temperature matters. So let's go ahead and change instead of color by year, let's do color by tmax. We get this kind of funky gradient that goes from like black to light blue going through zero, which isn't super helpful. One thing I'm wondering though is that zero is zero Celsius, right? So above zero, things start to melt. And so I think I'm just going to guess that that's going to be kind of funky. So I'm going to make the executive decision of removing any days where the maximum temperature was above zero. Because it's not uncommon to get days like in March, I know where we'll get some snow, it's like flurries. But it like melts on contact, right? Because the temperature is still, you know, relatively high. So what we'll do is let's go ahead and do snow greater than zero, and tmax less than or equal to zero. So that really cleans things up. I'm not sure how many points are overlapped here. I guess we can look at that here, right? We can go ahead and run everything through the filter. And so you've got about 2,600 different observations where we had a maximum temperature at or below zero, where there was snow. Okay, good. So I'm confident that there's something going on. So let's again call this PRCP snow daily. And I will break up the pipeline to be that. And then we'll go ahead and have PRCP daily to make the scatter plot. Good, so I'm going to go ahead and remove the legend because I'll know that the darker points are more negative. So let's do show dot legend equals false gets us a bit more room to play with. All right, so I think there's evidence that there's a correlation. Again, we can repeat the correlation analysis that we did up here with core test. I'll bring that down. And again, instead of PRCP snow annual, we want daily. And again, I think our correlation before was like 0.83, it's gone up to about 0.87. So there is a strong association between snow and the equivalent precipitation as as obviously, hopefully you would expect. So how can we start to model the data and visualize this interaction or the association between these variables? Well, one thing we can certainly do is come in and we can then do geome smooth to fit a smooth line or curve through the data will do se equals false. So that gives us kind of a saturation based curve. I'd rather have a line a straight line. So one way to get a straight line through the data with geome smooth is to change the formula. We see here that it basically fits a spline or polynomial through the data by default. And so I can come up back to geome smooth and do formula equals and then in quotes, y tilde x, and then I'll do method equals lm for a linear model. And so that then gives me a straight line through the data. One thing that I'm noticing if I look really closely, though, is that the tip of that blue line doesn't go to 00, right? So if we have zero snow, we should have zero precipitation, right? So to fix that, we can add plus zero to our formula. So now we can see the ever so slight change, right? If I toggle back and forth between these, you can see that the line in this case goes right down to 00, because we added that plus zero, whereas this allows the intercept to to vary, right? Cool. Okay, so that is the smoothed based on the linear model with explaining the amount of snow based on the amount of precipitation, right? Something else we could add to this would be like geome AB line. And so that is a line where you can define the intercept and the slope. So I'll say intercept equals zero and slope equals 10. And so we see that line going through the data. Maybe what we could do would be to go ahead and make that maybe a little bit thicker, we could do size equals one. So that rule of thumb of 10 to one actually does a pretty decent job. It's not so far off from the linear model of predicting snow based on precipitation. What I'd like to do though is incorporate temperature. And so the challenge with geome smooth at least is that I can't add or say like do y till the x plus temp or T max, right? What I need to do is build a model outside of this gg plot pipeline and then bring that model in and render the data through geome smooth based on the model that I predict or generate outside of gg plot. So how do we do that? Well, we're going to do some statistics. What do you know, right? So we're going to do LM. So that's the linear model function, the same method that we use down here for GM smooth, right? And so we'll take snow and we'll do tilde. So this is a formula notation where we take snow and explain it by T max times PRCP, right? And so that T max times PRCP tells LM to look at the main effects of T max and PRCP accounting for each other, as well as their interaction. And then we'll do data equals PRCP snow daily. This then gives us some summary output. It's not super clear what's going on, but we see the coefficient. So the intercept of 7.7. So I need to get rid of that. So I can go ahead and change my model to do plus zero. So what we find is the rule of thumb actually isn't that far off, right? So 8.6 units of snow gives you one unit of precipitation, right, which is pretty close to 10 to one. And, and that's going to be at zeroes degrees Celsius. As you decrease the temperature, you're going to increase the amount of snow that you would get for a unit of precipitation, right? So it's kind of the opposite effect, right? Cool. Another thing we could do with this would be to assign this to a variable, I'll call this a snow model. And I could then do summary on snow model. And again, looking at the table, we find that the main effects of T max and PRCP are significant, they have very small p values. But the interaction is not significant. So we don't have to worry about the interaction, interactions make things just horribly complicated to deal with. And again, as we saw before, we have a pretty strong correlation coefficient. Good. So now we want to bring the model into our data. And so how do we do that? Well, what we can do is we can add the fitted data into my data frame. So these lines in the plot that we've already seen are basically predictions based on the amount of precipitation, basically how much snow would there be for that much precipitation, right? And so to do the same thing based on snow model, what I could do is the predict function on snow model applied to PRCP snow daily, and then this will output the predicted values for each row of the data frame, right? And so ideally, this would be a column in the data frame. So we'll go ahead and do that. And we'll do mutate. And we'll do predicted equals that predict function, right? And I think I've got the right parentheses. And now what I can do would be to add to all this a geome smooth, and I'll do AES y equals predicted, right? And so that's the column that I'm going to be mapping into the aesthetic for geome smooth. And again, I'll do se equals false. And let's do color equals red. And so that line looks a little bit different than the simple line that we did, just based on precipitation and snow. Let's see what would happen if we had done our snow model without T max. And so we see that the estimate for PRCP is about nine, whereas when we added temperature, it was like 8.6, right? So I'm going to go ahead and put T max back in there. So this is pretty cool, right? We can fit data using linear model. There's also a general linear model, there's also nonlinear models, right? We could make this model a lot more complicated than we already have. It looks like it might be a little bit nonlinear, there might be some quadratic function that might be able to fit the data a little bit better, especially for precipitation values that are less than 10. So anyway, but again, we can fit the model with linear model, we can look at the coefficients with summary, and then we can then plug all that in to our PRCP snow daily, and then plot that with geome smooth. So let's go ahead and clean this figure up a little bit to make it a bit more presentable. So I'll go ahead and remove the color by T max. Let's also add some labels to our x and y axis. So we'll do labs, x equals total daily precipitation. And again, that's in millimeters, and then y equals total daily snowfall in millimeters, right? Let's also go ahead and use theme classic. So those points are now black, which is the same color as the line. Let's go ahead and change those to be a light gray. So let's go ahead and remove the legend equals false because there's no legend anyway, we can do color equals light gray, and that looks pretty nice. One thing I don't like about the geome AB line is that extends out beyond the boundaries of the plot. That's kind of annoying. So what we could do instead would be to do geome segment. And instead of all the intercept in the slope, so we'll do x equals zero, y equals zero, and then x and equals, let's say 40, and then y and equals 400. So that's better. It would be good to have kind of the maximum value for total daily precipitation. So again, we have that data frame. And so we could get the max of that, right? So that and this and you're probably having a hard time understanding what I'm saying. So what we'll do instead of 40 and 400 would be to do PRCP snow daily dollar sign PRCP. But I want the max of that, right? So we'll do max on that. And then y end is going to be 10 times that, right? So let's go ahead and do that. So we'll do 10 times that max. And so we now see our lines starting at zero and going out to the maximum value of whatever that point is on the x axis. So I would like to have a legend for these three different lines. And something I could do would be to add an aesthetic for the color, right? And so what I'll do is AES color equals and then in quotes, here, I'll put simple, and then for GM segment, I'll go ahead and add AES color equals rule of thumb. So I'll get rid of this color equals red. And then in here in this aesthetic for GM smooth, where I'm using the predicted, I'll then do color equals advanced. And now I have a legend that I've created, based on forcing the names or the groups to the color, right? And so now what I could do would be to go ahead and do scale color manual. And I'll then do name equals null, because I don't want to name for the legend. I'll then do breaks equals. Let's do rule of thumb. And let's do simple, and then advanced, right? And then labels of 10 to one rule of thumb, simple model. And then we'll do advanced model. And then our values will be the three colors, right? And so I'll do for the rule of thumb, let's make that black, the simple, let's make that blue. And then the advanced, let's do red. And we get a legend and our three colors. Of course, we can save this out to figure. So we'll do g save figures, model snow ratio dot PNG. And let's do width equals six, height equals four, very good. We now have our legend going with our plot. We could maybe move the legend around, but I think it's okay for our current purposes. Again, what I want to drive home in this episode of Code Club is that R is a statistical programming language. And we shouldn't forget that. There's all sorts of complicated things you can do with statistics within our really encourage you to explore that further. Hopefully if you're taking a statistics class at a university, by now they're using our because that's probably what the professors are using, right? Anyway, encourage you to play around with this type of approach of modeling the data, really appreciate the viewer who suggested this idea that they went and looked with their own data from I believe Indiana of what the ratio was between snowfall and precipitation. I think they found a value of about 12. So again, maybe it varies by geography in addition to temperature. But yeah, let me know what you find works well for your data. I'd love to know what kind of ratio or what kind of coefficients you're getting for both the precipitation, as well as for the T max values. So play around with this, try doing this type of work with the data you're using for your job. And we'll see you next time for another episode of Code Club.