Hey folks. When I was first learning about the tidyverse, dplyr, ggplot2, and all the other great tools, one of the things that really captivated me was the ability to use group_by and summarize together to aggregate my data and output a summary statistic. You know, I've got a whole bunch of values for different categories and I want to know the average value for each of those categories. group_by and summarize just fit the bill. And I can remember thinking, why do they have group_by and summarize as two separate functions? I only ever use them together, right? I would never use group_by by itself, right? Well, would you?
In today's episode I'm going to show you how you can use group_by in a number of contexts other than just group_by and summarize. We are going to use our weather data, looking at temperature, to see a variety of ways that we can use group_by to split our data into different categories and then do a variety of downstream manipulations of those data. Sure, we'll do summarize, but we'll also do other things along the way.
So I'm over here in RStudio. I have the R script that I wrote for the last episode, which allowed me to get data from a weather station close to where I live in southeastern Michigan. If you want this code and everything else that goes along with this project, down below in the description there's a link to instructions on how to do that. I'll even put a video up here to help you figure it all out. I encourage you to plug in the latitude and longitude for where you live.
In that last episode we talked about different ways to detect and remove outliers. So I'm going to clean up this R script to remove a lot of the diagnostic stuff I had from the previous episode, so that I can use this script to populate other scripts, because I'm going to use my local data for this episode and a number of other episodes. I'll go ahead and remove these manipulations where I was turning certain dates into NA values. I'll also remove that drop_na, and I need a closing parenthesis on my mutate here. I can get rid of that filter, and then I think all this other stuff was various ways of plotting the data, which I'll go ahead and delete, and I'll save my R script.
One thing to point out about this is that it pulls data down from the NOAA.gov website at two different places. I actually wanted to record this video last week, but when I was going to record it I experienced the downside of using live data from a website: the NOAA website was down, so I couldn't pull the data and I was getting an error message. So again, that's a trade-off. If you keep the data locally on your computer, it's always available but it can get a bit dated. If you use the data as it is up on the web, it's live and updated to the latest day, but the website might be down, or it might go down permanently. There are trade-offs, right?
So I'm going to go ahead and save this as local_weather.R. I'll create a new R script, and then I will do source on code/local_weather.R. What I'm doing is assuming that I am running my R scripts from my project root directory. Where is my project root directory? Well, that's the directory with my .Rproj file, right? And so I don't have to change directories in and out of code. I think I've talked about this a long time ago. Maybe I'll put a link in here about why you don't want to use things like setwd.
So we'll go ahead and run this. It took a few moments to download the data and get it cleaned up. We have this stored as local_weather, and if I run that I see that I've got this data frame with date, tmax, prcp, and snow. tmax is now in Celsius; prcp and snow are in millimeters. If you go to the Environment tab in the upper right corner, you'll see the variety of values that I have stored, and that these all came over from sourcing code/local_weather.R. Also, because local_weather.R loads tidyverse, glue, and lubridate, those are already loaded and I don't need to put them at the top of my R script. Everything is good to go and we are ready to proceed.
At the beginning of this series of episodes we were using data accumulated by NASA looking at global temperature anomalies. Those temperatures were normalized to the data between 1951 and 1980, so if you looked at the average temperature between those two dates, the average would be zero, and everything is relative to that. So what I'd like to do is start out by making a plot that has the years across the x-axis and my temperature anomaly on the y-axis, and we will plot the average annual temperature normalized to 1951 to 1980. And again, we're going to start out using group_by and summarize.
So let's go ahead and take our local_weather. I'll do a select on date and tmax, just to keep everything simple. I need to get the year out of my date, and I can do that using functions from lubridate, which we saw before. So we can do mutate with year equals the year function on date, and what we see now is that we get the year.
I'm going to go ahead and remove the data from 1891 and 2022 because those are partial years. As I'm recording this it's July, and as you can see, 1891 started on October 1st. I can do that with a filter: year not equal to 1891 and year not equal to 2022. Now, if you're watching this in a year, you'll want to do 2023. But what would be even better than hard-coding 2022 would be to use my new favorite function, which is today. We see that today is July 22nd, 2022, and that is output as a date. So then I can do year on today and I get 2022, right? So I can store that as this_year and plug this_year into the filter, and now I've removed the data from 1891 and 2022. I can prove this to myself by piping this into tail and seeing that it ends December 31st of 2021. Good.
Now what we want to do is get the average temperature for each year. This is the traditional way that I learned group_by, which is to use it with summarize. So we'll do group_by on year and then we'll do summarize. And hold on, let me run the group_by just so you can see what this looks like. The output looks basically identical to what we had before. There's one subtle difference, in that it tells us in the output that our data is grouped by year and there are 130 year categories, right? From 1892 to 2021, that's 130 years. Now what we can do with summarize is get the average tmax, so I'll do tmax equals mean on tmax. And now we get a two-column data frame with the year and the tmax. If I then pipe this over to ggplot,
I can do aes with x equals year and y equals tmax, and then geom_line to get our line plot. That gives us a line plot a lot like what we saw for the global temperature anomalies, but at the same time it's a bit more noisy. Again, when you're averaging over the globe, things tend to average out and smooth out, obviously. Our temperatures go from about 12 degrees up to about 17 degrees. What we'd like to do is normalize them so that between 1951 and 1980 the average temperature is zero degrees, so we can say that a given year is so many degrees warmer than it was between 1951 and 1980.
So I'm going to insert a step after my summarize. Again, just to remind ourselves what this looks like, it's a two-column data frame with the year and the tmax. I'm going to put in a mutate. I'll say normalize_range equals, and it's going to be a logical, year greater than or equal to 1951 and year less than or equal to 1980. You could pick any range that you want. I think in one of the exercises we did we used a more recent period. You can do what you want to do. And so we see that for 1892 and the other early years it's all FALSE.
What I might normally do would be to create a separate data frame that gives me the average temperature between 1951 and 1980, and then use that in another mutate to subtract it from my current tmax value. I don't want to do that. I want to do it all within this single pipeline. So how can I do that? Well, what we could do would be to create normalize_mean. The mean is the total of the values divided by the number of observations, right? So if I did sum of tmax over n, say, that would give me the average temperature between 1892 and 2021.
That's not what I want. I only want the mean within that normalize_range. So what I could do is take tmax and multiply it by normalize_range. The thing to know about FALSE and TRUE values is that FALSE is zero and TRUE is one. So if I have a tmax value and I multiply it by one because it's within the range, I'll get that value, and if it's outside the range I'll get zero. I can then divide the sum by the number of values in the normalize_range. I can't use the n function for that, but what I can do is sum on normalize_range. Again, the numerical value of TRUE is one, so if I sum up all those normalize_range values, I get the total number of observations in the range. This then gives me the normalize_mean of 14.9 degrees.
What I can now do is take tmax and subtract my normalize_mean from it. So I'll say t_diff equals tmax minus normalize_mean, giving us the temperature difference for each year. I can then pipe this into my ggplot, changing my y from tmax to t_diff, with geom_line there. And so now what we see is that we basically have the same line plot, right? If I toggle back and forth there's a little bit of movement, and that's mainly because of the size of the values on the y-axis, but the line itself is identical. What's changed is that in this version we have a zero line: the average temperature between 1951 and 1980 is set to zero. And so we might say that for the year 2021, here outside of Ann Arbor, Michigan, the average temperature was about a degree and a half warmer than it was between 1951 and 1980.
We can look at the overall trend by adding a smoothed line. So we go ahead and do geom_smooth, and we get a fitted line through this. There's a little bit of a dip, and actually, look at the years between about 1951 and 1980.
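The year-level steps described above (group_by with summarize, then normalizing against a baseline with the TRUE/FALSE multiplication trick) can be sketched roughly as follows. This is only an illustration under stated assumptions: toy_weather is a hypothetical, synthetic stand-in for local_weather, and the column names normalize_range, normalize_mean, and t_diff are written with underscores as a guess at how the spoken names appear in the script.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for local_weather: synthetic daily highs, not real station data
set.seed(1)
toy_weather <- tibble(
  date = seq(as.Date("1939-10-01"), as.Date("1990-12-31"), by = "day"),
  tmax = rnorm(length(date), mean = 15, sd = 8)
)

# Derive the current year instead of hard-coding it
this_year <- year(today())

annual <- toy_weather %>%
  mutate(year = year(date)) %>%
  # Drop partial years: the first year of the toy record and the current year
  filter(year != 1939 & year != this_year) %>%
  group_by(year) %>%
  summarize(tmax = mean(tmax)) %>%
  mutate(
    normalize_range = year >= 1951 & year <= 1980,
    # TRUE counts as 1 and FALSE as 0, so this is the mean of tmax
    # over the baseline years only
    normalize_mean = sum(tmax * normalize_range) / sum(normalize_range),
    t_diff = tmax - normalize_mean
  )
```

By construction, the t_diff values for the baseline years average to zero, which is what puts the zero line through 1951 to 1980 in the plot.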
It does appear to be fairly flat in there, but it seems to get warmer over the first half of the last century and then even warmer over the second half, going on into 2020. Again, what I wanted to emphasize here was how we can use group_by and summarize in this context to look at the average for each year. We can take our data and break it down into separate blocks, separate groups. Within each group, we can then calculate a statistic like the mean, a total, or a count. And then we can feed that into the rest of our pipeline to make a plot like this. We've seen in previous episodes how you could go about cleaning up that figure. I'm not going to do that today because I want to move on and show you another way that we can use group_by, without summarize.
For the second figure that I want to generate for today's episode, we're going to use group_by in a different context. Yep, we're going to again use group_by and summarize, but then we'll do another group_by so that we can look at the temperature anomaly over the course of a year. I'm going to come back to the beginning of this pipeline and steal a few lines. I'll bring those down, and so now we've got local_weather, our select, and our year. I'm actually going to leave this year's data, the 2022 data, in the data set. If I look at what we've got, we again have the date, the tmax, and the year. I'm also going to generate the month, because I want to look at each month within a year. So we'll do month equals the month function on the date. I forgot that I have a pipe down here, so I'll rerun that, and now we see that January is month 1 and this is 1892. We're in good shape.
All right, so now what I want to do is group my data by year and month to get the average temperature within each month of each year. So how are we going to do that?
Yeah, we're going to do group_by and summarize. So again we'll do group_by, and now we're going to use two variables to group our data: year and month. If we look at that output, we see that like before we have this Groups line in the output, but now instead of just year we have year and month, and we see that there are 1,558 combinations. That would be about 130 times 12, give or take, because we also have seven or eight extra months for this year. Good.
So our data is grouped by year and month, and now for each year and month combination I want to get the average temperature. To do that we'll do summarize, and I can do tmax equals mean on tmax. This gives me the year, the month, and the tmax, the average temperature for that month. I also get a message that summarize has grouped the output by year. The way group_by works when you combine it with summarize is that summarize removes the rightmost grouping variable. So our data is no longer grouped by month, but it's still grouped by year. That's the default behavior. I personally prefer to strip off all of the grouping after doing summarize. You can do that with .groups equals "drop". And so now in the output we no longer see a Groups line, and our data is totally ungrouped.
Now let's pipe this into ggplot. So ggplot, aes with x equals month and y equals tmax, and then our grouping for the lines will be by year and our color will also be by year. Then we'll add geom_line to make a line plot. And so now we can see over the course of a year that it's much warmer in, say, July than in every other month. I can say that with assurance because it's about 90 degrees outside here, which is much different than it was back in January or February, right?
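As a rough sketch of the grouping behavior just described, the following compares summarize with and without .groups = "drop" after grouping by two variables. The toy_weather data frame is a hypothetical, synthetic stand-in for local_weather (a smooth seasonal cycle, not real observations).

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for local_weather: three years of a synthetic seasonal cycle
toy_weather <- tibble(
  date = seq(as.Date("2000-01-01"), as.Date("2002-12-31"), by = "day"),
  tmax = 15 - 12 * cos(2 * pi * yday(date) / 365)
)

# Average tmax per year and month, with all grouping stripped afterwards
monthly <- toy_weather %>%
  mutate(year = year(date), month = month(date)) %>%
  group_by(year, month) %>%
  summarize(tmax = mean(tmax), .groups = "drop")

# Default behavior: summarize peels off only the last grouping variable (month),
# so the result is still grouped by year
still_grouped <- toy_weather %>%
  mutate(year = year(date), month = month(date)) %>%
  group_by(year, month) %>%
  summarize(tmax = mean(tmax))

group_vars(monthly)        # character(0)
group_vars(still_grouped)  # "year"
```

Dropping the groups explicitly is a matter of preference, as the episode says; the point is simply to know which grouping, if any, is still attached before the next step in the pipeline.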
So what I'd like to do now is again normalize, but this time for each month, again using 1951 to 1980. What I'd expect then is a set of roughly horizontal lines around zero, so that the average of those lines within the baseline period would be zero. It's going to be a lot like what we did up above where we normalized by year, but now we want to normalize each month separately. This is where we're going to use a group_by without summarize.
I'll go back up here, after my summarize, and do a group_by on month, because I want the average within each month across that year range. Then, instead of summarize, I'm going to use mutate, and I'll say normalized_range equals year greater than or equal to 1951 and year less than or equal to 1980. And so this gives us our FALSE values for these early years. I didn't have to do this normalized_range step after the group_by. I could have done it before the group_by, but because I want to have just one mutate statement, I did it after. There's no problem with having multiple mutates. I just want to keep things as compact and simple as possible.
Now I'm going to do normalized_temp. Here again we're going to use the trick that we used up above: I take the sum of tmax times normalized_range, divided by the sum of normalized_range. And so now I get my normalized_temp within each month. I can double-check this by sending it to a filter. If I do month equals equals 1, I get all of the January data, and I see that the normalized_temp is the same for every January. And if you look back at the earlier output, we see that each month has a different normalization factor. And so, you know, we could look at that, right?
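The group_by followed by mutate pattern described above might look something like this sketch. The monthly data frame here is hypothetical (a made-up seasonal cycle plus a small yearly drift), and the column names normalized_range, normalized_temp, and t_diff are assumed spellings of the spoken names. The t_diff and ungroup steps are included so the example stands on its own.

```r
library(dplyr)

# Hypothetical monthly means: a synthetic seasonal cycle plus a small yearly drift
monthly <- tibble(
  year = rep(1945:1985, each = 12),
  month = rep(1:12, times = 41)
) %>%
  mutate(tmax = 15 - 12 * cos(2 * pi * month / 12) + (year - 1965) * 0.02)

normalized <- monthly %>%
  group_by(month) %>%
  mutate(
    normalized_range = year >= 1951 & year <= 1980,
    # Computed within each month: the baseline-period mean for that month only
    normalized_temp = sum(tmax * normalized_range) / sum(normalized_range),
    t_diff = tmax - normalized_temp
  ) %>%
  # mutate keeps the grouping, so remove it explicitly
  ungroup()
```

Unlike summarize, mutate returns one row per input row, so every year and month combination keeps its own t_diff while sharing the normalized_temp of its month.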
We could do something like ggplot, aes with x equals month and y equals normalized_temp, and geom_line. So yeah, this is the average temperature for each month between the years 1951 and 1980. That's not what I want, though. I want to subtract the normalized_temp from the observed average monthly tmax. So now what I'll do is t_diff equals tmax minus normalized_temp, and let's see what we get. Now we see that we have the year, month, tmax, normalized_range, normalized_temp, and t_diff, and that's a t_diff for each month and year.
I also see that my data is still grouped by month. Because I don't have a summarize to remove the grouping variable for me, I can add the ungroup function to get rid of it. So I do ungroup, and now when I look at the output I no longer see that Groups line. Cool.
Now we can go ahead and, as we've seen before, pipe this to ggplot. On my x-axis I'm going to put the month, for the y I'm going to use t_diff, and then we're going to group by year, color by year, and make line plots. Again, what I expect is a bunch of lines that are more or less horizontal. They shouldn't have any seasonal trend to them, so they should be flat with an average right around zero. And yeah, we get a whole bunch of more or less horizontal lines where the average is right about zero. What this view allows us to see more easily is the temperature anomalies, because every month is basically normalized to the same average.
It's much easier to see, wow, this one year had a really low average temperature in October, right? It was about 16 or 17 degrees cooler than it normally would be in October. So what we might do is plug into this pipeline and do something like filter month equals equals 10, and then slice_min on t_diff, and let's return five values. This returns the five years that had the coldest Octobers. What's very interesting to me is that the normalized_temp for the month of October was 17 degrees, the temperature difference was 17 degrees cooler, and the tmax was zero. The average tmax for that month was zero degrees Celsius. That seems a little bit odd to me.
So I'm going to come back up here and look at the output of these first three lines. Again, this gives me the date, the tmax, the year, and the month. I want to filter on year equals equals 1950 and month equals equals 10. You can use a comma between these; I generally like to put the ampersand in there. And what I see is that for the first 10 days the maximum temperature was zero.
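A small sketch of the two checks described above, finding the coldest Octobers with filter and slice_min, and filtering on two conditions with either a comma or an ampersand. The anomalies data frame and its values are invented purely for illustration.

```r
library(dplyr)

# Hypothetical anomalies for September (9) and October (10) of eight years
anomalies <- tibble(
  year = rep(2001:2008, each = 2),
  month = rep(c(9, 10), times = 8),
  t_diff = c(0.5, -1.2, 0.1, 0.8, -0.3, -4.0, 1.1, 0.2,
             0.0, -2.5, 0.4, 1.5, -0.6, -0.9, 0.9, -3.1)
)

# The five Octobers with the lowest t_diff, coldest first
coldest_octobers <- anomalies %>%
  filter(month == 10) %>%
  slice_min(t_diff, n = 5)

coldest_octobers$year  # 2003 2008 2005 2001 2007

# A comma and & give the same rows when every condition must hold
with_comma <- anomalies %>% filter(year == 2003, month == 10)
with_and <- anomalies %>% filter(year == 2003 & month == 10)
```

When a result like this looks suspicious, stepping back to the daily rows for that year and month, as the episode does, is the quickest way to see whether the summary is hiding placeholder values.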
Let's go ahead and look at all of them. So we'll do print with n equals Inf, and what I see is that for every day of October 1950 the maximum temperature was zero, which seems weird, right? I'm not buying that. And that makes me remember that if I go back to local_weather.R, when I did my pivot_wider I filled all NA values with zero. I think that may have been a mistake, so I'm going to remove that and assume that if it's an NA it's not a zero. I'm worried that there were tmax values that were actually NAs that I'm now changing to zeros, which is not what I want: a missing temperature observation doesn't mean zero degrees. When I went through this originally, I thought that an NA for precipitation would be zero, but the more I think about it, I think it's probably best to leave it as an NA.
So again, what I can do is save local_weather.R, come back and re-source it, and then rerun all these commands, and we'll see what that does to our October data. When I ran the first plot, it says "Removed 130 rows containing non-finite values". That's because there are NA values in there, so I need to go in here after my select and do drop_na on tmax. Let's try this first plot again. Now that message goes away. Good.
And now here in the second plot, where we're looking across the year by month, I can again do drop_na on tmax, and now I see that there's actually no October 1950 data. So we'll go ahead and remove all this filter stuff. I've still got this filter in here looking at the coldest Octobers, and I see that 1950 is gone. So I'll put this back to the ggplot and let's see how that changed our figure. We see that the 1925 and 1917 Octobers were quite a bit colder than all other Octobers, but certainly not minus 17. Again, we did our best to identify anomalous data in the last
episode, but as we go along we might still discover things that look off. We always have to approach our data with a little bit of skepticism to make sure that what we think is going on is actually going on. All right, cool.
So we have each line colored by year. I don't find this super helpful, because it's just a big mess without any separation. What I'd like to do to close out this episode is color in the line for the current year, 2022. Again, we had a variable up here, this_year, which is 2022. So I'm going to go into my mutate statement and make another variable that I'll call is_this_year, and that is going to equal year equals equals this_year. If I run all this, from the end of that mutate statement back up to line 21, I now see I've got this extra column, is_this_year. And of course if I pipe this to tail I see that, yes, these months in 2022 are part of this year. Good.
Now I want to remove that tail. I'm still going to group by year, but I'm going to color by is_this_year, and I can now see that I've got this cloud of salmon lines with this greenish teal line for the year 2022. I haven't done a whole lot of primping of my plots, but let me do that here just a little bit to show you how I would change the colors. I can do scale_color_manual, and I will do breaks equals FALSE and TRUE, and then for my values I will make FALSE light gray and TRUE, the year 2022, dodger blue. Let's also do theme_classic. And while I'm down here, let's turn off the legend, so we can do guide equals "none".
And so now we see our plot where we have 130 years' worth of data in gray, each year being a different line, normalized by month to the years 1951 to 1980, with this year's data shown in blue. So on the whole, my takeaway from this plot would be that so far in the year 2022
the temperatures in southeastern Michigan are about what they were between the years 1951 and 1980. Of course, you could come back to this normalized_range and change it to any range of years that you want, and I'd encourage you to try that out. Maybe use a more recent set of years to see how this year compares to, say, the data from the past 20 years. Again, I encourage you to play around with these plots and play around with the data. I've been really excited to see the comments you all have given me about your own explorations of the data from your local weather stations.
I hope you enjoyed seeing different ways that we can use group_by, both with and without a summarize function to follow it. We also saw in the second approach, where I did group_by and then mutate, that I could use ungroup to remove those grouping variables. Again, keep practicing with this, and I'll see you next time for another episode of Code Club.
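For reference, the current-year highlighting described in this episode can be sketched along these lines. The anomalies data frame is hypothetical random data, this_year is fixed at 2022 here only so the example is reproducible (the episode derives it from today's date), and the color names "lightgray" and "dodgerblue" are assumed spellings of the spoken "light gray" and "dodger blue".

```r
library(dplyr)
library(ggplot2)

# Fixed for reproducibility; the episode computes this from today's date
this_year <- 2022

# Hypothetical monthly anomalies for five years
set.seed(2)
anomalies <- tibble(
  year = rep(2018:2022, each = 12),
  month = rep(1:12, times = 5)
) %>%
  mutate(
    t_diff = rnorm(n()),
    is_this_year = year == this_year
  )

# One line per year, but only two colors: gray for past years, blue for this year
p <- anomalies %>%
  ggplot(aes(x = month, y = t_diff, group = year, color = is_this_year)) +
  geom_line() +
  scale_color_manual(
    breaks = c(FALSE, TRUE),
    values = c("lightgray", "dodgerblue"),
    guide = "none"
  ) +
  theme_classic()
```

Printing p draws the figure. Keeping group mapped to year while color is mapped to is_this_year is what lets each year remain its own line even though the palette only has two entries.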