Hey folks, in this episode I'm going to show you something cool that we can do with a factor variable when it's fed into count or group_by. A few episodes ago, I made this figure showing the amount of precipitation by month across all 130 years' worth of data that I have from a local NOAA weather station. One of the things I found as I was building out this figure was that if I had a month with no snow data, perhaps because it didn't snow, then the lines would get truncated for some of the years. Initially, these were recorded as NAs, and then we dropped all the NAs. I'm going to assume that those NA values really should have been zeros. So if I see an NA in July, I think that's a zero.

Anyway, what happened then was that we had year and month combinations where there was no data. That caused the problem where the lines for some of our years maybe started in October and didn't reach back to July, August, or September. And I think we had some others at the other end of the snow year. So the question was, how do we get the rest of the line? How do we basically add in zero values?

What I used was a bit of a kludge: I made a dummy data frame that had all years, all months, and values of zero. Then, if the data was missing in the real data frame, I brought in the zero from the dummy data frame to basically impute zero values. That just seemed a little bit kludgy, right?

So what I want to do in today's episode is take another approach, where I turn the year and the month into factors and then have R preserve those factor levels in the output. It may not find them in the data, so it will report a zero. But that also means I don't have to worry about doing these weird joins with dummy variables. So that's what we're going to do in today's episode.
If you want to work with the code that I'm working with today, go down below into the description. There's a link to a blog post that will get you all the instructions and information you need to get caught up with me. I'm going to be working out of snow_seasons.R, which is in the code directory. This script also sources code/local_weather.R. You can put in your own longitude and latitude to get weather data from where you live, so that it's much more relevant to you.

I'm going to go ahead and run this script so that we can see the output file. That script gives us this figure, and this is what things should look like. So let me walk you through the code to explain how we built that figure.

Again, as I said, by sourcing code/local_weather.R we create this variable, local_weather, that has the date, the tmax, the prcp, and the snow. I then did a select on date and snow to get those two columns. I then did a drop_na, which removed all of the rows where we had NA values. Now, what I could have done here instead of drop_na is use a mutate on snow with if_else: if snow was NA, I could turn that into a zero. That would have worked, and it would have solved a lot of problems as well. But we don't always have data in that configuration.

So again, we dropped the NAs. We then calculated the calendar year using the year function and the month using the month function, both of those coming from the lubridate package. I then made the snow year so that if the date was before July 1st, it belonged to the previous snow year. So here I am in August of 2022; that will be the beginning of the 2022 snow year. But if I was back in May of 2022, that would be the 2021 snow year. That's what this logic is doing here. And so then we see that snow_data has the month, the snow year, and the snow.
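The snow year logic and the if_else alternative to drop_na described above can be sketched roughly as follows. This is a minimal sketch, not the script from the episode: the toy_weather tibble and its values are invented, the cal_year column name is an assumption, and the NA-to-zero step is the alternative mentioned in passing rather than what the original script does.

```r
library(dplyr)
library(lubridate)

# Hypothetical stand-in for local_weather; dates and values are invented
toy_weather <- tibble(
  date = as.Date(c("2021-11-15", "2022-01-10", "2022-05-02",
                   "2022-07-20", "2022-08-03")),
  snow = c(25, NA, 0, NA, 0)
)

snow_data <- toy_weather %>%
  mutate(
    # alternative to drop_na(): assume a missing snow value means no snow
    snow = if_else(is.na(snow), 0, snow),
    cal_year = year(date),
    month = month(date),
    # dates before July 1st belong to the previous snow year
    snow_year = if_else(month < 7, cal_year - 1, cal_year)
  )

snow_data
```

With these invented dates, the January and May 2022 rows land in the 2021 snow year, while the July and August 2022 rows start the 2022 snow year.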
We of course also see that there are multiple rows for the same month and year. November 1892 has four observations because there were four snow events recorded in November of 1892. I then went about building a plot to summarize the total snow by year, which gave us this nice line plot. We can kind of see that it seems somewhat flat up to about 1965, and then it goes up. I then looked at the number of snow events over the course of the year, filtering out those rows where snow was zero, then counted by snow year and plotted the count. So what we see here is the number of days with snow in each year over the past 130 years.

All right, so now we get to the good stuff. I created this dummy_df data frame that had all combinations of years and months, and then I added a zero column, a dummy column. If I look at dummy_df, again we have all years and months as well as that dummy column. I then used a join to bring snow_data and dummy_df together. That creates this four-column data frame with the month, the snow year, the amount of snow, and a column called dummy. Because I did a right join rather than an inner join, it preserves all of the rows, the months and snow years, from the data frame on the right, which was dummy_df. So if there's a snow year and month in dummy_df that's not found in snow_data, then the snow value there will be NA. What we do in line 44, then, is if we see an NA in snow, we change that to zero; if we don't see an NA, we keep the value that had already been there.

So I'm going to go ahead and comment out these two lines, and we can see what this would look like. We can see now that this line, I forget which year this is, starts in November but doesn't have any preceding data. There's also an example over here that ends in May but doesn't have any following data for, say, June and July.
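For reference, the dummy data frame approach described above might look something like this. It is only a sketch under assumptions: the snow_totals values are invented, and building the grid with tidyr's expand_grid is one way to get all year and month combinations, not necessarily how the original script does it.

```r
library(dplyr)
library(tidyr)

# Invented monthly totals: several year/month combinations have no row at all
snow_totals <- tibble(
  snow_year = c(2020, 2020, 2021),
  month     = c(10, 11, 9),
  snow      = c(12, 30, 5)
)

# Every combination of year and month, with a dummy column of zeros
dummy_df <- expand_grid(snow_year = c(2020, 2021), month = c(9, 10, 11)) %>%
  mutate(dummy = 0)

filled <- snow_totals %>%
  # right_join keeps every row of dummy_df, so missing combinations get NA snow
  right_join(dummy_df, by = c("snow_year", "month")) %>%
  # replace the NAs introduced by the join with zero
  mutate(snow = if_else(is.na(snow), 0, snow)) %>%
  select(-dummy) %>%
  arrange(snow_year, month)

filled
```

The result has one row for each of the six year and month combinations, with zeros filled in where the invented data had no observation.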
Okay, I want to get back to having the complete data going out, but without doing this right join with dummy_df. I'll also go ahead and turn this off. It turns out that the solution is rather simple if we're using factors. Let me create another R script here so I can demonstrate what I mean.

Let's create a variable that I'll call x, and I'm going to use a function called sample. sample allows you to generate a random sample of values. I'm going to do 1:4, I'm going to get, say, 100 values, and then I'm going to do replace = TRUE. This will give me 100 values of one through four. So if I look at x, I see a bunch of ones, twos, threes, and fours. I can actually turn this into a tibble, so I'll do x = all that and close out my parentheses, and now I have this column. The nice thing is that I can go ahead and pipe this into count on x, and now I see the frequencies of the values one, two, three, and four.

Well, let's say that I actually had five things that I was trying to count here, one through five. How would we deal with that? I don't see five in my output here. What I could do would be a mutate on x to make it a factor. So we do factor on x, which turns x into a factor, and we give it levels = 1:5. If we look at this, it doesn't really change the output any, except that instead of this being a double, it's now a factor. If I then pipe this into count, I still don't see any difference; I still see one through four. But what I can do here in count is add .drop = FALSE. That now gives me the fifth value of x, and we see that its count is zero. So count can only count what is there. If it got a vector of ones, twos, threes, and fours, it doesn't know that there should also be a five or a six or whatever. Or perhaps it doesn't know that there should be a, you know,
November 11th of 1938, right? So by making x a factor, by telling it what observations should actually be there, and then using .drop with the count function, we preserve that.

This .drop argument also works nicely with the group_by function, so let me go ahead and grab these two lines to illustrate. count is really the same as group_by and summarize where you're using the n function. So we can do group_by on x, and now we see that we're grouped by x and there are four different values there. Then we do summarize with n equal to the n function. That n function counts the number of things in each of our groups, and we now see we have one, two, three, and four. Of course, we know that x is a factor with five levels, so how do we get that fifth row? Just like we saw up here, in group_by we can do .drop = FALSE, and that says: don't drop those categories where we're missing data, keep them in there. And so now we see that we get that five with the zero. Okay, very cool.

So this is grouping or counting based on one variable, x. That would be like us doing it by year: do we have years with missing data? But we want to do months and years, because we probably have some months, like in that example I showed you, say September of that year, where we didn't have any data. We need to make sure that September of whatever year is represented by a zero in the data frame.

To illustrate that, I'm going to go ahead and create another tibble. Again we're going to use these silly values. We'll do y = sample, and let's sample from LETTERS[1:4], get 100 of those, and do replace = TRUE. All right, let's see what that looks like. You've got one column x with one, two, three, four, and one column y with A, B, C, D. This LETTERS vector is very handy; it's all of the letters in the alphabet in uppercase. Very cool. So again, like we saw up here, we can
grab this mutate statement to make x a factor, and we're going to pretend that we've got five levels. Let's also make y a factor, where we say y is a factor with levels, and I'm going to go ahead and do A, B, C, D, and let's throw in Z for fun. Again, we now see that both x and y are factors. But I want to count those. So we could do count on x and y, and this gives me all of the observed combinations of one, two, three, and four with A, B, C, and D, but we don't have the missing values represented. Again, what we can add is .drop = FALSE. Now what we see is that we've got 25 different combinations, five by five. We see zeros where we had Z, and we would also have zeros for 5 A, 5 B, 5 C, 5 D, and 5 Z. Cool. And again, we can do the same type of thing with group_by and summarize. We come down and do group_by on x and y with .drop = FALSE, then summarize with n equal to the n function, and we get the same output.

Okay, so let's go back to our snow seasons data, where again we have snow_data, and we want to make snow_year and month into their own factors. Like we saw with that sample data, we can do a mutate on snow_year and say that it is a factor of snow_year with levels = 1892:2021. Then for month, I actually already had a factor statement down here, where I reordered the months of the calendar year to follow the snow year, going from August through December and then January through July. Let me make sure I've got all my right parentheses there. We now see that month and snow_year are factors.

This reminds me, actually, that I redid the year to start at August, whereas way back up here I had July 1st when defining the snow year. So let me go ahead and change that. It doesn't change the output any, but it makes everything a little bit more consistent. So one other thing that I
need to add, of course, to my group_by is .drop = FALSE, so that we preserve those groups that are missing, those month and year combinations that are missing. We'll go ahead and run this, and we should see the tails of those lines. We now have the complete data where this line was getting truncated prior to November, and this line, I think for May, that was getting truncated as well now goes all the way out. So that's great.

One other thing I want to check is whether we had any years that were missing data. I don't think we did, because if we had years with missing data we would have a flat line across the baseline here. But let's go back up to our code where we were doing these different counts. Again we had a group_by on snow_year. And you know what, I'm going to go ahead and move this mutate statement all the way back up to where I'm defining snow_data, because that's a pretty fundamental thing that needs to be included in snow_data for all of the subsequent analyses. And so now I can do snow_data, group_by on snow_year, with .drop = FALSE.

This gives us a message that each group consists of only one observation, asking whether we need to adjust the group aesthetic. Looking at the plot, I'm reminded that because we made each year its own category, made it a factor, we're basically plotting discrete data across the x-axis rather than continuous data. Something I'd be tempted to do would be to mutate snow_year to be a double, so we could do as.numeric on snow_year and pipe that in. That works, but we now see that our snow_year isn't the actual year; it's the index value, the position of that year among the levels, basically. Factors are a funky beast: a factor is basically a numerical vector that has names attached to it, and those names are called levels. So what we could do instead would be to do levels on snow_year and make those
a numerical value. Now what we get is our snow year across the x-axis. If I had only run to that point in the pipeline, we'd see that we get those years. If I had removed that as.numeric and run to that point in the pipeline, I would find that snow_year is a character. So I do need to convert it to a numerical value to get the continuous x-axis, so that I can then connect all those points. Looking at the plot, what we find, of course, is that we don't have any years where we had zero millimeters of precipitation.

If we look at this next pipeline that I made for counting the number of snow events, again we have that same problem where snow_year is a factor. So here in count we need to do .drop = FALSE to retain all of the snow years, and then we need to mutate snow_year to be as.numeric of levels on snow_year. Again, if we look at those first lines of the pipeline, we see that snow_year is now a double rather than a factor. If I add a pipe to the end of that, we get our line plot, and again that year in the early 1960s had a very small number of snow events, as you'd expect, because there wasn't a whole lot of snow that year.

Finally, I can go ahead and delete this dummy_df code that just kind of gave me a bad feeling. As we've gone through this, we've now seen a couple of different ways to deal with those zero values, so that, you know, July of a given year has a zero because there was no snow in July. Again, what's important to remember is that with group_by and count we can use the argument .drop = FALSE on a column that is a factor. If we have a factor like our years or our months, we can define all of the values that should be in that factor, and then we can say .drop = FALSE in group_by or count. Then our summarize output will include those categories, those variable values, even if the count is zero because they weren't in the observed data. Not a
big change to the plot of the data, but I feel better about the underlying code, and I would encourage you to play around with using .drop = FALSE in your own code. .drop = TRUE, of course, is the default. Let me know if you have an application for this cool argument of count and group_by. I would love to see where you can use it in your own data, so let me know down below in the comments. Practice with this, and share what we're doing here with your friends so that they can become better R programmers themselves. I know I'm getting a lot better at this myself. So keep practicing, and we'll see you next time for another episode of Code Club.
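As a recap, here is a minimal sketch of the toy demonstration from this episode: counting two factor columns with .drop = FALSE, the equivalent group_by and summarize, and converting a factor of years back to numbers. The seed, the object names (toy, counts, grouped, years), and the two-year example vector are assumptions added for illustration and are not from the episode's script.

```r
library(dplyr)

set.seed(2022)  # arbitrary seed so the sketch is reproducible

# x only contains 1-4 and y only contains A-D, but the factors declare
# a fifth level each (5 and Z) that never occurs in the data
toy <- tibble(
  x = sample(1:4, 100, replace = TRUE),
  y = sample(LETTERS[1:4], 100, replace = TRUE)
) %>%
  mutate(
    x = factor(x, levels = 1:5),
    y = factor(y, levels = c("A", "B", "C", "D", "Z"))
  )

# .drop = FALSE keeps every combination of levels, including the empty ones
counts <- toy %>% count(x, y, .drop = FALSE)

# the same thing spelled out with group_by() and summarize()
grouped <- toy %>%
  group_by(x, y, .drop = FALSE) %>%
  summarize(n = n(), .groups = "drop")

nrow(counts)  # 25 combinations: 5 levels of x times 5 levels of y

# Converting a factor of years back to numbers
years <- factor(c(1893, 1895), levels = 1892:1895)
as.numeric(years)                 # 2 4: positions among the levels, not years
as.numeric(as.character(years))   # 1893 1895: the actual years
as.numeric(levels(years))[years]  # 1893 1895: same result via the levels
```

One thing to keep in mind: as.numeric on the levels of a factor returns one value per level, so it lines up with the rows only when the data frame has exactly one row per level in level order, as it does right after a count or summarize on that single factor with .drop = FALSE. Going through as.character, or indexing the levels by the factor as in the last line, should work row by row in other situations.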