 Hey folks, a few weeks ago I made an episode trying to quantify the presence of drought near where I live here in southeastern Michigan. I used data that was collected at a NOAA weather station outside of Ann Arbor just about 10 miles east of here. As I mentioned in that video, there are a number of ways of quantifying and characterizing drought. The approach that I used was to use a sliding window and to calculate the total precipitation within that sliding window. To generate the sliding window, we used tools from a new cool package called Slider. Well, in the comments to that episode, a really nice comment came up from Benjamin Boulet, hope I'm pronouncing that correctly, basically saying they were interested in kind of the length of time, the number of days between any two precipitation events. And they were curious if that distribution of lengths by year has varied over the time that we have recorded data for. So what we're going to do in this episode is we are going to use the lag or lead functions to look at drought the way that Benjamin is asking. Jumping over to our studio, you will see that I have a fresh R script with at the top a source to a code local weather dot r script. This local weather dot r, if you've watched any of my episodes, you're hopefully familiar with at this point. This script allows you to insert your own latitude and longitude to get the local weather for where you live. Again, what I'm getting is for a NOAA based weather station outside of Ann Arbor. I'll go ahead and run this. This will also load the tidy verse meta package that gives us things like Deplier and whatnot. The data frame that's generated is called local weather. And if we look at local weather, we get the date, the T max for each day, the amount of precipitation and the amount of snow for that day. So I'm only going to be interested in the date and the PRCP column. I'm also going to assume that NA values in PRCP is the same as zero. That might not be totally right. But I'm more interested in the process than the actual answer of quantifying drought this way. And so just to make things simple, I'm going to change those NA's to zeros. What my process then is going to be is I'm going to take this data frame with 47,521 rows, and I'm going to remove any row that has a precipitation value of zero. Or we could also think of this as a precipitation value below some threshold that I'm going to count as like a legit amount of precipitation. That then will give me a filtered down data frame. With that filtered down data frame, I can then use either the lag or the lead function on the date column to get a column for today's date or the date as we're looking back for the length of that window, as well as the lagged date. And so then we can calculate the difference between the lag date and the current date for each precipitation event. And then if we then pull out the year from the date, we can then begin to do all sorts of different aggregations by year to look at things like perhaps the median, the range, the max, all sorts of different statistics that we can then feed into GG plot two to make a nice attractive figure to tell an interesting story. First things first, let's go ahead and take local weather. And I'm going to go ahead and select date and PRCP, giving us those two columns. The next thing I want to do is replace those NA's with zeros. In previous episodes, I've done a mutate with if else, so that if PRCP was NA, then we replaced it with zero, otherwise we left it as PRCP. And so there's a replace underscore NA function that I want to show you how to use briefly here. So we'll do replace NA, and then we'll give replace NA a list function where we will then say PRCP equals zero. And that will replace all of our NA values in the PRCP function with zeros. If I had left out this list function to turn this into a list, it gives me an error, right? So again, the argument to replace NA needs to be a list. Alternatively, we could also do it with the mutate function. So we could do mutate PRCP equals replace NA on PRCP. And we're going to then give that a zero, right? So again, we're going to mutate the PRCP column using the function replace NA, where the vector we're going to change modifies PRCP. And if it's an A, then it's going to turn to a zero. And again, we get the same result that we had before. Which you use doesn't really matter. I'll go ahead and comment out the second one. And then that will be in the pipeline in case you want to use that second version instead. So the next thing I want to do is go ahead and filter out my data for where the PRCP value for a certain day was at or below a certain threshold, right? And so I can do that with the filter function. And we can then say PRCP greater than a threshold. And the threshold by default will probably be zero, right? So if we do threshold equals zero, and then we run this, we now see that we don't get anything with zeros for precipitation. Whereas if I say set the threshold at five, I now see that I no longer had that 4.1 that showed up on October 7 of 1891, right? We see that that's missing, right? So I can very easily modify the whole script by changing my threshold. Again, the PRCP values are in millimeters. So that's going to be zero millimeters or five millimeters or whatever thresholds you put in there. So let's go ahead and leave that at zero, because that's a fairly strict definition of precipitation, right? I don't know that like 1.3 millimeters of precipitation is really anything to get too excited about in terms of breaking a drought. But this is kind of how the original poster asked the question. So let's leave that in there at zero, but know that you can begin to play with this by changing the threshold. Now what we want to do is go ahead and generate a lagged date. So I always get lag and lead confused. And so sometimes just the easiest way for me to do it is to do both and see what happens. So we'll go ahead and do a mutate. And we will then say lagged date. And that's going to then be the lag function on the date column. I now see that I've got my date and my lagged date, right? And so basically what's happening then is that we see that the current date would be the first column date, right? And then the date that it's going back to would be the lag date, right? So that's the lag. Whereas if I did lead on date, then what I find is that it's kind of moving forward in time, right? And so we have October 4th as the current date and the previous date. Maybe it'd be easier if I call this previous date, right? So pre-date, right? Then I've got my previous date of 10.7, but that's further into the future than today. So what I want is a lag. I'm doing a one-day lag, but you could also do like n equals five. And this will then lag everything by five days, or it really rose is what it's lagging it by. Lag doesn't know that this is time data necessarily. Of course, what it's really looking at is rows in the data frame. So it's really lagging the vectors by five units. It's lagging those columns by five units. But of course, I want a one-day lag. And so we'll leave that in there. I now have this NA. So I'm going to go ahead and drop out that NA. So I'll do drop NA. So that drop NA then will remove that first NA. I'm wondering if it also puts an NA at the very end. So if I did tail on this, I see that it goes to August 29th back to August 26th. And so no, it doesn't add a NA value to the column being lagged. So it doesn't add an NA to the date column. So now I can go ahead and add a column to say something about that length of time. And so I can then say mutate drought length equals, and that'll be date minus minus pre date. And so then this gives me a column that is drought length. And so these in terms of days. And so what I'd rather have this be instead of a day, like this, this is DRTN. So this is coming from the Luber date package, would be to say 371. I don't need that extra ornamentation of saying the day. So maybe what I'll do is I'll put as.numeric around the date minus the pre date. And so now what I get is a double. So I get the actual number of days. Cool. The other thing I want to get is the year. And so I can then say year equals the year function on date. So the year function comes to us from the Luber date package, which was loaded with the local weather dot R script. And so now what we see is that we've got the drought length and the year. And these are the two variables that I'm really interested in, right? So I'm going to go ahead and take this pipeline. I'll save it as drought by year. And this then again gives me that drought by year where I've got drought length and the year. Maybe to clean it up, I'll go ahead and do a select on year and drought or I'll do length equals drought length. Right. And so now if I've got drought by year, I've got these two columns. Cool. So I can begin to kind of start thinking about how I want to analyze and visualize this data. So if I take drought by year and let's do a filter year, let's just pick a great year like 1976. It gives us 1976. And then we can feed this into ggplot aes x, we're going to put across x the length. And then we'll do geom histogram. And I'm going to go ahead and save this to gg save for figures drought lengths.png. I'll do width equals six height equals four. So looking at this histogram, we see we've got the four three to one, there's no zero data. And that would be because we wouldn't have the same date in both the date column and the preve column. Also, it doesn't totally make sense to have a one day drought, because that would mean that it rained today and it also rained yesterday. That's not really drought. Right. And so what we could think about doing would be to stay define droughts as, you know, days or rain events that are more than one day apart from each other. Let's leave that in here. But I think it's interesting again to look at this data. Clearly, it's not normally distributed. We could add in the mean or the median. I'm going to leave them both in. We also see that we've got 16 days as the longest period of drought in here as well for 1976. Something else we could begin to think about would perhaps be to look at like the 75th percentile. And so kind of what was that length of drought that might be somewhere around here of like seven or six, right. And so what we can do is we could again take drought by year and let's go ahead and do a group by year. And then we'll do a summarize and we'll then say median on the length column mean on the length column. Right. And then we'll do max. Right. Let's look at the upper quartile. So we'll do the upper quartile. We'll do quantile length and then prob of 0.75. And so then this gives us back the median, the mean, the max and the upper quartile. Thinking about this again, I don't know that it really makes sense to include those one day droughts. Right. So I'm going to go ahead and remove that. Right. I'll come back up to the top here. Right. And I'll subtract one from drought length. And so now what we'll get is the length of droughts. Again, defining a drought is being more than a one day lag. And so we'll see that, you know, for at least 1891 say that the median length was two days. The mean was 2.18. The max was eight. The upper quartile was three, four or five. Right. For that first 10 years. Okay. So what I will do is I'm going to now go ahead and look at different ways of plotting these four statistics over time. One other thing I think I'll add to this will be an N because it'll be like, you know, how many droughts were there over the course of each year. So I can then do N equals the N function. Right. Cool. And so we can see 1891. Well, that was a partial year. So I'm actually going to go ahead and remove 1891. I could also remove 2022, the year I'm in currently, because it's a partial year, but I might also want to know like, where are we so far here in 2022? So I think I'll leave 2022, but I'll go ahead and filter out year not equal to 1891. All right. Cool. And so now that's gone. Great. So let's go ahead and feed this into GG plot. And on the x axis, I'm going to put the year, the y, well, let's start putting different things. Actually, you know what we can do? We can go ahead and make a faceted plot, right, where we could have five different rows for each of the five different statistics, and we could have year across the x axis. And that will help us to think about, you know, what depiction of the data do we like best? All right, let's do that. So we'll do pivot longer. And we'll do everything but the year column. And so, of course, that gives us the year column name and value. So I'll go ahead then and type this into GG plot. My y value is going to be value. And then we'll do a geom line. And then we'll add the facet wrap tilde on the name. And I will do n call equals one. So that'll ensure that we have one column and five different rows, I believe. Yep. And then I need to change the aspect ratio of my plot, so that it's taller than it is wide. I'll go ahead and make a width of four and a height of seven. So I've got the same y axis for all of these, which makes it kind of hard to see what's going on. So let's go ahead and to facet wrap will do scales equals free underscore y. I'm really happy that I showed all five of these depictions of the data here in a single figure. I think this really highlights the value for exploratory data analysis of doing something like a facet wrap like this, because I can very easily see all the data together. I'm not seeing a whole lot happening with the max the max length of a drought period. There's some kind of funniness happening in the middle of the last century here, where there were droughts over two months long. And so that's that's interesting. And that's kind of skewing some of these mean values. But what we kind of see is that the mean length of drought across a year is kind of falling with time, right? And we can also kind of see that here with the median, where you know, the median is basically like a one day interval. And then it drops down to zero, right? And so that's, that's not so helpful. There's not a whole lot of information here. There's really only a couple values being depicted on the y axis. The N is interesting because it shows the number of drought events, which you can also think of as the number of rain events varying over the past century or so past 130 years, where it seems kind of flat, kind of steady between 1891 and 1960. And then it increases to about 1980. And then it's kind of found a new plateau, right? Right up here, probably around 170, 160 events per year, right? And then the upper quartile kind of shows what we saw with the mean and also the median, right? Where, again, it there were longer intervals between rain events prior to 1960, 1970, but then that falls off. And so the two data points that I really like, the calculations are the mean and the end. So I think the story is that the average duration of periods between rain events is decreasing with time, right? That it's actually, we're getting more rain events, not fewer, at least here in southeastern Michigan. Again, this would be interesting to try in other regions. So I'm going to focus in on this mean data. And let's see if we can make that look a bit more attractive. So again, what we'll do is I will go ahead and remove this pivot longer, and this facet wrap, and my Y will be mean. And then we'll throw this into Geome line. Of course, I've got the funky aspect ratio here. And so let's go ahead and flip that to make five, four. So I'm going to go ahead and start by cleaning up the axis titles. And so I think what we'll do will then be labs x equals a year, y equals average number of days between rain events. And then I'm also going to clean up the x axis labels to have it go every 20 years rather than every 40 years. So I'll do scale x continuous. And then I'll do breaks equals seek 1880 2020 by 20s. Wonderful. I'm going to go ahead and get my theme classic going on. And we'll add that. And I'm going to add a title to tell a story about what was happening. And I think what we'll also do is maybe cut the y axis label to go across two lines, maybe I'll put a blind break between days and between right there. I'll do that with the backslash n. And then I'll do title. And we'll say the length of drought has been decreasing over the past 130 years in southeastern Michigan. And we can clean up the title a bit by using a theme. And we can then do plot dot title dot position. And we can say plot on that that moves it to the left. And so then we need to go ahead and break up the title. I'm going to make it a little bit larger. And let's do that with let's do plot dot title. And I'm going to say element element text box simple. And I'll do size equals let's do 18 to make it larger. And it's complaining that doesn't know that function, because I need to load the library on gg text. One of the nice things that we see that element text box simple did was in addition to making our font larger, it'll also automatically wrapped the title for us. And so I think that looks pretty nice. A couple of things that I want to change. I'd like to add a little bit of a margin between the title and the plot. And I'd also like to highlight that decreasing. So what I'll do is in here and decreasing, I'll use some HTML. So I'll do a span tag. So I'll do span style equals and then single quotes color. And let's do blue. And then after decreasing, I'll do a back anchor on that. So that got me decreasing in blue. And then to add a margin, I can do that here in element text box simple by doing margin equals margin and then be for the bottom. Let's go ahead and put in 10. That might not be right, but we can adjust later. That looks good. I think one thing that would help tie that blue and decreasing together with the plot is to go ahead and add a fitted line, a smooth line to the plot. So I'm going to come back up here and do geome smooth. And also do se equals false. So they don't have that gray cloud. Yeah. And so that gives us then our blue smooth line through the data. And again, I think that blue of the line ties together with the title to indicate what we're seeing is this decrease in length of drought periods over each year. Right? So the final thing that I'd like to do is perhaps let's go ahead and change our threshold. So what would happen if we increased our threshold to say be five millimeters? So if we did five as our threshold and then reran everything. So we see that with a five millimeter threshold, we still kind of see the same type of result, right? And we see that it's pretty flat between 1890 to 1960. And then after 1960, the number of or the length of periods of drought seems to be decreasing over time, right? And of course, we could do the same thing where we, you know, maybe take this up to 10 millimeters, right? So one centimeter, it's about half an inch. And that looks very similar to what we had with the five. I'm going to go back to zero for my threshold, because I think that's more in line with what Benjamin asked in his comment. Thanks again to Benjamin for asking the question that he did. If you have questions like Benjamin did, by all means down below in the comments, ask away. I would love to revisit some of these questions in a future episode. And I think I've already got one in the queue for the next episode. So let me know what you think about this. Let me know if you have other ways that you would perhaps think about drought. Drought is a really big topic right now here in the world out west in the United States. Lakes are at historic lows. In Europe, rivers are at historic lows as well. And it's just really, really jarring, right? So let me know what you think about this. And I'll see you next time for another episode of Code Club.