Hey, folks! If you've watched any of my previous episodes of Code Club, you might have noticed that a fair amount of editing goes on to clean up the material, make it flow a little more crisply, and cut out the lags while things run in R. The trouble is, that gives an inflated impression of how well I code and how fast things move, right? So what I'd like to do in today's episode is a single cut: I am not going to edit anything in the middle of this video. I'm a little afraid of how this will go. You'll just have to trust me that I haven't done this take like 20 times to get it right all the way through; I don't have that much time. Anyway, in today's episode of Code Club I'm going to build off of the R script that I've been working on in recent episodes. If you want to get this script, as well as everything in my repository, there's a link down below in the description to a blog post with everything you need to get caught up and running. I'm going to fire up a terminal here in Visual Studio Code and launch myself into a conda environment, so I'll do conda activate drought, all right, and then I'll fire up R and run these first couple of commands, the first 15 lines or so. This gives us a lat long PRCP data frame, which has the latitude, longitude, year, and mean precipitation. I see that it's grouped by latitude and longitude, which is probably okay. My goal for today is to calculate the Z-score statistic for each station. The way I've handled the latitude and longitude is that I've rounded all those values to the nearest degree, so basically each weather station falls within a region about 60 to 70 miles across, I think. And for each region I have data going back, in this case, to the year 1958, and I'll assume it goes forward to the year 2022.
So that's actually something I want to make sure of: that I've got data for each of the weather stations that includes 2022, because I want to calculate a Z-score statistic. For the Z-score you need the mean and you need the standard deviation. The statistic is the observation you're interested in, so the 2022 value, minus the mean, divided by the standard deviation. That gives you a value that is basically the number of standard deviations you are away from the mean. And what I'm ultimately going to do is generate a rasterized image of the world that shows, in red and blue heat-map colors, the level of drought that each of these weather stations is experiencing across the globe. So I'm trying to think of how I want to do this. I guess maybe I should have thought of that before I started, but hey, this is real, right? Maybe what I'll do is make sure that I have a year 2022 for each of these latitudes and longitudes. I could pipe this into a filter; I feel like I need a filter, right? But I think the easiest way my mind can think about it is to make a separate data frame that has the 2022 data. I think I'll ultimately need that anyway. So I'll come back to lat long PRCP, do a .groups = "drop" to get rid of those grouping data, and then filter for the current year with year == 2022. This, as we see, gives me the latitude, longitude, year, and the mean precipitation for the past 30 days (I think it was 28 days, or, yeah, 30 days). And again, these data were initially grabbed at the beginning of September, so this isn't current data, but it's good enough, because it takes a fair amount of time to download the data and process it all. So anyway, I'll call this this_year. Good.
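That filtering step might look something like this; the object and column names are my reconstruction of what's on screen, and the tiny data frame here is just a stand-in for the real one so the sketch is runnable:

```r
library(dplyr)

# Toy stand-in for the real lat_long_prcp data frame
# (latitude/longitude rounded to the nearest degree, one row per region-year)
lat_long_prcp <- data.frame(
  latitude  = c(42, 42),
  longitude = c(-84, -84),
  year      = c(2021, 2022),
  mean_prcp = c(5.2, 0.117)
)

# Keep only the 2022 observations for each region
this_year <- lat_long_prcp %>%
  filter(year == 2022)
```

With the real data, this leaves one row per region holding that region's most recent 30-day precipitation value.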
So that stores a data frame called this_year. I can probably remove the year column, so I'll do select(-year). And then if we look at this_year, I see that I've got the latitude, longitude, and mean precipitation. That's all good. So now we'll take lat long PRCP and do an inner join on it with this_year. I'm going to join by two different columns, so I'll join by latitude and longitude, right? And then that should... what did I do wrong? Did I misspell latitude? Oh, I forgot to put it in quotes, so that needs to be in quotes. And the linter complains about missing spaces around operators like the equals sign. Okay, now we've joined things together, and we get latitude, longitude, and year. Because we only have this-year data for certain latitudes and longitudes, we're only getting back data for those weather stations, those regions, where we have 2022 data. The other thing I'm noticing is that I've got mean_prcp.x and mean_prcp.y. That means the mean precipitation column was in both of the data frames: the .x is from lat long PRCP, and the .y is from this_year. So maybe I'll rename those. I'll do rename, and, since I always forget which is which, maybe I'll try all_years = mean_prcp.x. Let's see if that worked. Hey, it worked, what do you know? And then I'll add this_year, with the underscore, = mean_prcp.y. Okay, so now we've got this_year and all_years; we're in good shape. Next, for each latitude and longitude, I need to get the mean and the standard deviation. So I'll do a group_by on latitude and longitude, and then I'll do a... what do I do? I'll do a mutate. Let's start there.
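The join-and-rename step described above could be sketched like this; again, the toy data frames and the exact names are my guesses at what's on screen, not the author's verbatim code:

```r
library(dplyr)

# Toy versions of the two inputs
lat_long_prcp <- data.frame(
  latitude = c(42, 42), longitude = c(-84, -84),
  year = c(2021, 2022), mean_prcp = c(5.2, 0.117)
)
this_year <- data.frame(latitude = 42, longitude = -84, mean_prcp = 0.117)

# Both inputs have a mean_prcp column, so the join suffixes them:
# .x came from lat_long_prcp, .y came from this_year
joined <- inner_join(lat_long_prcp, this_year,
                     by = c("latitude", "longitude")) %>%
  rename(all_years = mean_prcp.x,
         this_year = mean_prcp.y)
```

The inner join also does the filtering for free: regions without a 2022 row simply drop out of the result.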
And we'll do mean = mean(all_years), and then sd = sd(all_years). Right. And here we now see that we've got the mean and the standard deviation as well as this_year. Something else I'd like to have in here is the number of rows, so maybe I'll add an n column and do that with the n() function. This tells me that this weather station had 37 years' worth of data. So I need to think about the minimum number of years I'd like to require. In the last episode I generated a histogram and it was kind of bimodal: there were about 2,000 weather stations that had over 100 years' worth of data. So let's maybe start by requiring at least 20 years' worth of data. To do that, I'll filter on n (not year) greater than or equal to 20. And I've been saying stations, but I'm really looking at regions, because some regions, each a latitude-longitude combination, contain multiple stations. So I'll do that, and that gets me down to 313,000 rows. We didn't really lose much, right? So most regions have more than 20 years' worth of data. That's pretty cool. Actually, let's see what happens if we go up to 100. Then we have 200,000 rows, and yes, that makes sense: 100 years for 2,000 stations would get you close to 200,000. If we do 50, that's about 296,000 rows, and 296,000 divided by 50 would be about 6,000 regions or so. Let's go with 50; I'm happy with that. Okay. At this point, then, we have the mean, the standard deviation, and the n, as well as this year's data, to calculate our Z-score. And so, again, what I want is the Z-score for this year. So we'll pipe this into another mutate, and I will then do... you know what, I don't need to do that.
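As a sketch of the group-then-mutate step just described (using a toy joined data frame and my guessed names, with a tiny cutoff in place of the real 50-year one):

```r
library(dplyr)

# One region with three years of data
joined <- data.frame(
  latitude = rep(-53, 3), longitude = rep(-71, 3),
  all_years = c(3, 2, 4), this_year = rep(0, 3)
)

with_stats <- joined %>%
  group_by(latitude, longitude) %>%
  mutate(mean = mean(all_years),   # long-run mean precipitation for the region
         sd   = sd(all_years),     # long-run standard deviation
         n    = n()) %>%           # number of years of data for the region
  filter(n >= 2)   # the cutoff used on the real data is n >= 50
```

Because this is a mutate, not a summarize, every year's row keeps a copy of the same group-level mean, sd, and n, which is what motivates the switch to summarize later on.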
Because I think I can calculate my Z-score right up here without actually calculating the mean and the standard deviation as separate columns. Yeah, that should work. So what we'll do is z_score = this_year minus the mean of all_years, and that needs to be in parentheses because it's the numerator, divided by the sd of all_years. And we'll put a comma there. So let's see what this looks like. Again, we get a Z-score of negative 0.644 for this weather station. I want to double-check that, so I'll filter for n greater than or equal to 50, latitude == -53, and longitude == -71. And, oh, actually, you know what, I don't need all that, because I have this_year being zero, right? So, with the mean and the standard deviation, if I take zero minus 3.05, put that in parentheses, and divide by 4.73... yeah, that is what that is, so that makes sense. Maybe what I'd also like to do is use the local weather of Ann Arbor, where I'm at. Again, that local region is going to be pooling multiple stations together, so if I get the Ann Arbor weather station, it'll be close, but not exact. So I'm going to open up another bash terminal. If we look at our data directory, we've got this ghcnd-stations file, and I think I can grep for ANN ARBOR, since I think all of the text in these files is in all caps, against data/ghcnd-stations.txt. Then again, I guess I don't need to look it up if I know the latitude and longitude where I am: that should be 42 and negative 84. So if I search for... no, latitude is first, so it should be 42 and negative 84 for Ann Arbor. And that shows me that this year we've had 0.117 centimeters of precipitation for the previous 28 days.
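As a quick check on that spot-check arithmetic (the 3.05 and 4.73 are the rounded values read off the console, so the result differs slightly from the unrounded -0.644 shown on screen):

```r
# Region at latitude -53, longitude -71: this year's precipitation was 0,
# the long-run mean about 3.05 cm, the standard deviation about 4.73 cm
z <- (0 - 3.05) / 4.73
round(z, 3)   # about -0.645 with these rounded inputs
```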
That was like the month of August, right? So I think that makes sense; it was a bit dry. And yeah, that gets us negative 1.8, which is almost two standard deviations drier than we would normally expect. Okay, I think that looks good, so let's clean this up a bit. Let's remove our mean and our standard deviation columns; we've got our z_score and our n, because again we want to make sure we have at least 50 years' worth of data for all of these. And I'm wondering: do I want mutate, or do I want summarize? I think I actually want summarize, because I only need the single Z-score value as output; you'll notice that all of these years have the same value, because we're using the data from this year. So now what I get is my latitude, longitude, and z_score, and I'm wondering why I've got all sorts of rows here, and why I don't just have one value for each latitude and longitude. With the group_by on latitude and longitude, I should be getting out a data frame with 3,400 rows, one for each region, right? The summarize does z_score = this_year minus the mean of all_years, and I think what's happening is that because this_year is a vector, I'm getting one value for each element of this_year. And of course there's a value in this_year for every year, all the same, so I'm getting a really long vector and basically getting back what I had before with the mutate. Does that make sense? So I need to collapse this_year down to a single value. There's probably a more efficient way to do this, but what I'm going to do is take min(this_year), because the min of 100 or 1,000 copies of the same value is that same value. So now when I run this, I get 2,889 rows. Is that what I said up here?
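The summarize version with the min() trick might look like this; as before, this is a sketch over a toy data frame with my reconstructed names:

```r
library(dplyr)

# One region, three years; this_year is the same value repeated on every row
joined <- data.frame(
  latitude = rep(-53, 3), longitude = rep(-71, 3),
  all_years = c(3, 2, 4), this_year = rep(0, 3)
)

z_scores <- joined %>%
  group_by(latitude, longitude) %>%
  # min() collapses the repeated this_year value to a single number,
  # so summarize returns one row per region instead of one row per year
  summarize(z_score = (min(this_year) - mean(all_years)) / sd(all_years),
            n = n(),
            .groups = "drop")
```

Without the min(), summarize would return a length-n result per group and you would get back the same long data frame the mutate produced.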
Well, up here I had 3,400, but it's also filtering out those regions that had fewer than 50 years. So now I've got 2,889 regions, with their z_score and their n. That's cool. And I can then ungroup this, so I'll do .groups = "drop". And there we are. Cool. So I think that puts us in a good position to start thinking about plotting these data. And let's see, what did we do? We created a data frame that contained the data for this year. That then allowed us to do a join with our other data, lat long PRCP, where we joined by region, that is, by latitude and longitude. We then renamed the precipitation columns to make them a little easier to work with. Then we grouped our data by latitude and longitude and summarized to get the Z-score, using the value for this year, minus the mean of all years, divided by the standard deviation of all years. That again gives us the Z-score, which is a metric for the number of standard deviations beyond the mean, either drier or wetter than we would otherwise expect. And then we filtered to keep those regions that had at least 50 observations. Okay, I think that's pretty good, and I think that's enough for today's episode. In the next episode we'll go ahead and make a plot of this. I'm sure along the way I'll maybe drop in little notes if I made goofs, but hopefully I didn't. And hopefully, again, you get a sense of me talking out loud and thinking through how I do this analysis. And you can see a little bit more of how much my fingers fumble around the keyboard, making mistakes and little errors, like latitude and longitude not being in quotes. So I hope you found this helpful. Again, I don't want to leave the impression that I am perfect, or that my coding skills are perfect.
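Putting the whole episode together, the pipeline might be sketched end to end like this. All of the names and the toy data are my reconstruction; the real data has around 65 years per region and the cutoff used on it is 50:

```r
library(dplyr)

# Toy data: two regions, three years each
lat_long_prcp <- data.frame(
  latitude  = c(rep(42, 3), rep(-53, 3)),
  longitude = c(rep(-84, 3), rep(-71, 3)),
  year      = rep(2020:2022, 2),
  mean_prcp = c(5, 6, 1, 3, 2, 0)
)

# Step 1: pull out this year's observations
this_year <- lat_long_prcp %>%
  filter(year == 2022) %>%
  select(-year)

# Steps 2-5: join, rename, compute one z-score per region, apply the cutoff
z_scores <- inner_join(lat_long_prcp, this_year,
                       by = c("latitude", "longitude")) %>%
  rename(all_years = mean_prcp.x, this_year = mean_prcp.y) %>%
  group_by(latitude, longitude) %>%
  summarize(z_score = (min(this_year) - mean(all_years)) / sd(all_years),
            n = n(),
            .groups = "drop") %>%
  filter(n >= 3)   # the episode used n >= 50 on the full data set
```

Negative z-scores mark regions that were drier than their long-run average this year; positive ones mark wetter regions.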
Most of the bugs that you see in the videos are honest-to-goodness bugs, where I just typed things incorrectly, right? I forgot quotes, or I misspelled things, or I forgot parentheses or something. So I want you to realize that that's normal; there is nothing wrong with you if you're introducing typos or getting bugs. A huge part of learning to program is figuring out what error messages mean and how to resolve them. We didn't have too many in this episode, but again, I hope this is helpful in normalizing failure as part of getting to success. Thanks, and we'll see you next time for another episode of Code Club.