Hey folks, in science it's common to have a set of entities that we measure in a variety of different ways, and we record that data in multiple files, right? So we might have one file that describes all of the entities, say all of the field plots where the plants we're studying are growing, or all of the weather stations we're monitoring. And then we might have multiple other files containing the different types of data we're collecting on those entities. Well, that's exactly what we have here in the recent series of episodes I've been doing on Code Club. We have a whole bunch of NOAA weather stations, about 122,000 actually, and I pulled the total amount of precipitation from those 122,000 weather stations for the last month. What I'd like to do today is merge the precipitation data with the metadata we have on those weather stations. You might recall from the last episode that I found a few cases where there are 1,000 different weather stations within about a 70 mile diameter of each other. That's a finer resolution than I really need. So what I'd like to do is bring those data together, and when I do, I'll know what region each weather station belongs to, and then for each region I can calculate an average amount of precipitation. We'll be leaning heavily on the join functions, especially inner_join(). Along the way I'll revisit some of the discussion we had previously about reading in fixed-width formatted files and parsing dates, and talk about how we can group_by() and summarize() our data so we get one precipitation value per year per region of weather stations. If you'd like to get the repository as it currently stands for this project, I encourage you to go down below in the description for a link to a blog post that will get you all the information you need to get caught up. I'm here in my conda environment that I've called drought, and I've got R loaded.
So something that kept me awake at night was: did I download too many files from the NOAA website? We have the ghcnd-inventory.txt as well as the ghcnd-stations.txt files. The ghcnd-stations.txt file is what we used in the last episode to get the latitude and longitude for each of the 122,000-some stations. But I also have this ghcnd-inventory.txt file, which, wouldn't you know it, has the same information I'd like — the station ID, latitude, and longitude — plus the type of data that was recorded (I'm interested in PRCP) and the starting and ending years. The thing that was really keeping me up at night was worrying that I might have partial years' worth of data in here. I know from the work I did previously, using a weather station outside of Ann Arbor, that the first year, 1891, was pretty incomplete; recording started in, say, October. So I'd like to remove 1891 from Ann Arbor, and if a station's data recording didn't end in 2022, I'd also like to remove the last year of its data. So I'd like to start by revisiting my merge_lat_long.R script and changing it to read in the ghcnd-inventory.txt file. Let's go back to the README and search down for the inventory metadata. I'm going to swap that in place of what I had before for the stations — there's the stations metadata, and here is the inventory. Again, I'll grab that, plop it in here, and comment it out, because it's not our code, right? Now let's see what we need to change to get this all to work. To make sure I've got the right positions and columns: I want the ID, the latitude, and the longitude; I don't care about the elevation or any of these other variables, so I'll remove those. But I do care about the element, which is in columns 32 to 35, the first year in columns 37 to 40, and the last year in columns 42 to 45.
All right, and then I also need to change ghcnd-stations to ghcnd-inventory. Good. Then let's clean this up and bring this closing parenthesis back up here. I don't need the select() call anymore, because I want all of the columns that I've just read in. Okay, so let's load the tidyverse, as well as all of this other code down through the read. Now I see I've got the weather station ID, the latitude, longitude, the element, the first year, and the last year. I now want to filter() to keep only those rows where the element is PRCP; again, that's the only piece of data I really care about for this project. I can then pipe this into my rounding of the latitude and longitude, then do our group_by() on longitude and latitude, and then assign the region to each of those IDs. A lot of this is review from the last episode — if you were with me, great. So now we have the 7,900 different weather stations. I think in the last episode we had just over 8,000, so there must have been about 100 or so weather stations that didn't have PRCP data. So again, these are the weather stations that have PRCP data, and we get our ID, the latitude and longitude rounded to the nearest whole number, the PRCP element column, the start year, the end year, and the region number. I don't need that element column anymore, so I'll do a select(-element), and running that drops out the element column. We'll then pipe this to write_tsv(). I previously called the output data/ghcnd_regions.tsv; I'm going to call it data/ghcnd_regions_years.tsv instead. Now, if I save that and write it out, we see over in my data directory that I've got ghcnd_regions_years.tsv, and I want to get rid of the old ghcnd_regions.tsv. So let's open up another terminal, and I'll do rm data/ghcnd_regions.tsv. I also want to rename my merge_lat_long.R script.
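Pulling those steps together, the revised script might look something like this sketch. The ID, latitude, and longitude positions (columns 1–11, 13–20, and 22–30) come from the GHCND readme, and cur_group_id() is one way to number the regions; treat the exact positions and the region-numbering approach as assumptions if your version of the script differs:

```r
library(tidyverse)

read_fwf("data/ghcnd-inventory.txt",
         col_positions = fwf_positions(
           start = c(1, 13, 22, 32, 37, 42),
           end   = c(11, 20, 30, 35, 40, 45),
           col_names = c("id", "latitude", "longitude",
                         "element", "first_year", "last_year"))) %>%
  filter(element == "PRCP") %>%              # precipitation records only
  mutate(latitude = round(latitude, 0),      # round to whole degrees
         longitude = round(longitude, 0)) %>%
  group_by(longitude, latitude) %>%
  mutate(region = cur_group_id()) %>%        # one region per lat/long combo
  ungroup() %>%
  select(-element) %>%
  write_tsv("data/ghcnd_regions_years.tsv")
```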
To do that in a way that works nicely with git, I'll do git mv code/merge_lat_long.R code/get_regions_years.R. Now I'll close this R script, because you can see that VS Code has struck through the old filename, and open the renamed file back up. So we've got the code that we just modified. If we come down to the Snakefile now — again, I called the output ghcnd_regions_years, so I need that to be the correct target. Coming all the way down here, I had a rule called aggregate_stations; I want to rename it to get_regions_years. And then my code changed too, right? So the script is now code/get_regions_years.R. And the input data — no, it's not stations anymore, it's ghcnd-inventory.txt. And the output, as we saw, is ghcnd_regions_years. I'll save that, and then we can run snakemake -n to do a dry run, just to check. It says everything is up to date; that's probably because I saved my get_regions_years.R file after the output was created. So I'll remove data/ghcnd_regions_years.tsv and try the dry run again. Now it says we need to run the get_regions_years rule, which is what we expected, right? So I'll do snakemake -c 1 — we only need one processor for this — and it goes through and very quickly outputs the updated version of ghcnd_regions_years.tsv. Again, we see we've got the station ID, latitude, longitude, first year, last year, and region. I'm going to commit these changes, so I'll do git status, and I see that I renamed that file and modified these other things. I'll do git add . — again, I only like to use git add . after I've run git status and know exactly what I'm committing — and then git commit -m "Collect time span in years for each weather station".
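For reference, the updated rule might look roughly like this; the rule name and field layout here are my assumptions based on the episode's description, not a copy of the actual Snakefile:

```
rule get_regions_years:
    input:
        script="code/get_regions_years.R",
        data="data/ghcnd-inventory.txt"
    output:
        "data/ghcnd_regions_years.tsv"
    shell:
        "{input.script}"
```

Because Snakemake tracks file timestamps, saving the script after the output was written is exactly why the first dry run reported everything up to date.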
Now, before I merge my files, one other thing I want to look at is my read_split_dly_files.R script. You'll recall that this works with the concatenate_dly.bash file, where we pull all of the PRCP precipitation data out of our tarball, split it into smaller files, and gzip them. I've been splitting it into million-line files, but I'm a little bit worried about the amount of RAM R will use when I run this with GitHub Actions. You'll recall that I'm limited to, I think, about 14 gigabytes of RAM, and that's on the macOS runners; if I use Linux or Windows, I think it limits me to seven. So I'm going to reduce this to 500,000 lines, which should keep things a bit more compact. So again, we split the data into 500,000-line files, gzip them, and output them to a temp directory, and those are then read by read_split_dly_files.R. I'm going to run these first two lines so I can create a file to test in my R script. Before I do that, I know I need to activate my conda environment for the bash script, so I'll do conda activate drought. Then, running those lines and doing ls on data/temp, I see I've got all of these files — quite a few in there. We can pipe that to wc -l to see that there are 73 different temporary files. So I'm going to save this bash script. One thing that has me a little bit worried is right in here, where I do a drop_na() and a filter(prcp != 0). Something else I worry about late at night when I can't sleep is: what if we've got a location in a desert where it never rains, right? If the precipitation for the month was zero, then we'd have no data for that weather station. So I think I want to keep the prcp == 0 values. What I really want to get rid of are the dates that don't play nicely, right?
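The split-and-compress idea can be sketched like this; the 500,000-line chunk size matches the episode, but the file names and paths below are stand-ins, not the actual concatenate_dly.bash:

```shell
mkdir -p data/temp
seq 1 1200000 > /tmp/prcp_lines.txt               # stand-in for the extracted PRCP records
split -l 500000 /tmp/prcp_lines.txt data/temp/x   # produces xaa, xab, xac
gzip data/temp/x*                                 # compress each chunk in place
ls data/temp | wc -l                              # counts the chunk files
```

With 1.2 million input lines and 500,000-line chunks, you end up with two full chunks and one partial one.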
As you might recall, if you look at the .dly files, each month is on a line, and every month is given 31 days. Of course, not all months have 31 days, so they plug in missing values for those days — in their case they actually use the value -9999. So I'm going to hold off on that drop_na() for now as well. I'm going to assign x to be data/temp/xa.gz, and let's make sure I've got all of these other lines loaded, including the libraries. Great. We need the widths and headers because we used a different approach to read in these fixed-width files; in this case there are far more columns than in the one we just did a little bit ago. If I run this down through the pivot_longer() statement, it outputs close to 15 and a half million rows. Recall that we've got a couple of different types of problematic values in our prcp column. We're going to have some NA values, and those NAs will be for things like February 30 — right, there's no February 30 — but also for days where maybe the sensor went wonky or there were no observations. I'm going to assume that the NAs for true dates are zeros, and that the NAs for things like February 30 obviously need to be removed. So let's run the next lines with the mutate(), where we create the date as well as the prcp as a number. As we saw a few episodes ago when we created this R script, when we use ymd() on dates that don't exist, we get a warning message. To silence that warning, I can give ymd() the argument quiet = TRUE; you'll see in the pop-up that the default for quiet is FALSE, and we want it to be TRUE. What that does is turn dates like February 30 into NA. I actually ran the whole pipeline, and you'll notice I didn't get any warning messages having set quiet = TRUE. So that's good.
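Here's a minimal illustration of that behavior with lubridate's ymd():

```r
library(lubridate)

ymd("1891-02-28")                # parses fine: "1891-02-28"
ymd("1891-02-30")                # NA, plus a "failed to parse" warning
ymd("1891-02-30", quiet = TRUE)  # NA, with the warning silenced
```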
To get rid of the other NA values — those cases where it was a true date but, for whatever reason, the recorded value was NA — I'm going to assume the value was zero. I don't think there are a whole lot of these values, and I don't think it matters in the long run whether I treat them as zero or as NA, because ultimately they'll drop out of the analysis anyway; a zero isn't going to contribute anything. But hey, why don't we learn a new function for dealing with NA values? We'll do prcp = replace_na(), and we can give replace_na() the column we want to work on: prcp. If you don't give it a column, it will treat all of the columns. So what this says is: replace_na() will look at the prcp column, and if it sees an NA, it will replace it with whatever value we want — I'm going to say zero. And then we feed that into the rest of the pipeline. It occurs to me, seeing as.numeric() here, that prcp is being read in as a character type, so I'm going to put that zero in quotes ("0"); otherwise it will complain that my column is of type character but I gave it a numeric value. Again, the values wouldn't really change here if I'm replacing an NA with a zero, so it probably doesn't matter much in the long run. What I do need to do is remove the rows that had bad dates, right? I can do that with drop_na(date). Again, these are the values that, when they came through lubridate's ymd() function, previously would have given me a warning message; because I set quiet = TRUE, they become NA and now drop out. And there we go. Just to remind us what we're doing: I was working in this file because I was concerned that, with these two lines, a weather station in a very dry environment would not yield any data, and places like the Southwest US or parts of the Sahara Desert might never show up in our analysis because there'd be no precipitation, right?
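As a small illustration — the toy tibble here is mine, not the real data — this is how the two kinds of NA get handled:

```r
library(tidyverse)

# One NA value on a real date (a missing observation) and one NA date
# (an impossible date like February 30 after ymd(quiet = TRUE))
toy <- tibble(date = ymd(c("2022-02-28", "2022-02-30"), quiet = TRUE),
              prcp = c(NA_character_, "25"))

toy %>%
  mutate(prcp = replace_na(prcp, "0"),  # column is character, so "0" not 0
         prcp = as.numeric(prcp)) %>%
  drop_na(date)                         # the impossible date drops out
# one row remains: 2022-02-28 with prcp = 0
```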
So again, by leaving in those prcp == 0 values, when we get down to doing things like group_by() and summarize() on prcp, those values will still be there. Okay, let's remove those lines, save this, and close these files. Then I'm going to rerun my workflow with snakemake -c 1, and that should recreate my ghcnd_tidy.tsv.gz file. It took about half an hour to reprocess all of those files, but I feel a little bit better about the state of things, so I think it was worth doing. Now what I want to do is join together the data we generated with the get_regions_years.R script and the concatenated .dly files that we synthesized, so that we can get the amount of precipitation for the last month from all of our weather stations. Again, we'll fire up a new R script. One of the handy-dandy things with Visual Studio Code is that I can click on "Select Language", type R, and it fires up an R script for me. Of course, I can save this, and I'll call it merge_weather_station_data.R. I'm going to open up one of my other R scripts to grab the shebang line, and I'll grab the tidyverse line while I'm there. Back in our new script, we'll make sure we've got the tidyverse loaded. Now I want to read in the two TSV files I have — ghcnd_tidy.tsv.gz as well as ghcnd_regions_years.tsv — because I want to join them together. They're both TSVs, so I'll do read_tsv("data/ghcnd_tidy.tsv.gz"), which reads in about three million rows, and I'll call that prcp_data. Then I'll use read_tsv() to read in data/ghcnd_regions_years.tsv, and I'll call that station_data.
Now I want to join these two data frames together, which I can do with inner_join() on prcp_data and station_data, joining by the id column — that's the first column, the weather station ID. That outputs all of my information: the ID, the year, the prcp for the last month, latitude, longitude, first year, last year, and region. One question we might ask, though, is whether we have precipitation data that we don't have station data for. When we do an inner_join(), we require that the value in id be in both prcp_data and station_data. We can figure out what's missing with anti_join(): doing anti_join() on prcp_data and station_data with by = "id" tells us which IDs are in prcp_data but not in station_data. We get back two weather stations that we don't have station data for. I'm not super concerned about that when we have, in the grand scheme of things, 122,000 total weather stations. It is interesting that one of those stations is missing four years' worth of data, but again, I'm not going to worry about it too much. We can go in the other direction — station_data, comma, prcp_data — which gives us the stations we're supposed to have precipitation data for but, for whatever reason, don't. And remember, we're looking at precipitation data for the last month; I think I actually downloaded these data in the middle of September, so it would be basically the 30 days prior to that, middle of August to the middle of September or so. That data is missing, which is a bit weird. I've spot-checked some of these — I won't bore you with that — and some of them do seem to be in desertified areas, while other places are fairly tropical. So I'm not super excited about digging into these. Again, this is about 1,300 weather stations out of 122,000, and hopefully we'll have multiple weather stations per region, so I'm not going to stress out about it too much.
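In code, the joins look roughly like this; the file and column names follow the episode:

```r
library(tidyverse)

prcp_data    <- read_tsv("data/ghcnd_tidy.tsv.gz")
station_data <- read_tsv("data/ghcnd_regions_years.tsv")

# Stations present in both data frames
inner_join(prcp_data, station_data, by = "id")

# Precipitation data with no station metadata...
anti_join(prcp_data, station_data, by = "id")

# ...and station metadata with no precipitation data for the last month
anti_join(station_data, prcp_data, by = "id")
```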
Maybe if we find that there are big holes in our map, we can go back and worry about that later. So I'm going to comment those anti_join() lines out and move them up ahead of my inner_join(). The inner_join(), again, returns the IDs that were in both prcp_data and station_data, giving us the ID, the year, the precipitation, the latitude, longitude, the first year, the last year, and the region. Now I want to do a bit of filtering, because I'm worried about partial data, right? For example, the first year of data for this weather station was 1957, so I'd prefer to get rid of 1957, and I also worry that 1970 might be a partial year. So I can do a filter with year != first_year and year != last_year. That will remove any data that occurred in the first or last year of that weather station — so we shouldn't have any 1957 or 1970 data for this weather station, and sure enough, we don't. But if I filter on year == 2022, which is this year, we get nothing, and I definitely want this year's data, right? So I'm going to add an OR: I'll wrap those two conditions in parentheses and add an OR for year == 2022. That should get me everything I want: I don't have data from a station's first or last year, but if I filter on year == 2022, now we're in good shape. So now let's group our data by region — we'll do a group_by(region), and for the sake of keeping the variables through, I'll also include latitude, longitude, and year. Then we'll do a summarize() on prcp: mean_prcp = mean(prcp). To remind ourselves what's going on in this group_by(): I'm grouping by the region, which is this final column, as well as the latitude and longitude.
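A toy version of the filtering and summarizing — the station values here are made up purely to show the logic:

```r
library(tidyverse)

# One fake station: first year 1957, last year 2022
joined <- tibble(year       = c(1957, 1980, 2022),
                 prcp       = c(10, 20, 30),
                 latitude   = 42, longitude = -84,
                 first_year = 1957, last_year = 2022,
                 region     = 1)

joined %>%
  filter((year != first_year & year != last_year) | year == 2022) %>%
  group_by(region, latitude, longitude, year) %>%
  summarize(mean_prcp = mean(prcp), .groups = "drop")
# 1957 drops out (first year); 1980 and 2022 are kept
```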
Now that I think about it, I perhaps don't really need that region column, because with the way I've rounded things, I'm going to get the unique combinations of latitude and longitude anyway. But you know, who cares — we'll leave it in there for now. We'll also group by the year, because we want the average for each year so that we can compare this year to all the previous years. Then we get the mean amount of precipitation for that 30-day window, and we get back the region, the latitude, longitude, the year, and the mean precipitation. Something to note about group_by() is that the columns we include in it will be included in the output. So if there's a column you want in the output, be sure to include it in the group_by(), even if it's perfectly correlated with the other grouping variables. That's the case with our region column here: it adds really no information over the latitude and longitude data. I don't know why I didn't think of that earlier, but we can go ahead and remove it from the analysis; if I remove region, the data will still be grouped by latitude and longitude. Something I'll add, then, is another summarize(), because I'm curious how many years' worth of data we have for each latitude and longitude. To do that, I can say n = n(). We get back a data frame of about 7,800 rows of latitude and longitude combinations, with the number of years' worth of data we have for each. I'm going to save this as lat_long_prcp, I guess. Something I'm curious about with lat_long_prcp is what a histogram looks like, so I'll do ggplot(aes(x = n)) and then geom_histogram(). We see this gives us a bimodal distribution. Clearly, I wish I had everything to the right of that 100 mark — maybe there are 2,000 weather stations over there — but there's a big chunk of them below 100.
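Counting years per location and eyeballing the distribution can be sketched like this, again with toy coordinates rather than the real data:

```r
library(tidyverse)

toy <- tibble(latitude  = c(42, 42, 42, 45),
              longitude = c(-84, -84, -84, -93),
              year      = c(2020, 2021, 2022, 2022))

lat_long_prcp <- toy %>%
  group_by(latitude, longitude) %>%
  summarize(n = n(), .groups = "drop")   # years of data per location
# 42/-84 has n = 3; 45/-93 has n = 1

lat_long_prcp %>%
  ggplot(aes(x = n)) +
  geom_histogram()
```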
Coming back to my code, let's see how many locations have an n of at least 100. I'll do lat_long_prcp %>% filter(n >= 100) — oops, I've got an extra character here — and I see there are about 2,133 locations where I have at least 100 observations. I'm not going to worry about the rest just yet; we'll save that for the next episode, where we talk about synthesizing the data further. So I'm going to remove this part of the pipe, and I think I'll keep doing the rest of the analysis for building out the visual here in this merge_weather_station_data.R script. I might change the name of it later — we've already seen how to do that in this episode. So I'll save this and quit out of R, then stage and commit my changes: git add ., then git commit -m "Merge weather station data with precipitation data" (fixing my typo in the message as I go). All right, now we can push that up. And so we're ready for the next episode, as I mentioned, where we'll start thinking about how we can synthesize the distribution of the data for each weather station. What I want to know is: if I look at the past month, how does it compare to the variation we've seen over however many years' worth of data we have for each weather station? So that you don't miss that episode, please make sure you've subscribed to the channel and clicked the bell icon, and of course help spread the word by telling your friends about what we're doing here at Code Club. Keep practicing, and we'll see you next time.