Hey folks, if you watched the last episode of Code Club (which you really should do), you may have noticed that I seemed a little unsure of the analysis I was doing. I was getting results, and I think I even put little pop-ups in the video to this effect, that didn't quite square with my experience or with what I thought was right. Something seemed off. So I've gone back into my code.

Just to refresh you and remind you what we're doing: I'm looking at precipitation data collected at NOAA weather stations from across the globe, and for each region on a degree-by-degree latitude-longitude grid I'm looking at the last 30 days of precipitation. I sum up the total precipitation for the previous 30 days and then calculate a Z score for this year against every other year, so I can see how droughty the last 30 days were compared to all the other data we have on record for that region. So we're at the point where I'm reading in precipitation data and station data, joining them together, grouping by latitude and longitude, and summarizing it all together. I then made a data frame that had the data for this year, 2022, and did a join. I've gone ahead and inserted a filter statement with the latitude and longitude for where I live, just outside of Ann Arbor, Michigan, and again the year 2022.

When I run this, I notice that the total precipitation for this year was 0.117 centimeters. Now, I feel like we've had a fairly wet September. I'm recording this on October 7th, and I recorded the previous video on October 6th, so that would more or less mean that from about September 5th or 6th all the way to yesterday there was only a fraction of a centimeter of precipitation. That's just not true, unless there's something wrong with the gauges. So I got to thinking: what happened? Why is this happening? Because that doesn't seem right. I spot checked a lot of other areas and found that everybody seemed to have really low precipitation for September. While that might be true, it seemed a little fishy.

Then I recalled that back in my ReadSplitDly R script, the script that reads in the individual data for each weather station and month, I recorded today's Julian date using the today() function, which comes from lubridate. If I load it with library(lubridate) and look at today(), it comes up as October 7th. But when I actually ran the pipeline, it was a week or so ago, and it was using data that was originally downloaded a couple of weeks before that. If I run ls -lth on the data directory, the file we're getting this from is ghcnd_all.tar.gz, which I downloaded on September 8th, so it probably only has data through about September 6th or 7th. If I then ran the pipeline on October 4th, hopefully you're catching on: if today was October 4th and we subtract 30 days to get to the start of the window, we're back at about September 4th. That means that most of the values we were actually looking at in that 30-day window were zeros, because we didn't have the data; it was missing, being treated as missing data, NA, whatever.
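To make it easier to follow along, here is a rough sketch of the shape of that analysis so far. The object and column names (lat_lon_prcp, this_year_prcp, mean_prcp, this_year) are my reconstruction from what's on screen, not copied verbatim from the merge weather stations script, and the Ann Arbor latitude and longitude are placeholders:

```r
library(tidyverse)

# Rough shape of the original approach (names are stand-ins, not the exact
# ones in the script): join the 2022 totals onto the all-years data, then
# collapse each grid cell down to a single Z score with summarize().
z_by_region <- lat_lon_prcp %>%                         # one row per cell per year
  inner_join(this_year_prcp, by = c("lat", "lon")) %>%  # adds the this_year column
  group_by(lat, lon) %>%
  summarize(z_score = (min(this_year) - mean(mean_prcp)) / sd(mean_prcp),
            .groups = "drop")

# Spot check the 30-day total for the grid cell just outside Ann Arbor,
# Michigan (placeholder coordinates):
lat_lon_prcp %>%
  filter(lat == 42, lon == -84, year == 2022)
```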
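And to make the date mismatch concrete, here's a minimal back-of-the-envelope check with lubridate. The specific dates are my assumptions from what I just described (archive downloaded September 8th, pipeline run around October 4th), not values read out of the real archive:

```r
library(lubridate)

# What today() returned when the pipeline actually ran (assumed):
run_date <- ymd("2022-10-04")
window_start <- run_date - days(30)   # about September 4th

# Roughly the newest observation in the archive downloaded September 8th:
last_obs <- ymd("2022-09-06")

as.integer(last_obs - window_start)   # ~2 days of the window have data
as.integer(run_date - last_obs)       # ~28 days of the window have nothing
```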
If you sum up the precipitation over that 30-day window, it's going to be much smaller than it actually was, because we didn't have all the data from that window. So what I need to do is go get the newer version of ghcnd_all.tar.gz.

Now, I could certainly rewrite how I did line 32 here so that, instead of using the actual current day, it used the last day we got data from the weather stations. But that's not really something I can do on the fly, because of the way this script reads the data in: we took that large 3.3 GB archive, which is really about 30 GB decompressed, and we're pulling out different chunks of it. So I don't have a way to get "today" from that data; I need to give it today, and I'm happy leaving it as today(). That means I need to update the archive file itself.

We have done everything in Snakemake to this point. You'll recall that we have a Snakefile, and way back up at the top I have this get_all_archive rule, which goes out and gets data/ghcnd_all.tar.gz. I could delete that file and then run Snakemake on my targets, and it would regenerate the file and then regenerate everything else. That seems a little inelegant. What I'd rather do is force Snakemake to rerun this rule.

One question we might have is: if we update this target, this output, what else will change? One thing we can do is look at the DAG, the directed acyclic graph of what the overall pipeline looks like at this point. I want to make sure I'm in the right environment, so I run conda env list; I see I'm in my base environment, so I do conda activate drought. To build a visualization of the pipeline, I can run snakemake --dag targets. Recall that targets is the main rule up at the very top; it's the very first rule, and it lists the outputs of all of my other rules, so ghcnd_all.tar.gz is there, of course. I can then pipe this to a special tool called dot, use the argument -Tpng to say I want the image output as a PNG, and redirect the output to a file I'll call dag.png. Then we can see all of the different rules and how they link together.

This get_all_archive rule is the one I really want to update, because it holds all the data. Updating it will then update the get_all_file_names rule, then summarize_dly_files, and then the final targets, which is what feeds the merge weather stations R script I started out talking about today. So that's the plan: we're going to kick this rule to force Snakemake to update it, and it will update everything downstream. Because the inventory file and the station data might also get updated along the way, I'm going to force Snakemake to redo those as well. I'll use snakemake -R; capital R forces a rule to rerun. After -R we give the names of the rules we want to force: get_all_archive, get_inventory, and get_station_data. Snakemake will rerun those rules and then everything else that depends on them.
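Here's what those commands look like typed out; the environment name, target rule, and output file name are the ones mentioned above:

```bash
# Check which conda environment is active, then switch to the project one
conda env list
conda activate drought

# Ask Snakemake for the DAG in graphviz format and render it to a PNG with dot
snakemake --dag targets | dot -Tpng > dag.png
```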
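And the forced rerun looks roughly like this, with the rule names written out as Snakemake identifiers (the -c 1 part is explained just below):

```bash
# Force these three rules to rerun; everything downstream of them in the DAG
# gets regenerated too. With no explicit target, Snakemake builds the first
# rule in the Snakefile, which here is targets.
snakemake -c 1 -R get_all_archive get_inventory get_station_data
```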
And I'll go ahead and put -c 1 on there, because Snakemake needs the number of processors or threads it's going to use, and everything I've got uses one processor. This will run, and it'll probably take 45 minutes to an hour, because it takes a while to download the file and a while to process the data. Then we'll come back and see what we actually get for the amount of precipitation in Ann Arbor over the last month.

So it took about 45 minutes to redownload and reprocess the data to the point where we're ready to read it back in with our merge weather stations R script. I hope you agree with me that Snakemake is just an invaluable tool for keeping track of our pipelines. I think it's so awesome that I was able to basically kick Snakemake, let it run, walk away, and come back 45 minutes later to find the analysis redone without any error messages. To me, that is the main benefit of engaging in reproducible research.

I'm going to come back over to R, reload these different data frames, and see what we get for Ann Arbor, and whether we get a value that's a little more in line with what I think it should be. Sure enough, we see that for this year we had a total precipitation over the last 30 days of 4.26 centimeters. That 4.26 divided by 2.54 centimeters per inch is about 1.68 inches of rain, and I think that squares pretty well with what the rain gauge out in front of my house accumulated over the course of the month. So yeah, that seems a lot more reasonable than tenths of a centimeter.

Next I'm going to grab this filter statement and throw it at the end, because what I want to know is what the Z score was for this area. I'll rerun it; I've got year == 2022 in the filter, but there is no year column at this point in the data frame, so that part of the filter has to come out. What we see is that, yeah, I guess it was a little bit drier than normal, based on the 151 years' worth of data that we have for the Ann Arbor area.

Now I'd like to refactor my code. One of the thought bubbles I had as I was watching back through the last video was that I probably didn't need to create that separate data frame with the data for this year; I'd like to bring it all together into a single pipeline. I'm going to grab this output and plop it up here, and I'm going to comment it out, because I want to save it for when I refactor, to make sure I get the same result. Let's get some more breathing room here. I'm going to forgo those first few steps, so I'll comment them out, and instead I'm going to grab lat_lon_prcp and pipe it into everything else. I'll also comment out this code chunk up here; once it's commented out, it's basically headed for the cutting-room floor.

So again, we've got lat_lon_prcp (maybe I do need this up a little higher), which has my latitude, my longitude, the year, and the mean_prcp. I want to group it by latitude and longitude. Before, we used summarize to get the Z score, so we got one Z score for each latitude and longitude; summarize takes all the data in a group and synthesizes it down to a smaller number of values. Instead, I'm going to use mutate. That still creates a column, but my calculations are done within each group.
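Here's a toy illustration of that distinction, with completely made-up numbers, just to show why mutate() is what we want here:

```r
library(dplyr)

# Made-up data: two regions, three years each
toy <- tibble(region = c("a", "a", "a", "b", "b", "b"),
              prcp   = c(1, 2, 3, 10, 20, 30))

# summarize() collapses each group down to a single row
toy %>%
  group_by(region) %>%
  summarize(mean_prcp = mean(prcp), .groups = "drop")

# mutate() keeps every row, but mean() and sd() are computed within the
# group, so each year can get its own Z score
toy %>%
  group_by(region) %>%
  mutate(z = (prcp - mean(prcp)) / sd(prcp)) %>%
  ungroup()
```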
So where I had mean_all_years, which was the mean of the mean_prcp column across all the years, I'll now compute mean(mean_prcp) directly, and I'll do the standard deviation, sd(mean_prcp), as well. I also used to take the min of the this_year column, the column that came from the 2022 data frame, to get a single value to use as the x in the Z score. Because I'm doing a mutate, I get back the same number of rows I put in, and each row is a different year, so I'm going to put mean_prcp in there instead, which gives me a Z score for every year for every region in the data.

Now if I run this, I see that I get different Z scores depending on the mean_prcp value. And if I run the whole pipeline with the Ann Arbor filter, I see the Ann Arbor data going back to 1872, which was the first full year we had data for. I've also got a .groups = "drop" argument left over from the summarize function; inside mutate it just creates a literal column called .groups, so I'll remove it. Let's add filter(year == 2022), and sure enough, we get the same output that we had the previous way, with a lot less code. That's pretty slick. All that really matters is that we get the right answer, but I like to have simplicity.

The other thing I can do: through this point of the pipeline we still have that n column, so I can do select(-n) to get rid of it. If I look at the Ann Arbor data again, I see the Z score, but I also see that things are still grouped. So at the very end I'll do an ungroup(); actually, maybe I'll move the ungroup() back up right after the mutate, so that things aren't grouped for too long. It probably doesn't matter whether the data are grouped at this point, but I've seen weird things happen when data stay grouped too far into a pipeline, so I always like to remove the grouping as soon as possible.

Very good. Let me go ahead and clean this up. We'll remove the commented-out code, remove that final bit of code, and remove the filter for the Ann Arbor area. I'd like to add filter(year == 2022) in here, and then I can also remove the year column with select(-year). That gives me the lat, the long, the mean_prcp, and the Z score. I don't really need mean_prcp, so I'll remove that too. At this point I may as well have just said which columns I actually want, but hey, it all works. We now have our latitude, longitude, and Z score (see the sketch below for roughly what the full pipeline looks like); those are the three values I'll be generating a plot from in the next episode. So that you don't miss that episode, please make sure you've subscribed to the channel and clicked the bell icon, and give me a thumbs up; that way you'll know when that episode is released.
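For reference, here's roughly what the refactored pipeline ends up looking like; the object and column names are stand-ins for what's actually in the merge weather stations script, not copied from it:

```r
library(tidyverse)

# One row per grid cell per year coming in; one Z score per grid cell for
# 2022 going out (names are my stand-ins, not verbatim from the script).
lat_lon_prcp %>%
  group_by(lat, lon) %>%
  mutate(z_score = (mean_prcp - mean(mean_prcp)) / sd(mean_prcp)) %>%
  ungroup() %>%
  filter(year == 2022) %>%
  select(lat, lon, z_score)
```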