Hey folks, over the last several episodes I have been building out a project where I'm trying to highlight different aspects of tools that we can use to make sure that our analysis is more reproducible. We've talked about things like version control, we've talked about package management with tools like Conda and Mamba, we've talked about pipelines with Snakemake, we've talked about R, we've talked about bash; we've talked about a lot of great tools. I love these episodes because they really allow me to show you how different tools tie together to bring about a great result. The final result that I want to bring about is a web page that gets updated every day, whenever the NOAA website gets updated, to show a world map indicating the level of drought for the past month for different regions of the globe. And so we're going to look at the amount of precipitation over the past month relative to that same window of the year for the past 100 or so years. A tool that we're going to use to help do that is called GitHub Actions. At least that's my plan; that might change, but that's my plan. GitHub Actions allows me to automate this process, but when you're using other people's resources, you have to follow other people's rules. One of the sets of rules that we have to be mindful of is that we can only use 14 gigabytes of drive space and 15 gigabytes of RAM at any given time. In the last episode we talked about some of the challenges that we are going to face because the data we're getting from NOAA is quite large, and we solved the problem of reading that data into our computer effectively and quickly: we shortened the time from three weeks down to a couple of minutes. The challenge that I am now facing is that when I try to read that concatenated file, which is compressed and relatively small, into R, reading it in its wide format and then pivoting it longer takes up quite a bit of RAM. I've got top running down here, and you can see that my R session, as it's currently processing this whole pipeline, is consuming 26 gigabytes of RAM. That's clearly way too much and is going to force me to find a different solution for what we're doing here. So as I think about the data, a couple of things occur to me for how we can clean this up even further to make it a lot smaller. First of all, at the stage where I am pivoting longer, making a date, and converting the precipitation from tenths of millimeters into centimeters, I don't need to do that on the full data set with its 36 million rows, or however many rows there are. Instead, I could chunk it: rather than taking 36 million data lines all at once and using up 26 gigabytes, I could take 36 files that each have a million lines, and then pivot each of those longer and do the conversions. So that's one thing. Another thing that occurs to me is that a lot of the data in these files are zeros. If we're looking at precipitation, most days don't have precipitation, right? So I can remove those zeros to free up a fair amount of space as well. To do all this, we're going to need both bash and R. I already have a file here for reading in the concatenated .dly file, and I also have the script that we made in the last episode for concatenating the files within my tarball.
And so I'm going to show you how we can continue to develop these two scripts so that they work together to produce a file that will be a lot smaller and use a lot less drive space as well as RAM, which in the end will make it possible for us to put our analysis up on GitHub and run it with GitHub Actions. Something to keep in mind is that if you have a Mac and you're using bash, the odds are good that the bash version you are using is quite old. So before we get going too much further, I'm going to make sure that I have the latest tools available for my project. Again, I'll do a conda env list to see which environment I'm in. I'm in my base environment, and I want to be in that drought one, so I'll do a conda activate drought. To get the latest command line tools, I can do mamba install coreutils, and this will get us the latest version of the core utilities for running on the command line. I will also want to come to my environment.yaml file and add coreutils to the end of it. I see that coreutils version 8.32 was what was installed, so I'll go ahead and pin that version there and save that. The next time we burn down and relaunch our environment, that will get loaded; for the time being, we have the latest version of those command line tools available to us here in this project. Great. So the first tool that I want to tell you about is a tool called split. Split is a tool that will take a file and split it into many files. We could imagine having this data tarball, data/ghcnd_all.tar.gz; if we look at that, it's 3.3 gigabytes. Perhaps we're sending it over the internet and we don't want to send 3.3 gigabytes all at once; we'd rather chunk it up into smaller pieces. This is one of the common places where people use split. If I look at the help page for split, I see that there are a variety of ways that I can split a file. Some of the arguments that I commonly use with split include -b for the byte size, so that we make sure every file is only so big. Another that I often use is -l, for lines, which puts a specific number of lines into each file. And another is -n for the number of chunks: say you want to generate 10 files that are split versions of the bigger file, you can use -n. So what we might do is split -n 40 data/ghcnd_all.tar.gz, and this is going to take that tarball and split it into 40 files. Now, what you'll notice is that we have a whole bunch of these files that start with x; these are the split files. We talked about wc last time, so I can do wc -l x*, where the star will expand to match anything that starts with x. What we find is that we now have 40 files that each have different numbers of lines in them but are roughly the same size: if I do ls -lth x*, I find that they're all 84 megabytes. Something else you should notice is that split's output files by default come out as xaa, xab, and so on, in alphabetical order. So that's splitting by -n. If I instead split by bytes with -b, let's do 40 million; I think that's right. Yeah. And I forgot to delete all those other files, so I'll do rm x* and then rerun my split. Then let's look at ls x*, and we can see all those files; and if I do wc -l, I can see that there are 88 files.
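Here is what those two passes look like as commands; a sketch of the session, with the tarball path data/ghcnd_all.tar.gz as used in this project:

```bash
# First pass: split the tarball into 40 roughly equal chunks
split -n 40 data/ghcnd_all.tar.gz

# The chunks come out named xaa, xab, ...; check counts and sizes
wc -l x*
ls -lth x*

# Second pass: remove the first set and split by size instead,
# 40 million bytes per chunk
rm x*
split -b 40000000 data/ghcnd_all.tar.gz
wc -l x*
```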
And if I do ls -lth on those x files, I see that they're 39 megabytes each, because 40 million bytes is not the same as 40 megabytes; there's a little bit of a conversion factor there. The third approach that I frequently use is to split things by the number of lines. That doesn't totally make sense for something like a tarball, but we could think about it for the file that lists all of the .dly files, data/ghcnd_all_files.txt. What we could do is split -l 1000, which will generate files with 1,000 lines each, on data/ghcnd_all_files.txt. This is a file that should have 122,000 or so lines, so we should get out 122 or so files. That should write over all the existing x files. Again, ls x*, and we see that there are 123 files; 122 and change, right? Then if I do head on xaa, I see that the first 1,000 lines are in xaa. And if I do head on xab, I won't have that header line, because that was only at the top of the first file. Okay, so split is a really useful tool for splitting up our files: you can split a binary file, and you can split text files. Typically, people will use split when they have a big archive that they split into, say, 1,000 files and then upload to a web page or move to a different drive, knowing that sometimes internet connectivity isn't what it should be. If they lose connectivity, they can see that, say, 50 of these 100 files have moved and they only need to move the other 50. Once everything gets to the remote computer, the pieces can be concatenated back together, and the result will be as good as the original file. So we're going to use the split command as part of our pipeline. To this point, because we've got tar with the capital -O option, the pipeline decompresses our tarball to a stream, and it runs that stream through grep to keep only those lines that have PRCP, the precipitation data. What we can do next is split that stream into chunks of so many lines per file, and then run each chunk through gzip. So let me show you how we can do that. Again, we will pipe this into split. I'm going to split it into million-line files; we've got 36 million lines in the output of this pipe so far, so that will give us 36 different files. We could go for more files if we wanted, but let's start with a million lines and see how we do. Then, because split doesn't write to standard output, we can use an argument called --filter. With --filter, we give split a command to run on each chunk, and the command we're going to run is gzip. We're then going to redirect the output, like we did here when we redirected the output of gzip to a directory: I'm going to put this into data/temp/, and then I want the file name, so I'll say $FILE.gz. If we look at the help for split, we'll notice that somewhere up here there is a --filter=COMMAND option for writing to a shell command, and the file name is available in the variable $FILE. So that's the command that's here, and we redirect it to data/temp/$FILE.gz. That should take our data, decompress and pull apart the archive, grep out the lines that contain PRCP, split those into about 36 or 37 files that each have a million lines, compress each one, and output it to data/temp/$FILE.gz.
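And here's a sketch of the full pipeline I just described; the tar and grep pieces come from last episode's script, so treat the details as an approximation:

```bash
# Stream the archive's contents to stdout (-O), keep only the PRCP
# lines, chop the stream into million-line chunks, and gzip each chunk.
# split's --filter runs the given command once per chunk, with $FILE
# holding the chunk name (xaa, xab, ...); the single quotes keep the
# outer shell from expanding $FILE. Assumes data/temp/ exists; I create
# it with mkdir -p in a moment.
tar -xzOf data/ghcnd_all.tar.gz | \
    grep "PRCP" | \
    split -l 1000000 --filter='gzip > data/temp/$FILE.gz'
```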
So I need to make that temp directory. I'll do mkdir -p data/temp; that -p says make it, and if it's already there, don't give me any noise about it. So I'll go ahead and run these two lines down here. While this is running, let me show you what's going on in data/temp. If I do ls data/temp, I see it's generating those xaa, xab, xac, whatever files, but with the .gz extension. And if I do ls -lth on data/temp, at the top here I see that it's currently building xas.gz. You don't see any other files appearing, and that's the beauty of these bash tools that stream their output: I'm not actually saving anything else to the disk. If I do top and sort by memory, I see that nothing here is really using tons of RAM as it does this operation. So again, the nice thing about the bash tools is that they're really lightweight in terms of what they demand of your computer in memory and space, whereas if I were to do this in R, as we saw, it just uses tons and tons of RAM and tons of space. That all ran through, and again I can do ls -lth on data/temp to see that I've got those 37 or so files. They're all close in size, but they're going to differ a bit because they have been compressed, and different files will compress differently based on their contents. Great. So now we have these 37 individual files that are basically subsets of our overall tarball. The next thing that I want to do is take one of these compressed files and run it through my read dly files script. We can read it in, clean it up, pivot it longer, and then output it; we'll then get another 37 output files that we can compress together, and then we can clean everything up. So we're now going to move over and do things in R. I'm going to change the name of this R script, and to do it in a git-friendly way, I'll do git mv code/read_dly_files.R code/read_split_dly_files.R. Okay. Now I'll close that file and reopen it. Great. As always, I want to practice with smaller files, and a smaller number of files, before I do all 37. So I'll go ahead and get these libraries loaded here in R, and we want to make sure that we've got our column widths set up for this fixed-width data. Here I'm going to put in data/temp/xaa.gz, and this should run rather smoothly. That took a second or so to run. If I look at top currently, I see that my R session is using 392 megabytes, considerably smaller than the nine gigabytes that we saw at the end of the last episode. We can then run this through the rest of the pipeline here; I'm going to hold off on writing it out for now. That outputs a data frame with 29 million rows. One thing I notice, just in these first 10 rows, is that seven of them are precipitation values of zero. I don't need to keep those around, because ultimately what I'm going to look at is the sum of the amount of precipitation in a window of time. So I'm going to come back to my code and add a filter to remove the rows that have precipitation of zero: filter(prcp != 0), and add a pipe to that. Also, I notice here that I've got filter(element == "PRCP"); I already did that in my grep statement, so I'll get rid of it. I also don't need to handle element in the select statement, so I can remove element from there as well.
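Put together, the read-and-pivot step at this point looks roughly like the sketch below. The column positions follow NOAA's published GHCN-Daily .dly format, but the helper name and the details are my reconstruction, not necessarily the exact code on screen:

```r
library(tidyverse)
library(lubridate)

# Read one gzipped chunk of PRCP lines and tidy it (a sketch).
# Each .dly line is: ID (11), year (4), month (2), element (4), then
# 31 repeats of value (5) / mflag (1) / qflag (1) / sflag (1).
read_dly_chunk <- function(path) {
  read_fwf(path,
           fwf_widths(c(11, 4, 2, 4, rep(c(5, 1, 1, 1), 31)),
                      c("id", "year", "month", "element",
                        paste0(rep(c("value", "mflag", "qflag", "sflag"), 31),
                               "_", rep(1:31, each = 4))))) |>
    select(id, year, month, starts_with("value_")) |>
    pivot_longer(starts_with("value_"),
                 names_to = "day", names_prefix = "value_",
                 values_to = "prcp") |>
    filter(prcp != -9999, prcp != 0) |>          # drop missing and zero days
    mutate(date = ymd(paste(year, month, day, sep = "-")),
           prcp = prcp / 100) |>                 # tenths of mm -> cm
    filter(!is.na(date)) |>                      # e.g. February 31st
    select(id, date, prcp)
}

d <- read_dly_chunk("data/temp/xaa.gz")
```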
So those are little things we can do to make our code run a little faster and more simply. I'll go ahead and run this, and we should really drop down the number of rows in our data frame. Now we see that we've got about 5.8 million rows. But again, we're going to have 5.8e6 times 36, and that's a big number: about 208 million rows in our data frame. So I want to think about other ways that we can simplify each of these 36 files to make them smaller. One thing that occurs to me is that at this point in the pipeline, we could filter out those days that are outside of the window that we're interested in. Again, constraints force you to make certain decisions and accept certain trade-offs. We're not all the way through the pipeline yet, so I'm a little bit uneasy saying, well, I'm going to set a 30-day window for my analysis; but I feel good doing that, and perhaps we could make it a variable that the user passes in to set the size of the window. So I think what we can do is filter on date to get things that are within 30 days of today, or any other day that we might give it. Also, for each window, for each year, and for each station, we can sum up the precipitation. We can do those two things to make the output, I think, a lot simpler. The first thing I want to do is define the window. At this point in the pipeline, I'm going to do a slice_sample with n = 1000, and I'm going to output this to a variable that I'll call d, using the right assignment arrow, which is the same as the left arrow but pointing the other direction. Looking at d, we now see that we have 1,000 rows of our data frame. This will be a lot simpler to work with for doing some quick manipulations to figure out what is in the window. Once we're done, we can merge our two pipelines back together. So we'll take d, and I'm going to do a mutate. The strategy that I'm going to use is, for each year, to calculate the Julian day. The Julian day is also called the year day: January 1 is the first Julian day, January 10 is the 10th Julian day, December 31 is the 365th Julian day. By keeping track of the Julian day, we can say, well, today is Julian day 300 and I only want to go back 30 days, so I want to look at things that are between 270 and 300 as being in my window. Okay. To get that, we're going to use the lubridate package, which we already have loaded. We'll say julian_day = yday(date), and now we get this julian_day column: January 13 is Julian day 13. I now want to generate the difference between the Julian day and today. So for today_julian, today() will give us the date, and of course I can then wrap that in yday() as well to get the day of the year: today is day 267. I'm recording this on September 24. And so now we can use that to calculate the difference between today and each Julian day: we'll define a variable diff as today_julian minus julian_day. Now we see that we've got days that are negative, which would be a couple of days down the road: this September 30, 1934 date is six days from now in terms of Julian days, whereas this June 19 was about 97 days ago.
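In code, that Julian-day bookkeeping looks something like this (a sketch using the column names I've been describing):

```r
# Day-of-year arithmetic with lubridate; d is the sampled data frame
d <- d |>
  mutate(julian_day = yday(date),           # day of the year, 1 to 365/366
         today_julian = yday(today()),      # 267 when I recorded this, September 24
         diff = today_julian - julian_day)  # positive = in the past, negative = upcoming
```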
So now I can begin to create a case_when() statement to define is_in_window. We'll do case_when(diff < window ...), and my window I will also define as a variable up here; I think 30 is a pretty good window to use. Now, if diff is less than window, then that is within the window, so that would be TRUE. But diff could be less than window and be negative, right? Negative six is less than window, but that's not within the window. So I should say diff < window & diff > 0, then TRUE. Running that, we now see that these values are TRUE, like this one: the difference is six, so that's TRUE. Now we want to build out the other scenarios where it's FALSE, and see if there are any special situations. Okay, so if diff is greater than window, then that's going to be FALSE; like this case, which is 254 days ago, back in January, and that's not within the window. We'll go ahead and run that, and now we see that, yes, January is FALSE, but September 18th is TRUE. Okay. Now we also have these dates that are in the future, and that's not what we want either: if diff is less than zero, then that should be FALSE as well. So that takes care of all of our differences. But what if we were looking at a date in early January, and we wanted to go back 30 days or 100 days? If we're on January 1st and we're looking back 30 days, then our difference is going to be negative for dates in late December. So there are times, say at the beginning of the year, where we would want to keep a negative value. To test that out, I'm going to temporarily modify a variable: I'll set today_julian to be equal to 14, for January 14th, and I'll make my window 100. If we go back 100 days from the 14th, we should be sure to get these October dates within that window. Okay, run that. And now if we rerun this pipeline, we find that, yep, this January 13th is in the window, but these October dates are not, so we need to modify our code so that those are included. The first condition would be that today_julian has to be less than the window; we'd have to be within the first 100 days of the year, in this case. Then our difference plus 365 would have to be less than the window for it to be TRUE. So again, today, day 14, is less than 100, the window I'm using for this testing example. And then the difference, this negative 278, plus 365: if I do that math, negative 278 plus 365 is less than 100, so this should be TRUE; that date is still within the window. Okay. So if we run this, after fixing a comma I forgot at the end here and setting this condition to TRUE, what I now see is that yes, this January date is in the window, and these two October dates are also within the window. I'll go ahead and break this line across two lines to keep it from scrolling around, and I'll put back our real today_julian and window and rerun this. And now again we see that we've got those two dates that are within the window.
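Here is the full case_when() as I've built it up, including that year-boundary case. Treat it as a sketch of the logic rather than a verbatim copy; remember that case_when() uses the first condition that matches:

```r
window <- 30  # size of the look-back window, in days

d |>
  mutate(is_in_window = case_when(
    diff < window & diff > 0 ~ TRUE,                     # within the last `window` days
    diff > window ~ FALSE,                               # too long ago
    today_julian < window & diff + 365 < window ~ TRUE,  # window wraps into the previous year
    diff < 0 ~ FALSE                                     # still in the future
  ))
```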
So I can then pipe this to a filter on is_in_window, and because is_in_window is a logical already, I don't need to worry about saying is_in_window == TRUE; it naturally evaluates to TRUE or FALSE, and if it's TRUE, we get those lines out. Now we see that we've got all values of TRUE for the is_in_window column, and we've gone from 1,000 lines down to 78, which simplifies things considerably. Again, I'm only interested in summing the precipitation values for the cases that are within the window. To sum together things that are in the window for the same year, I need to create another column: year = year(date). That gives me my year, and that works. But if I've got the situation where I'm at the beginning of the year, then I'm going to have things from the previous year that need to go with the current year. I'll go back to having today_julian be 14 and my window be 100, because I need to do some logic on this year column. To add a year to those December dates, I'll use the if_else() function: if diff is less than zero, for cases like these where it's got a negative value, and it's in the window, so if_else(diff < 0 & is_in_window, ...), then I want to take year and add one; otherwise, I want to use the current year. Now that I have the correct year and I know the date is in the window, I can summarize by doing a group_by on id and year, since we're looking at each station in each year for the data within the window. I can then do a summarize with prcp = sum(prcp), and I get three columns back: the id, the year, and the precipitation. Now, remember that we've got 36 files, and there might be some weather stations that are cut across different files, so we'll want to repeat this summary once we've concatenated everything back together. I'm also going to add .groups = "drop", though that's probably not super necessary for this application. Since I've shrunk the data frame so much, I wonder if I really need to output this to a file, or whether I can just do all 36 or 37 files at once and then output a single file. So I'm going to bring this pipeline together: I'll get rid of that slice_sample, because we want all million lines, not just 1,000, and run this. And I actually want to leave off the write_tsv for now; I want to see how big the result is, because maybe we can process all 37 files together without having to send them out to temporary files that we then later pool together. Maybe we can do the pooling right here. I like this idea of not writing things back out to another temporary file, so I'm going to put this all in a function. We'll say process_x_files <- function(x), grab this pipeline, tab it over a smidge, and replace the path with x, because that's what will get passed in from our map function, and then close it out. Something like the sketch below.
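Pulling the pieces together, the function ends up looking roughly like this; read_dly_chunk() is the helper sketched earlier, window is the look-back size defined above, and as before this is my reconstruction rather than the exact code:

```r
# Take the path to one gzipped chunk; return per-station, per-year
# precipitation totals for dates inside the window
process_x_files <- function(x) {
  read_dly_chunk(x) |>
    mutate(julian_day = yday(date),
           today_julian = yday(today()),
           diff = today_julian - julian_day,
           is_in_window = case_when(
             diff < window & diff > 0 ~ TRUE,
             diff > window ~ FALSE,
             today_julian < window & diff + 365 < window ~ TRUE,
             diff < 0 ~ FALSE)) |>
    filter(is_in_window) |>
    # dates with a negative diff wrapped around from the end of the
    # previous calendar year, so they count toward the following year
    mutate(year = year(date),
           year = if_else(diff < 0, year + 1, year)) |>
    group_by(id, year) |>
    summarize(prcp = sum(prcp), .groups = "drop")
}
```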
Now let's load process_x_files and try it: process_x_files("data/temp/xab.gz"). There again, we get a similar number of rows in our data frame. I now want to create a vector that I can map over with process_x_files. We'll do list.files("data/temp"), which gives us our .gz files; I want the full path, though, so I'll add full.names = TRUE, which gives us the paths, and I'll call these my x_files. Then we can use map_dfr() to map over all of the elements of that vector: we take x_files and send them to process_x_files. Because this might take a little time to run, I'm going to add a line inside the function to print out the value of x, so I know how many files we've processed. Alright, so now I can go ahead and run the map_dfr. That took about 10 minutes to run through, and you'll see that the output is about a 2.9-million-row data frame. As I mentioned, we summarized within each of the 36 files, but there may have been stations and years that spanned different files. So I'm again going to do my trick where I call this d, for data, and assign it .Last.value; .Last.value is whatever came out of the last command, in this case the result of the pipe. If I look at d, I can then pipe it to group_by(id, year) and then summarize, where again we sum the precipitation: prcp = sum(prcp). That reduced the number of lines by about 10, so not a big difference, but still important. I'll add .groups = "drop", and then pipe this out to write_tsv(). I will save it into data/ and call it ghcnd_tidy.tsv.gz; because of that .gz extension, it will be a gzipped output file. If I look at the size with ls -lth data, I see my tidy version is 15 megabytes, a very palatable file size. That's great. I'm going to connect this up to the whole map pipeline, and that will then be the output file. Next, I want to make this script executable, because remember, I want to call it from my bash script. To do that, we need to add a shebang line to the top of our R script, which is #!/usr/bin/env Rscript, and we can make it executable by doing chmod +x code/read_split_dly_files.R. I can come back to my bash script and then add the line code/read_split_dly_files.R. Then, to clean things up, I need to get rid of all of those temp files, so I'll add rm -rf data/temp, which will remove that directory. As for my target, I can actually remove this ghcnd_cat file, because that was the previous output of concatenate_dly.bash. So let's head back down to the rule, right here; I'm going to bring it down to the end and save it there, and I'll name the rule summarize_dly_files. For my script, I've got concatenate_dly.bash, which I'm going to call bash_script. We still want that tarball as an input. I also have an R script, which is going to be code/read_split_dly_files.R; remember the comma. Then here, for the shell command, I want {input.bash_script}. And my output file is going to be this target up here, which I can plop in; save that. So we have our rule (it's sketched out below): the data input is the tarball, and I have these two scripts. Although I only call the bash script from the rule itself, I still need to include the R script as a dependency. And I've got the target up here. So now what I can do is run snakemake -c 1. By default, it will use that first rule, targets, and it will hopefully see that those first four targets are already made, and then go ahead and make my tidy .tsv.gz. That should take about eight or nine minutes, and hopefully everything works well. Wonderful.
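For reference, the rule I just described comes out looking something like this in the Snakefile. It's a sketch; the rule and input names are how I referred to them here, so double-check them against your own file:

```python
rule summarize_dly_files:
    input:
        tarball = "data/ghcnd_all.tar.gz",
        bash_script = "code/concatenate_dly.bash",
        # only bash_script is invoked, but the R script is listed so
        # Snakemake re-runs the rule when it changes
        r_script = "code/read_split_dly_files.R"
    output:
        "data/ghcnd_tidy.tsv.gz"
    shell:
        "{input.bash_script}"
```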
So that took about 15 minutes to run, from the concatenation of the tarball all the way through the processing here in R. If I again do ls -lth on data, I see that my temp directory is gone, which is good, and I've got my ghcnd_tidy.tsv.gz file at 14 megabytes. Awesome. So let's go ahead and clean up some things. I see that in my project root directory I've got all those x files, so I'll do rm x*. Looking again in my home directory, I see I've got a hold directory, and I'm not sure what that is; it must have been something I made while I was practicing, so I'll do rm -rf hold. Now git status looks good, and we can commit everything that we've changed. We'll do git add with a period, because I know what I'm committing; usually that period is a little bit risky. I'll do git commit -m "summarize precipitation for window for each station and year", check git status, and go ahead and git push. And we are good. I know this was a bit of a long episode, and we covered a lot of really cool topics. The main bash concept that I covered with you all was the split command, as well as seeing how we can tie bash together with our R scripts. None of the stuff that we did in R was really new, but we saw it in different contexts, and in particular in the context of running it from a bash script. I think we've made a lot of great progress today, although it was a long and winding road, and we're ready to go on and start thinking about how we're going to further summarize our data to make that great visual that we're so excited to make. So that you don't miss that episode, please make sure that you subscribe to the channel, click the bell icon, give me a thumbs up, and let your friends know what we're doing here. All right, take care, and I'll see you next time for another episode of Code Club.