Hey folks, welcome back for another episode of Code Club. In this series of episodes, I am trying to build out an image that shows the world colored by the degree of drought being experienced at each latitude and longitude. This is going to take a number of steps. In the last episode, I showed you how I go about organizing a project directory, how I connect that with GitHub, and how I keep track of my software dependencies using a tool called Conda, along with a companion tool called Mamba. In this episode, we're going to get the data that we'll be analyzing in our project. We've certainly seen in previous episodes how we can do this using the browser as well as using R. In today's episode, we're going to use a different tool called wget, and we're going to write a bash script using wget to download the files we want. That makes it all automated, so that it runs directly from the command line without having to go into R or use any other tool like a browser.
I'm over in Chrome. I'm going to go ahead and create a new tab and do a search for NOAA GHCN. That gets us a variety of different things. So Global Historical Climatology Network, that's GHCN, the daily. There's a variety of different results, and I'm pretty sure that we actually don't want the top one. We want this third one. This brings us to the daily summaries page. We want the daily data, right? We don't want it aggregated. As you scroll down through here, you see there's a variety of different types of data available to us. I want the NCEI direct download. This brings us to a web page that looks very low frills, right? We can see there are all sorts of different links in here. If I scroll down, there's a readme.txt file. Maybe zoom out a little bit. All right.
You can see that it tells you how to cite this information, and then there's the download quick start. So what do you want to know, right? Start by downloading ghcnd-stations.txt. Remember that file, which has the metadata for all stations. It doesn't have the actual data, but the metadata for each of the stations, including where the station is, right, the latitude and longitude of each station, maybe information about where it is. Then we want to download one of the following tar files: ghcnd_all, ghcnd_gsn, or ghcnd_hcn. If we want all of the data, we want this ghcnd_all.tar.gz file. We can then uncompress and untar the contents to get something like 100,000 different files with all of the daily weather data for each of those weather stations. If you come down here, you'll see that all is a directory with .dly files, which are the files for all of GHCN daily. Again, that's what is in this ghcnd_all.tar.gz file: it is the compressed version of the all directory. We see that down here, right? So ghcnd_all.tar.gz is the tar file of the gzip-compressed files in the all directory. Yeah, so here now I see ghcnd_all.tar.gz. That's 3.3 gigabytes. It is big. That's what we want, because that has all of the data that was in the all directory.
Okay, so I'm going to come back to Visual Studio Code. Inside of my code directory, I am going to make a new file, which I will call get_ghcnd_all.bash. So now this is a bash script. At the top of this bash script, I need to indicate that this is a bash script, so that the system will know that all of the commands that follow are bash commands, kind of like what I might run down here at the dollar sign in my drought index directory.
So the first line is the shebang line, which starts with pound exclamation point (#!). This tells the command line which interpreter should run the commands that follow in this script. So I'll do #!/usr/bin/env bash. Again, that says the commands that are about to follow are bash commands. So what I could do would be to say ls, right? I'll go ahead and save that. Down here at my dollar sign, I could type bash code/get_ghcnd_all.bash. That is then running the ls command, much like if I were to run it down here. If I were to type ls here, I'd get the same thing. Okay. So this is kind of a silly script, but I'm using it to illustrate a couple of things.
I don't really need to say bash and then the name of the bash script, because I can make this bash script an executable. If I do ls -lth code, I can see that get_ghcnd_all.bash is readable and writable, but it's not executable. I can make it executable by doing chmod +x code/get_ghcnd_all.bash. Now if I rerun ls -lth, I see these x's in here, which indicate that get_ghcnd_all.bash is executable. So now I can do code/get_ghcnd_all.bash, run that, and it runs that ls command without me having to say the type of software that needs to run the script, because it's going to grab that from the shebang line. I'm sorry if that's a little bit long-winded, but that is a way that we can get an executable script. What this allows us to do is put any commands in here that we want to be able to run from the command line, but that we don't want to have to manually type out each time.
So we'll come back here, and I'm going to get the link to ghcnd_all.tar.gz. I'll right click and then copy link address.
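The shebang and chmod steps just described can be tried safely in a throwaway directory. This is only a sketch: the script name hello.bash and its one-line body are made up for illustration and stand in for the ls example from the episode.

```shell
#!/usr/bin/env bash
# Work in a temporary directory so nothing in the real project is touched.
workdir="$(mktemp -d)"
mkdir -p "$workdir/code"
script="$workdir/code/hello.bash"   # hypothetical script name

# Write a two-line script: the shebang line, then one ordinary command.
printf '%s\n' '#!/usr/bin/env bash' 'echo "hello from a script"' > "$script"

# A newly created file is readable and writable, but not executable.
if [ -x "$script" ]; then before="executable"; else before="not executable"; fi
echo "before chmod: $before"

# Without the execute bit, the interpreter has to be named explicitly.
bash "$script"

# Add the execute bit. From here on the shebang line picks the interpreter.
chmod +x "$script"
"$script"

# The permission string now contains x's.
ls -l "$script"
```

Both runs print the same line. The only difference is whether you name bash yourself or let the shebang line do it.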
Now what I could do would be to run wget on that link. One thing to know is that wget doesn't come with a Mac by default. Who knows? What I can do instead is add wget to my conda environment. I need to make sure that I'm in my conda environment, so I can do conda env list. I'm in my base environment, so I need to conda activate drought. I'll come over to my browser and search for wget on conda. What I see is that there is a conda-forge version of the wget package, wget 1.20.3. So I can come back over here and do mamba install -c conda-forge, and then wget=1.20.3. Double check that number. Yep, 1.20.3. We'll go ahead and run that. Very good, that all installed. Now I can do wget --version, and I should see 1.20.3. Let's see. Yeah, wget 1.20.3.
I'm going to come back up to the previous command, because I want wget 1.20.3, and I will add that to my environment file as a dependency. So wget 1.20.3. I can save that. I'm not going to burn down and then rebuild the environment right now. I think this is a pretty minor addition. Maybe before I finish the whole project, I will burn down the environment and recreate it again. When I say burn down, what I mean is delete and then recreate. Okay, so I can go ahead and close that environment file.
So now I have wget available to download this tar.gz file. I would like to have it go into a specific directory, though. To put it in a specific directory, I can use -P, and I want it to go into data. We'll go ahead and save that, and let's get a fresh line here on the terminal. Like I showed you before, we can then do code/get_ghcnd_all.bash. Now it is downloading. This is going to take about 30 minutes to download, as it shows, on my internet.
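A minimal sketch of the download step described above. The URL is an assumption: it is never spoken in the episode, so substitute the link you copied from the NCEI direct download page. The wget line is wrapped in a function (download_archive, a name invented here) only so that nothing starts downloading until you ask for it; in the episode the wget line sits directly under the shebang line.

```shell
#!/usr/bin/env bash
# wget was installed into the conda environment beforehand with:
#   mamba install -c conda-forge wget=1.20.3

# ASSUMPTION: this URL is not given in the episode. Replace it with the
# link address copied from the NCEI direct download page.
archive_url="https://www.ncei.noaa.gov/pub/data/ghcn/daily/ghcnd_all.tar.gz"

# -P (long form: --directory-prefix) tells wget which directory to save
# the downloaded file in, here the project's data directory.
download_archive() {
    wget -P data/ "$archive_url"
}

# Nothing is downloaded until the function is called:
#   download_archive
```

Run from the project root, calling download_archive would save the archive as data/ghcnd_all.tar.gz.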
I guess it's getting a little bit faster as it goes here. I'm going to go ahead and open up another bash shell, so let's make a new one with the plus sign. Now if I look at ls data, I see ghcnd_all.tar.gz there. If I do ls -lth and run that command multiple times, I see that the file is getting bigger, and it is going into the right location. If I do git status, I see that data/ghcnd_all.tar.gz doesn't show up. That's because in the previous episode, I added the data directory to my .gitignore file. We don't want to be putting a gigantic file like this up onto GitHub, and that's the reason we added it to .gitignore. Again, this is running to download the ghcnd_all.tar.gz file into my data directory. Very good.
So that's running, and I can come back to it by going over here. We still have about 11 minutes to go. Initially, it says it's going to take a really long time, and then it kind of speeds up once it figures out how fast my internet connection is. This is clearly not something I want to be doing a whole bunch of times, which is why I have a single script to do it. I can download this file whenever I want a new version of it, without having to put all of my code in a single script where I'm both downloading it and processing it. That would just be painful if I wanted to make subtle changes to the analysis and had to download the file over and over again. So this script will be really helpful, because you can imagine rerunning it every day to get a fresh version of the data.
I'm now going to repeat this to get the other files that they suggested we get. If I come back, I find the ghcnd-stations and ghcnd-inventory files. So I'm going to go ahead and create new bash scripts for those.
One of the nice things about being here in Visual Studio Code is that I can have multiple terminals open at the same time. So I'm going to go ahead and create another bash script, which I'll call get_ghcnd_stations.bash. I will again copy over the shebang line, and again we'll do wget -P into data. I'm now going to grab the link for the stations file, ghcnd-stations.txt. I'll copy that link address, put it in there, and save that. We'll go ahead and do our chmod to make it executable, because if I do ls -lth code, I see that I've got that new bash script, but it's not executable. So again, I'll do chmod +x, and I need to give it code/get_ghcnd_stations.bash. Now if I do ls -lth code, I see that it is executable. We can then go ahead and do code/get_ghcnd_stations.bash, and it's downloading. This is going to take a couple of minutes to download. It's not gigantic, but it is still about 10 megabytes.
Now I want to also get the inventory file. So I'm going to create another script, which will be get_ghcnd_inventory.r. Again, I'm going to basically copy this script over and replace stations with inventory.txt. We'll go ahead and save this as get_ghcnd_inventory.r. I don't mean r. I mean bash, right? So I need to go ahead and rename that. I can come over here, click on it in the file explorer, click rename, make this bash, and save that. So now that's our bash file. I can then do chmod +x on code/get_ghcnd_inventory.bash. Great. And again, I can run that to get the inventory data. Very good. So that downloaded. I can look in my data directory with ls -lth data, and I see that I've got the all file, the inventory, and the stations. Those are stored in the data directory.
And again, just to prove it to myself, if I do git status, I don't see anything from the data directory showing up as not being tracked or needing to be committed. Great. So we have these three bash scripts. I'm still waiting on the all one to finish downloading. It looks like it might be another five minutes, so I'll go ahead and do some editing and we'll check back in with you in about five minutes.
So it took about 12 minutes to download on my home internet, which is not super fast, but certainly not something I want to be doing many, many times. As I was waiting for that to run, I was looking back at my three scripts here, and I realized that there's really only one difference between them: the name of the file that I'm trying to download from the daily directory. So I think what I'll do, instead of having these three different scripts, is have a single script, and I can then get each of the three files by giving it the name of the file I want.
I'm going to make a couple of new files here. The first will be get_ghcnd_data.bash. I'll grab all this, pop it into here, and save that. I can create a variable that I'll call file, and I'll say that equals $1. $1 means use the first argument after the name of the script on the command line. Then in here, what I can do is plop in, in curly braces, $file, and I'll save that. Now I can make it executable, so I can do chmod +x code/get_ghcnd_data.bash. Then what I could do would be code/get_ghcnd_data.bash, and the file that I want to give it would be, say, this inventory file. Now if I do that, I'm seeing that this was basically the URL it's trying to get, and it's got this percent-7 encoding around the file name. So I think it doesn't like my curly braces. Let's go ahead and just leave the braces off, and we'll try that again.
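The positional argument and the curly-brace hiccup just described can be sketched without downloading anything. The base URL is an assumption (it is not spoken in the episode), and show_urls is a made-up helper name. The transcript doesn't show exactly what was typed; one plausible reading is that the braces went outside the dollar sign, as in {$file}, which bash passes through as literal braces that wget would then percent-encode as %7B and %7D.

```shell
#!/usr/bin/env bash
# ASSUMPTION: base URL of the daily directory, not given in the episode.
base_url="https://www.ncei.noaa.gov/pub/data/ghcn/daily"

show_urls() {
    # $1 is the first argument after the function (or script) name.
    local file="$1"

    # No braces: the form that worked in the episode.
    echo "plain:   $base_url/$file"

    # Braces after the dollar sign: valid bash, expands the same way.
    echo "braced:  $base_url/${file}"

    # Braces around the whole thing: they stay in the URL as literal text.
    echo "literal: $base_url/{$file}"
}

show_urls ghcnd-inventory.txt
```

The first two lines end in ghcnd-inventory.txt, while the third ends in {ghcnd-inventory.txt}, which is not a file on the server.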
And so now that worked, right? I thought I needed the curly braces, but I don't actually need them. We now have ghcnd-inventory.txt, it downloaded it. What you see now is that it saves it to ghcnd-inventory.txt.1. If I look at ls -lth data, I see I've got ghcnd-inventory.txt and ghcnd-inventory.txt.1. So what I think I'd like to do is, if that file already exists, go ahead and remove it. I'll do rm $file. To clean this up a little bit, I'll also remove that ghcnd-inventory.txt.1, so I'll do rm data/ghcnd-inventory.txt.1.
All right, so let's try running this again. My rm is complaining that there's no such file or directory. That's because the file should be in data, so the rm needs to point at data/$file. We'll go ahead and try that again. And again, this is why we practice this with smaller files rather than the big files, because the big files take a long time to run. All right, so we'll run that again. We don't see any error messages up here for the rm, and we see that it's now downloading to ghcnd-inventory.txt in data. Now if I do ls -lth data, I see that I've got the inventory. Let's go ahead and rerun that with ghcnd-stations. So we'll remove inventory and do stations.txt. We get the stations, no error with the rm, and that worked great.
So I'm going to create a new script that I will call driver.bash. This driver.bash is going to be the driver, right, it's going to call all of my scripts. I'm going to give it the shebang line and pop that at the top of driver. Then we can do code/get_ghcnd_data.bash, and I can put in the all, the inventory, and the stations files. So I'll come back to driver. That will run the script to download the archive. Let me go ahead and grab this. All right.
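Putting the pieces above together, the general download script might look roughly like this. It is a sketch under assumptions: the base URL is not given in the episode, the logic is wrapped in a function (get_ghcnd_file, a name invented here) so that nothing downloads until it is called, and rm -f is used instead of the plain rm from the episode so that a missing file is not reported as an error.

```shell
#!/usr/bin/env bash
# ASSUMPTION: base URL of the daily directory, not given in the episode.
base_url="https://www.ncei.noaa.gov/pub/data/ghcn/daily"

get_ghcnd_file() {
    # First argument: the name of the file to fetch.
    local file="$1"

    # Remove any earlier copy first. Otherwise wget keeps the old file and
    # saves the new download next to it with a ".1" suffix.
    # -f keeps rm quiet when there is no earlier copy (a choice made for
    # this sketch; the episode uses plain rm).
    rm -f "data/$file"

    # Save the fresh download in the data directory.
    wget -P data/ "$base_url/$file"
}

# Nothing is downloaded until the function is called, for example:
#   get_ghcnd_file ghcnd-stations.txt
```

In the episode the same three lines live in code/get_ghcnd_data.bash, with file=$1 reading the script's first command-line argument, and driver.bash calls that script once per file name.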
Now I want to put in ghcnd-inventory.txt, I'm pretty sure it was, right, that's txt. Then we also want the stations, which is also a txt file, so we can put that there. Again, I'll do a chmod +x on driver.bash so that I can do ./driver.bash to run it. That means look for driver.bash in the current directory.
While that's all running, I'm going to open up another bash shell, and just in case, activate my drought environment. I'm going to get rid of those extra ghcnd scripts, so I'll do rm on code/get_ghcnd_stations.bash, then inventory, and then all. Now if I look at git status, I see that I've modified the environment file, and I've got this general get_ghcnd_data.bash script as well as my driver.bash. All right, so that all ran through. If we look at ls -lth data, we see that we have those three ghcnd files.
To help motivate something for the next episode, what I'd like to do is one more step, and that is to get a listing of all of the files in this ghcnd_all.tar.gz file. Okay. Again, this file is a tar.gz, which means it's compressed. And what is compressed? Well, what's compressed is a tar file. A tar file is a bunch of files all stuck together. If you think about tar, if you were covered in tar, you're going to stick to a bunch of other stuff, right? Perhaps you've heard of people being tarred and feathered a few hundred years ago, pretty atrocious. But things stuck together. So you take a bunch of files, you stick them together, and then you compress them. That's a tar.gz file. You can use the tar command to figure out what is in there.
So we can do tar tvf data/ghcnd_all.tar.gz. Running this will output the listing of all of the files, as well as when they were made, their size, all sorts of other good stuff like that. So I had run tar tvf: that v is for verbose, and that t is for listing out the files that are in there. If I do tar tf on data/ghcnd_all.tar.gz, I get a listing of just the directory and the name of each file in there. What I'd like to do is take that command and redirect the output to a file, okay?
I'm going to come back up here to code and create a new bash script that I'll call get_ghcnd_all_files.bash. Again, we will grab the shebang line and plop that at the top. I'm going to grab this tar command and paste that in. There's no need to retype everything when Command-C and Command-V are your friends to copy and paste, right? Now I can output this to a data file, so I'll do data/ghcnd_all_files.txt, and we'll save that. Then I can again chmod code/get_ghcnd_all_files.bash to make it executable, and now I can run it. So I'll copy and paste this down.
While that's running, I'm going to add it to the driver. This will get out the names of all the files that are in the archive. Maybe what I'll do to organize this a little bit better is make it like this, and I can put in comments. So I could say: get all of the daily data from all weather stations and generate list of stations. Then here I can add another comment to say: get listing of types of data found at each weather station. And for this third one, I can say: get metadata for each weather station. Okay, cool. So now we have comments, we have the code, and we're in good shape. This finished running, so I can now look at the output file.
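The difference between tar tvf and tar tf, and the redirect into a text file, can be tried on a tiny stand-in archive instead of the 3.3 gigabyte download. This is a sketch: the station file names below are made up, and the archive is built in a temporary directory.

```shell
#!/usr/bin/env bash
# Build a small stand-in for data/ghcnd_all.tar.gz in a temporary directory.
workdir="$(mktemp -d)"
cd "$workdir" || exit 1
mkdir -p ghcnd_all data

# Made-up station files; the real archive holds one .dly file per station.
echo "placeholder" > ghcnd_all/STATION_A.dly
echo "placeholder" > ghcnd_all/STATION_B.dly

# c = create, z = gzip-compress, f = name of the archive file.
tar czf data/ghcnd_all.tar.gz ghcnd_all

# t = list the contents, v = verbose (permissions, owner, size, date).
tar tvf data/ghcnd_all.tar.gz

# Without v, only the names are listed. The > sends them to a file
# instead of the screen.
tar tf data/ghcnd_all.tar.gz > data/ghcnd_all_files.txt
cat data/ghcnd_all_files.txt
```

The listing file has three lines: the ghcnd_all/ directory itself plus the two .dly files. Neither tar command unpacks anything; both only read the archive's table of contents.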
So I'll do head data/ghcnd_all_files.txt. I now see that we've got an empty directory entry, and then I have all the .dly files. So maybe what I'd like to do is modify this to only return those lines that have .dly. We can do that very straightforwardly by piping. We've talked about pipes in R; well, we can also do pipes in bash. The pipe here is a vertical line. I can do grep, and then in quotes, the pattern that I want to match, so I'll do .dly. Then we'll redirect that output to this file here, data/ghcnd_all_files.txt, and that should get rid of that directory entry. We'll go ahead and save that and rerun this. We'll do another head on that, and now we see that the directory ghcnd_all that we had here is gone, and it picks right up with the .dly files. I could also do wc -l on ghcnd_all_files.txt to figure out how many lines are in the file, and there are 122,010 files in that compressed archive. So there's a lot there, right?
Maybe one thing I'll do, to give it a column name, would be to do echo. I'll say echo file_name, and then I'll output that name, file_name, to this text file. So I'm going to copy that up here. The single greater-than sign will create and write to that file. Basically, if we run these two lines as they are, then we're only going to get what we had before, because the first line will write that word, file_name, to the text file, but then the next line will create a new file and write over it, and we'll lose the header. If we want to append, then we need to do two greater-than signs, okay? So we'll go ahead and fix that, and then I'll rerun our code. Let's do another head on this. We now see that at the top, we have a column called file_name. Of course, this is a file with one column.
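The pipe into grep and the difference between > and >> can be sketched with a stand-in for the tar listing. The list_archive function and the station names are made up for this example; in the real script the left side of the pipe is tar tf on the archive. The pattern here escapes the period, which is slightly stricter than the quoted .dly used in the episode, where the period matches any character.

```shell
#!/usr/bin/env bash
workdir="$(mktemp -d)"
out="$workdir/ghcnd_all_files.txt"

# Stand-in for: tar tf data/ghcnd_all.tar.gz (made-up station names).
list_archive() {
    printf '%s\n' 'ghcnd_all/' 'ghcnd_all/STATION_A.dly' 'ghcnd_all/STATION_B.dly'
}

# A single > creates the file, or overwrites it if it already exists,
# so the header line goes in first.
echo "file_name" > "$out"

# The pipe hands the listing to grep, which keeps only lines containing
# ".dly". The double >> appends, so the header written above survives.
list_archive | grep '\.dly' >> "$out"

# Show the result and count its lines (header plus two file names).
cat "$out"
wc -l < "$out"
```

Swapping the >> for a single > would overwrite the header, which is the mistake the episode walks through.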
But hey, you can perhaps see down the road where this might be useful as a table, and where having a column name for that table would help. So we're trying to help future us out a little bit as we engineer the output of these files.
So where are we now? Well, we have three new bash scripts as a result of this episode. We have a script that will download data from the GHCN website. We have a script that will extract the names of all the files in one of those archives. It's not decompressing or unpacking that archive, it's getting us a listing of the file names. And then we also have this driver.bash script. The nice thing about this driver script is that it tells us all the things that need to be run. The downside, though, is that it doesn't tell us whether things actually need to be run. For example, the script that extracts the file names doesn't need to be run if we haven't downloaded a new version of ghcnd_all.tar.gz. So this doesn't keep track of our dependencies. We can imagine that over the next seven to 10 episodes, this driver script might get longer and longer, and we might have more and more dependencies between the different files and the different scripts.
What we're going to do in the next episode is learn a new tool that's a lot like a tool I've used in the past called Make. But we're going to use something new, and that's called Snakemake. That will make it really straightforward to keep track of all the dependencies that we need to get a final output of our drought indices shown as a map in a PNG file up on a website. Doing this within Snakemake, with Conda, with Git, with all these tools coming together, will really help to make a reproducible workflow that, again, we can run every day. So be sure to check out the video that I've got linked over here, which will show you exactly how we do that in the next episode.
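To make the dependency idea concrete, here is one way a plain bash script could decide whether the file listing is out of date. This is not something the episode does, and it is not how Snakemake works internally; it is only a sketch of the "does this need to be rerun?" question, using bash's -nt (newer than) file test on empty stand-in files. The needs_update name is made up.

```shell
#!/usr/bin/env bash
workdir="$(mktemp -d)"
archive="$workdir/ghcnd_all.tar.gz"      # stand-in for the downloaded archive
listing="$workdir/ghcnd_all_files.txt"   # stand-in for the file listing

# True when the output ($2) is missing or the input ($1) is newer than it.
needs_update() {
    [ ! -e "$2" ] || [ "$1" -nt "$2" ]
}

# The archive exists but the listing has never been made.
touch -t 202001010000 "$archive"
if needs_update "$archive" "$listing"; then first="rebuild"; else first="skip"; fi

# The listing was made after the archive, so it is up to date.
touch -t 202101010000 "$listing"
if needs_update "$archive" "$listing"; then second="rebuild"; else second="skip"; fi

# A fresh download makes the archive newer than the listing again.
touch -t 202201010000 "$archive"
if needs_update "$archive" "$listing"; then third="rebuild"; else third="skip"; fi

echo "$first $second $third"   # prints: rebuild skip rebuild
```

Writing a check like this by hand for every pair of files gets tedious quickly, which is the motivation for handing the bookkeeping to a workflow tool.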
Keep practicing with this, and we'll see you next time for another episode of Code Club.