Hey folks, welcome back for another episode of Code Club. As you probably know by now, I'm in the midst of a series of episodes where I'm building a data visualization of the world, showing the level of droughtiness for, say, the previous month relative to that same window of the year over the past 100 years or so. To do this, we're downloading a large dataset from the NOAA website. It comes to us as a tarball that is 3.3 gigabytes compressed; decompressed, it's 29 gigabytes. The engineering challenge I have is that ultimately I want to put my entire project up on GitHub and run it with GitHub Actions, and there they limit me to 14 gigabytes of drive space and 14 gigabytes of RAM. I get three processors, and I believe there's a cap on run time as well. So the challenge I'm running into is this tarball: if it's 3.3 gigabytes compressed and 29 gigabytes decompressed, I clearly can't decompress it and then work with the files individually.

In the last episode, we talked about using the archive R package to read individual files from that tar archive. The problem is that with a single processor it was going to take about 17 or 18 days to get all the data out of the archive that way. Sure, I could parallelize it across the three processors, but we'd still be looking at about a week just to get the data out, and I think that's beyond the length of time a GitHub Actions job can run. So this again is a constraint, and I was really annoyed by it, because it's something I didn't want to have to worry about. But I kind of like constraints, in a way, because they force you to think differently: to really think about your tools and your problem, and to work through things from another angle.

That got me thinking about the tar tool on the command line, in the bash shell I'm using here on my Mac, which you can also get on Windows through the Windows Subsystem for Linux, and which obviously also runs on Linux. Are there options I can use with tar to get the data I want, in the format I want, without having to decompress the entire tar archive? So I'm going to head over to Visual Studio Code and walk you through some of my thinking. As we go through this, the fundamental tool we'll use to reach a successful outcome is a pipe. We've talked about pipes in R; now we're going to talk about pipes in bash.

Go ahead and open up a blank document to do some work in. We might save it in the end, but for now I just want a place to keep the commands I'm running down below in the console. OK, so let's think about the problem we have. We have about 122,000 files that are all bundled together into a big tarball, and that file was then compressed using gzip to make it smaller. What I'd like is to get the individual files out of that tarball, with each of them compressed. My hope is that, compressed, they'd total about the same size as the compressed tarball. That's not exactly the way the compression algorithms work, but it should be pretty similar; certainly it should not be 29 gigabytes like the decompressed version. Let's go ahead and look at the documentation for tar. If we run tar --help, we can see the arguments the tar command takes, and we see -c for create.
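As a quick, hedged recap before we dig in, here are the flag combinations from past episodes that I'll be referring to; the archive path is the NOAA download in my data directory, and archive.tar.gz and some_directory/ are just placeholders:

    # list the contents of a gzip-compressed archive (t = list, v = verbose, z = gzip, f = file)
    tar -tvzf data/ghcnd_all.tar.gz

    # extract everything from the archive (x = extract)
    tar -xvzf data/ghcnd_all.tar.gz

    # create a new gzip-compressed archive from a directory (c = create)
    tar -cvzf archive.tar.gz some_directory/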
We can see -f for the file name, which we've used before, and -v for verbose. In the past we did things like tar -xvzf: -v for verbose, -f for the file name of the archive, -z for gzip compression, and -x, right here, for extract. We've also seen tar -tvf, where -t is to list, -v is verbose, and -f is the file name. And we could create an archive with something like tar -cvzf: create, verbose, gzip compression, and then the file name following -f. So there's a lot we can do in here. Something else we've seen is -C, which changes to whatever directory you state before processing the files.

As I look through the help, though, nothing sticks out as a way to get individual files out the way I want. To remind you what that looks like, let's come back and grab one of these .dly files; any of them will do. We can do tar -xvzf data/ghcnd_all.tar.gz and then the path to extract: ghcnd_all/ followed by the name of that .dly file. Running this makes a directory in my project root called ghcnd_all containing that .dly file. And you can see how long it takes to extract a single file: it wasn't just the functions from the archive package that were slow; pulling a single file out of the archive is itself quite slow. So this extracts the file; I can do ls ghcnd_all and see the file in there.

I'm going to extract a couple of other files so I have something to play with as I test different approaches. I'll repeat my tar command with those three file names, and it extracts the three .dly files into the ghcnd_all directory in my project root. Having extracted them, I'll make a practice tarball. We'll do tar -cvzf; again, -c is for create, where -x is for extract. We'll create practice.tar.gz (I could also use the .tgz extension, which you'll see as a shorthand for .tar.gz) and give it the ghcnd_all directory, and that adds those three files to my tarball. Then I can do the listing; do you remember that argument? It was -t. So tar -tvf practice.tar.gz shows me ghcnd_all with those three files included.

Looking back at the arguments for tar, nothing stood out as a way to sequentially go through and extract each of these files in turn. My vision was to pull out a file and gzip it, pull out the next file and gzip it, and so forth. But if I think about what I really want, it's a concatenated file: all 122,000 files joined together. An advantage I have is that each line of a .dly file starts with the identifier of the weather station. So I've got this file, US1IACW0005.dly, and if I do a head on it, I can see, if my screen stops moving, that same identifier, US1IACW0005, right there at the start of each line. So if I can somehow get all of the files out, concatenate them together, and then compress the result, that would be a success too.
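To put that together as a sketch: the first station ID is the one from above, and the 0006 and 0008 IDs are my guesses at the other two small files I mentioned:

    # pull three small station files out of the big archive
    tar -xvzf data/ghcnd_all.tar.gz \
        ghcnd_all/US1IACW0005.dly \
        ghcnd_all/US1IACW0006.dly \
        ghcnd_all/US1IACW0008.dly

    # every line of a .dly file starts with the station identifier
    head ghcnd_all/US1IACW0005.dly

    # bundle the three files into a small practice tarball, then list it
    tar -cvzf practice.tar.gz ghcnd_all
    tar -tvf practice.tar.gz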
And so, looking back at those arguments, there is one that sticks out at me as interesting: -O (that's a capital letter O, not a zero), which writes the entries to stdout instead of restoring them to disk. stdout is standard output. So if I do tar -Oxvzf practice.tar.gz, the same arguments I'd use to extract plus the capital O, it should output the contents of my entire practice tarball to the screen. Running that, sure enough, I get everything in practice.tar.gz printed out.

The next thing I want to do is write this out to a file, and I can do that by redirecting with the greater-than sign; I think we talked about this a few episodes back. I'll call the file practice.output. What this means is: create a file called practice.output (if it already exists, recreate it as a blank file) and send everything from the command on the left side of the greater-than sign into it. If I use two greater-than signs instead, the output is appended to the file if it already exists. So now I've got practice.output, and if I look inside it, I see all of my data files in there: the ones ending 0008, 0006, 0005. I think these are relatively small data files, but you can see that they've been concatenated together. Cool.

So let's come back to our terminal. What I'd like to do next is gzip it: gzip applies its compression algorithm to any file you give it, so I can give it practice.output. Now if I do ls -lth, I see practice.output.gz, a compressed version of practice.output. I can do gzip --help to get the help documentation, and to help us see what's going on, I can use -k to keep the input file. I can also do gunzip to decompress practice.output.gz, and then repeat the gzip with -k to keep the input. If I do ls -lth again, I see that the concatenated file practice.output is 47 kilobytes, a bit more than 10 times the size of the practice tarball. But when I compress it, it becomes 3.8 kilobytes, which is actually slightly smaller than the tarball we started from. That's pretty cool.

All right, but you'll notice that we did this with an intermediate file. I don't want to generate that file, because at full scale it's too big to store on the computer I'll be using. So what can we do instead? Well, let's come back and look at the commands we've already entered. To see them, I can run history, which outputs all the commands I've run so far in my session. I guess it goes back quite a ways, because I certainly haven't run 531 commands for this episode, trust me. So I'll grab command 523, this one, and add the gzip -k after it. If I didn't want the intermediate file kept around, I could remove the -k. Pasting these in, it warns me that practice.output.gz already exists and asks whether I wish to overwrite: yes. Looking at gzip --help again, there's a -f flag to force overwriting, so I can do gzip -f, and after running that I no longer get the warning. Still, we have two steps, with practice.output as an intermediate file.
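Here is that two-step version as a sketch, intermediate file and all:

    # dump every entry in the archive to standard output (-O) and redirect into one file
    tar -Oxvzf practice.tar.gz > practice.output

    # compress it; -k keeps the input so we can compare sizes, -f skips the overwrite prompt
    gzip -kf practice.output

    # compare the sizes of the original and compressed versions
    ls -lth practice.output practice.output.gz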
That's not what I want. So, something cool about Linux commands is that you can actually pipe them into each other. As I showed you, ls ghcnd_all shows me there are three files in there. But say I had many more than three files, and I didn't want to count them by eye; how would I count the number of files in there? Well, there's another handy tool called wc. To show you a little about wc, we can run it on one of these .dly files: wc ghcnd_all/US1IACW0005.dly. Running that gives me four columns of output: 44 is the number of lines in the file, 1971 is the number of words, 11880 is the number of characters, and then the file name. If I try wc --help, it complains, because this version of wc doesn't have built-in help; the alternative is man wc, for the manual. If I make the window larger, I see the manual page for wc. Not every command-line program has --help documentation, and not everything has a man page, so it depends on the tool. What you'll notice is that there are a variety of arguments you can give wc: -c is the number of bytes, -l the number of lines, -m the number of characters (which ends up the same as -c for most purposes), and -w the number of words. I can press q to get out of the man page.

So if I do wc -l ghcnd_all/US1IACW0005.dly, I get 44. And here's the trick: I can take ls ghcnd_all and send its output to wc. That's done with a pipe, and in bash the vertical bar is the pipe character. This gets confusing, because in R the magrittr pipe is %>%, and the newer base R pipe is |>, a vertical bar with a greater-than sign. In bash it's a lone vertical bar, which actually looks like a pipe, whereas the others don't really look like a pipe. Anyway, ls ghcnd_all | wc -l tells me 3, which means there are three entries in that directory. OK, so we can take the output of one bash program and send it to another program. That's pretty cool, right?

So what does this have to do, Pat, with what we're doing here with tar and gzip? Come back to our script; I'll go ahead and close this .dly file. Because tar with a capital O writes to standard output, I can pipe it straight into gzip. If I copy that and run it at my prompt, gzip says 'standard output is a terminal -- ignoring', so I need to redirect the result to a file, which I'll call practice_pipe.gz. Copying and pasting that down, now I don't get any of the errors from the previous execution. If I ls -lth to list the files with their sizes, I see that practice_pipe.gz and practice.output.gz are the same size. And you'll notice that nowhere in here did I create the intermediate file.

I think we know enough now about tar, gzip, the pipe, and some of these other great command-line tools that we're ready to process our NOAA data into that concatenated and compressed file without first going through an intermediate file. So I'm going to save this script to my code directory and call it concatenate_dly.bash. I'll go to one of my other files and grab the shebang line there, paste it in at the top, and save.
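So the piped patterns from this stretch, in sketch form:

    # count the entries in a directory by piping ls into wc
    ls ghcnd_all | wc -l

    # stream the archive contents straight into gzip; no intermediate file on disk
    tar -Oxvzf practice.tar.gz | gzip > practice_pipe.gz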
Before I forget, I'm going to do a chmod +x on code/concatenate_dly.bash to make it executable, which means I can run it as code/concatenate_dly.bash and it will execute everything without me having to copy and paste commands back and forth. But of course, we want to process our real ghcnd_all file. The file is in data, and if I put a star after data/, the glob expands and we see we have data/ghcnd_all.tar.gz. So I'll remove the practice lines and drop that path in: again, we output everything to standard output, compress it, and redirect it to a file I'm going to call ghcnd_cat.gz. This should work, so I'm going to run it and keep track of how long it takes, to see if it's hopefully a lot faster than 17 days.

OK, that took about 12 minutes to run through. If I do ls -lth, I notice I accidentally left ghcnd_cat.gz in the project root directory; that's because I didn't put data/ in front of the output name. So I'll move it with mv, another command-line program, into data. Now if I do ls -lth data, I see ghcnd_cat.gz at 3.3 gigabytes, and ghcnd_all.tar.gz, the tarball, is also 3.3 gigabytes. So the concatenated and then compressed file is the same size as the tarball.

As this was running, I was thinking about some other ways I could make it even smaller. One comes from noticing, in one of these .dly files, that one of the columns we've talked about in previous episodes (again, this is a fixed-width formatted file) is a four-character variable giving the type of data being stored. Here we've got PRCP, which we want, SNOW, which we don't want, and a variety of other element types in these files. Really, all I care about is PRCP, the precipitation, because I'm looking at drought. So what I can do is add a grep step to this pipeline: grep PRCP will take the stream of data coming out of tar and return only the lines that contain PRCP, then send those on to gzip to compress into the output file. I think this could actually make it quite a bit smaller. So let's run it; it might take another 12 minutes, but let's see what the size shrinks down to.

That only took about four minutes, so it actually ran quite a bit faster once everything except the PRCP lines was dropped. If I do ls -lth data, I now see ghcnd_cat.gz is 850 megabytes, quite a bit smaller: about a fourth the size of ghcnd_all.tar.gz. Great. So I think we've done a good job of getting a small, concatenated version of the file much faster than three weeks, or even one week, and that's a lot more attractive. I know I often get people in the comments asking why I do all this stuff in bash when I could be doing it in R. Well, this is a case where I really can't do it in R, because it would be too slow. Doing it here in bash makes things a lot faster, and with a few small tools and concepts, like piping things through and redirecting the output to a file, we can get a pretty good result fairly quickly. I'm going to go ahead and save this; now I need to add it to my Snakefile, so let's minimize that.
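For reference, here's my sketch of what the finished code/concatenate_dly.bash ends up containing; the exact shebang line is an assumption on my part, since I just said I copied mine over from another script:

    #!/usr/bin/env bash
    # assumption: an env-style shebang; yours may name bash directly

    # stream every .dly file out of the NOAA archive, keep only the
    # precipitation lines, and compress the result in a single pass
    tar -Oxvzf data/ghcnd_all.tar.gz | grep PRCP | gzip > data/ghcnd_cat.gz

Remember that it needs the chmod +x once before you can run it directly by its path.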
And we'll come down to where I have the rule for the ghcnd_all archive, and I'll add a new rule that I'll call concatenate_dly_files. For my input, as we've seen in the past, we'll do script = "code/concatenate_dly.bash". Then my output: what did I call it? It was data/ghcnd_cat.gz; it's always easier to copy and paste, since I make fewer mistakes that way. The other input is the tarball, which I'll name tarball. Even though I'm not going to use tarball as an argument in the command, because I've used script as a named input, I have to name the tarball as well; that's a convention that Snakemake enforces, and otherwise you get errors. Also, if you have two values within a directive, they have to be separated by commas. Then for my shell directive, I'll copy the command in and clean up my spacing, and since I don't have a params entry here, I'll get rid of that. Now I'll add the output file to my targets, so that when I run snakemake, it will also generate this file whenever it needs to be updated.

Now let's update our R script. I'll remove some of the code I don't need anymore: the stuff with map, and this Sys.time call. And read_fwf can actually read directly from a gzipped file, so I'll point it at "data/ghcnd_cat.gz". I've got an extra parenthesis at the end, so I'll fix that and run everything to that point. That took a couple of minutes to read in; again, it's a large data file, and it's compressed.

Whenever I talk about reading in big data files, I generally get a couple of responses. One is the suggestion that we use the fread function from data.table; my investigation of fread is that it will not read fixed-width formatted files, so that won't work here. The other approach people suggest is the vroom package (vroom, like a car or motorcycle revving its engine). Well, it turns out the developers of vroom are the same people as the developers of readr, and if you look at the code inside read_fwf, it's already using vroom on the back end. My understanding is that they're folding vroom into readr to make a seamless transition there, so more than likely we're already getting the benefits of vroom. So that's all good.

One thing I want to check is the memory footprint of my data. top is a handy-dandy program from the command line that shows you information about the different processes running on your Unix-like workstation. If I hit the o key, I can type in the column I want the table sorted on, and if I sort on mem, it sorts on the memory column. We find that R is using close to 10 gigabytes of RAM. You'll recall I said that with GitHub Actions we're limited to 14 gigabytes of RAM, so we're getting kind of close to the ceiling. We'll have to see what happens when we run the rest of the pipeline and pivot this data longer: right now our days are in the columns, and we're going to pivot them longer so that we have a column for the station, the date, and the amount of precipitation. So I'm a little bit worried that we'll go over the memory ceiling.
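For the command-line side of this step, a hedged sketch: the snakemake invocation is just an illustration of asking for that one target with our three cores, and the Linux top flag is my assumption if you're not on a Mac:

    # ask Snakemake to (re)build just the concatenated file, using our three cores
    snakemake --cores 3 data/ghcnd_cat.gz

    # watch memory usage while R reads the file (macOS/BSD top, sorted by memory)
    top -o mem
    # on Linux (procps top), the equivalent would be: top -o %MEM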
If you want to find out the exciting conclusion to that story, please be sure to check out the next episode, which I'll link over here. Keep practicing with all of this. I hope you enjoyed learning more about these bash tools, and we'll see you next time for another episode of Code Club.