Hey folks! I'm in the midst of a series of episodes where I'm developing a project with the goal of creating a visualization that shows, across the globe, the level of drought for, say, the past month relative to that same window of the calendar over, say, the past 100 years. To do this, I'm downloading data from the NOAA website. NOAA is a U.S.-based agency, but they collect different types of climate data from weather stations located around the world.

The challenge we've run into is that my end goal is to post this code and run all the compute on GitHub as part of a GitHub Actions procedure, and GitHub Actions limits the amount of drive space we can use to about 14 gigabytes. The composite dataset I'm downloading from NOAA is 3.3 gigabytes compressed, but decompressed and extracted it's about 29 gigabytes, well over the footprint that GitHub Actions will allow. So what are the alternatives to extracting the entire archive and working with individual files? Thankfully, there's a handy-dandy R package called archive that lets us extract individual files from that compressed archive: work on one, get the next one, work on that, get the next one. We can leave the archive, which we call a tarball, compressed as it is and extract individual files to build a composite data frame. That's exactly what I'm going to show you how to do in today's episode. We'll see how to create archives as well as extract from them, and then apply that to our actual use case to see whether it will help us in the long run with what we want to do.

Here in Visual Studio Code I have a practice R script open that I'll be using to demo some of the utilities in the archive package. I'm also in my terminal, in the project root directory, drought_index. "main" shows up red because I've created this practice R script and it's not being tracked. If you'd like a copy of my project as it currently stands, with all the code and everything else, there's a link down below in the description to a blog post that will help you get a copy of the repository.

If I look in my data directory, I see the various files we downloaded from the NOAA website. Within ghcnd_all is a variety of files that I extracted from the tar archive (again, we call that a tarball). I see I have three files that I'd previously extracted from ghcnd_all.tar.gz. A file that ends in .tar.gz is a compressed archive. If you see .gz at the end, it means the file was compressed with the gzip algorithm; gzip, zip, and bzip2 are the main compression algorithms you're likely to see out there. The .tar indicates that the directory ghcnd_all, which has some 122,000 files in it, was tarred together: everything was archived, lumped together, into one file. The .gz then means that composite entity was compressed.

In the last episode, I extracted these three files to give us some data to play around with, and we're going to keep playing with them because they're small and don't take too long to work with. To create an archive, I can use archive_write_files(). I give it the name of the archive I want to create, which I'll call write_files.tar.gz.
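Before we dig into the demo, here's a minimal sketch of the idea we're building toward: reading a single member file straight out of the compressed tarball without unpacking the rest of it. The station file name here is hypothetical, just for illustration; any .dly path listed in the archive would work.

```r
library(archive)
library(readr)

# Read one member file directly out of the compressed tarball, without
# extracting the other ~122,000 files to disk.
# (Hypothetical station file name, for illustration only.)
one_station <- read_lines(
  archive_read("data/ghcnd_all.tar.gz", file = "ghcnd_all/ACW00011604.dly")
)
head(one_station)
```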
Then I have to give it the files I want to compress. So I'll come down to my terminal and rerun that ls, adding a star at the end so it gives me the full paths to those files. I'll highlight the output, copy it, and paste it up here in the script, where I can assign it to a vector. Each element needs to be a character type, so I wrap each path in quotes, and now I have a c() vector of the different files.

Before I fire up R, I need to make sure I'm in the right conda environment, so I do conda activate drought. That makes sure I've got everything nicely loaded. I can then run R and load my two libraries, and those load well. After adding a closing parenthesis, I can run archive_write_files(). If I open up a bash shell, let me get some more real estate here, and do ls, I now see that I've got write_files.tar.gz. As we saw in the last episode, I can do tar xvzf write_files.tar.gz to extract it, and that basically recreates those three files, putting them back into data/ghcnd_all. It re-extracted everything back to where it came from. Okay, so that's archive_write_files().

There's also an archive_write_dir(), which is actually a little easier than what we just did. I call archive_write_dir(), name the output write_dir.tar.gz, and then give it the name of a directory, data/ghcnd_all. It uses the contents of that directory to create write_dir.tar.gz. Now, if I come back to the shell and do ls, I see write_dir.tar.gz. That's great, and maybe I'll extract this one too, with tar xvzf write_dir.tar.gz, adding -C practice so I'm writing into a new directory rather than over everything already in data/ghcnd_all. First I need to make that directory with mkdir practice. Now if I look in practice, I see those three files.

The other difference you'll notice between archive_write_dir() and archive_write_files() is that archive_write_files() preserved the path to the files: as you saw in the output, it stored data/ghcnd_all and then the file names. With archive_write_dir(), all that path was dropped, which is why the files landed directly in the practice directory. Very cool.

All right, so that's how you create an archive. If you look at the help documentation, there's a variety of ways you can compress and collect data together; you can make zip files if that's what you want to do. But the key point about what we're using here is, I think, encompassed in this tar idea: we've got a whole bunch of files that we're putting together in an archive, and then we're compressing the archive. If you had a standard gzipped file, that would be a single file that's compressed; here we're really compressing a whole bunch of files together. With functions like read_tsv() or read_fwf(), which we've seen before, you can hand them a .gz file, but it has to represent a single compressed file. It can't represent a bunch of files compressed together.
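To put those two functions side by side, here's a minimal sketch, assuming the three .dly files from the last episode are sitting in data/ghcnd_all (the station file names are stand-ins):

```r
library(archive)

# Three .dly paths standing in for the files extracted last episode.
dly_paths <- c(
  "data/ghcnd_all/ACW00011604.dly",
  "data/ghcnd_all/ACW00011647.dly",
  "data/ghcnd_all/AE000041196.dly"
)

# archive_write_files() stores each file with its full relative path...
archive_write_files("write_files.tar.gz", dly_paths)

# ...while archive_write_dir() archives the contents of a directory,
# dropping the data/ghcnd_all/ prefix from the stored paths.
archive_write_dir("write_dir.tar.gz", "data/ghcnd_all")
```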
Okay, so that's a subtle distinction that I think is pretty important. Now we want to talk about how we can see the contents of an archive. To do that, making sure we're in R, we call archive() and give it the name of the archive. Let's give it the one we just made, write_dir.tar.gz. This outputs a data frame with the path to each file, its size, and the date it was made. We could also call archive() on data/ghcnd_all.tar.gz, the whole dataset. That took a few moments to run, since it had to parse through 122,012 rows, and it outputs the paths of the different files in the archive.

One thing that's a little odd is that there's an empty directory path in there. I'm not sure why that happens, but basically the directory name itself is getting included in the archive. That wasn't something I could control; again, that's the way we got the data from NOAA. But now we have a tibble with the names of all the files in our archive. That's basically the same thing as what's in data/ghcnd_all_files.txt: if I do a head on that file, I see the names of all the files in the archive. So archive() and tar tvf do essentially the same thing.

Okay, so we've talked about how to create a compressed archive and how to look at what's in one. How do we now read out of a compressed archive? To read, we use archive_read(), which creates a connection into an archive. Let's give it write_dir.tar.gz. Running that, as I said, creates a connection. One thing to note: we know this archive has three files in it, and the full one I showed you earlier has 122,000 of them, so we need to tell it which file we want to read out. I can pass 1, which builds a connection to the first file. That isn't super interesting on its own, but we can feed the connection into one of the read functions, like read_tsv(). Instead of giving read_tsv() a path to a file, we're giving it a connection to a file within an archive. Running that, we get a bunch of warning messages, but you'll see we've read in the contents of that first file. We're trying to use TSV on it, and we talked in the last episode, when we covered fixed-width files, about why that doesn't work cleanly. Still, this lets us read a file out of a tar.gz, which is pretty cool. If I want the next file, I can pass 2, and that gets the next one.

If instead of numbers I want to pass in the name of the file, I can grab that easily enough. We know the first 11 characters of each file name contain the station ID, so I can pass in the station name plus .dly, and that does the same thing as passing 2. So we can give it either the name of the file we want out of the archive or its position. The other thing to note is that what it actually wants is the path to the file inside the archive. So again, if I look at archive() on write_dir.tar.gz, I'm looking at these stored names, and the second entry there is the file we just read.
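Here's a compact sketch of the listing-and-reading pattern, assuming the practice archive we just created:

```r
library(archive)
library(readr)

# archive() lists the members as a tibble with path, size, and date.
contents <- archive("write_dir.tar.gz")
contents

# archive_read() opens a connection to a single member, by position or
# by its path inside the archive; read_tsv() can consume the connection.
read_tsv(archive_read("write_dir.tar.gz", file = 2))
read_tsv(archive_read("write_dir.tar.gz", file = contents$path[2]))
```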
If I wanted the third one, I'd grab that name, which of course has to be in quotes, and that reads in the third file. So you can perhaps quickly see that if I wanted to read in these three files, I could do something like this: three read_tsv() calls, each extracting one of the three files. And we can think about using that archive() function to get the names of the .dly files we want.

Let's see how we might do this in a more generalized way, because I don't want to write out the same function call 122,000 times. I can take that archive() statement, which, just to remind us, gives us the path column. I could do select() on path, but so that I get a vector, let's do pull() instead, and that gives us those three names. I can then take that vector and pipe it to map_dfr(). Inside map_dfr() I'll use a tilde, the formula notation, which means I'll take the argument coming into the first slot, called .x, and do something with it. What I'll do is read_tsv(), and I'll plop archive_read() in the middle of it. I still want write_dir.tar.gz as the archive, but in the file position I want .x, the value coming through from the vector. Running the whole thing creates a composite data frame. The columns are formatted poorly, but you get the idea of what we can do for reading in the full archive with its 122,000 files.

So let's head over to our read_dly_files script and see if we can apply what we've learned. I'll copy this bit of code from the end of my practice script and put it up here, right before dly_files. Previously I generated dly_files by reading in those three files from data/ghcnd_all; that was the approach we would have used if we could have decompressed the entire archive. So I'll remove that, and instead we'll grab this bit of code that uses the archive() function. But I'm not going to run it on write_dir.tar.gz; instead, I'll point it at data/ghcnd_all.tar.gz and pipe that to pull(path). We'll see what this looks like; we might need a little tweaking. That should generate the vector, and if I do head() on dly_files, I see all of my file names with that ghcnd_all directory prefix. I also see what I saw before when I ran archive(): the bare directory name without a .dly file. So I'm going to add a filter() statement using str_detect() from the stringr package, looking through the path column for anything that contains "dly"; that will get rid of the bare ghcnd_all entry. We'll run it and double-check: doing head(dly_files) again, the line that just had the directory name is gone, and we're in good shape. We now have our dly_files, and we're ready to think about using map_dfr(). So again, we take dly_files and pipe it to map_dfr(), using the contents coming through the pipeline... and all these popups in Visual Studio Code are getting annoying; things are just a little too sensitive.
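Putting that together, here's a sketch of the generalized pattern, first on the small practice archive and then building the file list from the real one:

```r
library(archive)
library(tidyverse)

# Generalize the three hand-written calls: list the archive, pull out
# the member paths, and map a reader over them to build one data frame.
archive("write_dir.tar.gz") %>%
  pull(path) %>%
  map_dfr(~ read_tsv(archive_read("write_dir.tar.gz", file = .x)))

# The same listing step on the full NOAA archive, with a filter to drop
# the bare directory entry that comes along for the ride.
dly_files <- archive("data/ghcnd_all.tar.gz") %>%
  filter(str_detect(path, "dly")) %>%
  pull(path)
```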
So sorry for all the popups. It annoys me too. All right, we're going to use map_dfr() with the contents of dly_files piped into it. I need to put a tilde in front of the read_fwf(), and instead of dly_files as the file argument, I'm going to use archive_read(). The archive we're reading from is data/ghcnd_all.tar.gz, and the file we want is the value of .x: that's the file we're going to extract. Then we read it in with all this other great stuff, and I think I need another parenthesis here.

Before I run this full pipeline, I'd rather work on a smaller scale than all 122,000 files. So I'll come back up to the dly_files pipeline and add a slice_sample() with n = 5, which randomly grabs five rows out of that data frame. I'll run both steps to make sure it works and see how long it takes. It's complaining that object widths is not found, and that's because I forgot to load all the other great stuff I already had in this script. So let me make sure all those libraries are loaded and that the widths and the headers are defined. Okay, that's all loaded. Now we can run the pipeline that reads the fixed-width format for those five files.

That took a little longer than I was hoping it would to process only five of the files; we'll come back to the timing. First I want to double-check that I'm getting out what I expect, and I see I forgot to save the result to a variable. So I'll do composite <- .Last.value; .Last.value stores the value of the last output. If I look at composite, I see basically what we just had. I can do count() on composite by ID (I think ID, in all caps, was the name of the first column), and I see the five files and the number of rows each of the five data frames contributed. Wonderful.

Okay, that took a little time to run for five files, and I'm actually going to up it. Let's do 12, see how long that takes, and then multiply by about 10,000 to estimate the full run. To time it, I'll call Sys.time() at the beginning of running the pipe and again down here at the end; the difference between those two values tells us how long it took, and we multiply the result by 10,000. It ended at 14:42:20 and started at 14:39:52, so that's about two and a half, call it three, minutes. If I take 2.5 times 10,000, that's 25,000 minutes; divided by 60 minutes per hour and 24 hours per day, it's going to take about 17 days. That's a long time.

One thing we could do is use the furrr package, which lets you parallelize the map functions. With the resources GitHub Actions gives you, you get three processors, so that would divide the time by three: instead of 17 days, it would be something like six days. I was hoping to update this every day, not every week or every couple of weeks, so that just doesn't seem practical.
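For reference, here's a sketch of the sampled, timed version of the pipeline. It assumes widths and headers were defined back in the fixed-width episode, and I'm guessing at the exact read_fwf() column arguments (fwf_widths(widths, headers)), so treat those as placeholders:

```r
library(archive)
library(tidyverse)

# widths and headers come from the earlier read_fwf() setup and are
# assumed to be defined; they are placeholders here.
dly_files <- archive("data/ghcnd_all.tar.gz") %>%
  filter(str_detect(path, "dly")) %>%
  slice_sample(n = 12) %>%
  pull(path)

start <- Sys.time()
composite <- dly_files %>%
  map_dfr(~ read_fwf(archive_read("data/ghcnd_all.tar.gz", file = .x),
                     col_positions = fwf_widths(widths, headers)))
end <- Sys.time()
end - start  # ~2.5 minutes for 12 files, so roughly 17 days for 122,000

# Sanity check: one block of rows per sampled station ID
composite %>% count(ID)
```

If the per-file time were smaller, swapping map_dfr() for furrr::future_map_dfr() after calling future::plan(multisession, workers = 3) would be the natural way to use the three processors GitHub Actions provides.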
So while I think archive_read() and the archive package are really useful when you only need to extract a small number of files from an archive, ultimately I don't think this solution is going to work. I need to go back to the drawing board and think about how I can get the data I want out of this archive while working within the constraints that GitHub Actions imposes on me. You could say, well, why don't you go use Amazon Web Services or some other platform? Well, first of all, I want to learn GitHub Actions, so there's that. The other thing is that I think GitHub Actions is going to make things a lot easier for me than spinning up my own infrastructure on Amazon and taking on all sorts of overhead that GitHub Actions will theoretically handle for me. So in the next couple of episodes, hopefully I'll be able to share with you the solution I've come up with for dealing with this issue. I have some ideas, but I'm not quite ready to share them. So that you don't miss that episode, please be sure to subscribe down below, give this video a thumbs up, and tell your friends what we're doing here. Practice with this: see if you can use the archive_read() and archive_write functions with your own data. Let me know how it goes, and we'll see you next time for another episode of Code Club.