Hey folks, if you've been watching recent episodes of Code Club, you know that I am in the midst of a series of episodes where I am trying to build out a project, a workflow, where I can update the data I pull down from the NOAA website every day. NOAA is a US agency that keeps track of global climate data, and what I want to do is build a daily update of a visual that shows the amount of drought, or lack of drought, across the world. So again, we get daily updates. I want to run this every day without having to touch anything, letting GitHub Actions do the work for me. Cool. Well, in recent episodes, we've talked about using Snakemake, conda, Git, project organization, all the good stuff for reproducibility, to download data from NOAA. Now we are ready to read those data into our R session. One of the problems with our data, though, is that it comes to us as a set of 122,000 or so files that are compressed in a tar archive that's about 3.3 gigabytes.

Here I am in my drought index project root directory. If you want to get a copy of my project as it currently stands, go down below into the description of this video and you will find a link to a blog post that gets you all the great stuff that you need. Also, you'll see that to the right of the directory name I've got main in parentheses in green. Because I'm using version control, I now know that I am on the main branch. It is green, which tells me I am good to go: everything is committed and up to date. To familiarize ourselves with the directories, I can run ls. I can do ls -F to see which of these things are files and which are directories; the ones with the forward slash are directories, and those without are files. To look within the data directory, I can do ls data. And again, I can do ls -lth on data and see those different files that we downloaded in the last episode. The one I'm mainly interested in for this episode is ghcnd_all.tar.gz, and it's 3.3 gigabytes.

So to decompress that archive, I can do tar -xvzf data/ghcnd_all.tar.gz. This extracted everything. And if I do ls, unfortunately, I notice that ghcnd_all, the directory that holds all these files, is in my project root directory right here. There's also this ghcnd-version.txt file. So this isn't where I want it. But before I delete this and then put it where I want it, let's see how big the decompressed version is. To get at that, we can do du -sh ghcnd_all, and this shows us that there are 29 gigabytes. So du is nice for getting a composite quantification of the amount of space a set of files is using. ls -l will show us the individual file sizes, but du is really nice with that -s flag for getting the summary size of a directory.

So we see that we went from 3.3 gigs to 29 gigs. 29 gigs is much larger than the footprint that GitHub will allow us when using GitHub Actions; I think they limit us to about 10 gigabytes. We'll deal with that problem another day. But for now, let's figure out how we can decompress a tar archive where we want it to go. Before we work with tar more, let's go ahead and remove ghcnd-version.txt, and use rm -rf, that's recursive and force, on ghcnd_all to remove that directory from our project root directory. That actually took a moment to do. But again, if I do ls, I now see that those ghcnd files are gone. All right, let's cycle back up through our commands.
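By the way, if you'd rather drive the extraction from R instead of the shell, base R's untar function can do the same job. Here's a minimal sketch, assuming the archive sits at data/ghcnd_all.tar.gz as in my project:

```r
# List the archive's members without extracting them (this can be
# slow for an archive with ~122,000 files)
head(untar("data/ghcnd_all.tar.gz", list = TRUE))

# Extract into data/ instead of the project root; exdir plays the
# same role as tar's -C flag
untar("data/ghcnd_all.tar.gz", exdir = "data")
```

The exdir argument is what keeps the decompressed files out of the project root.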
And to send the decompressed data from our archive to a different directory, at the end here I can do -C, and then the name of the directory I want it to go to. So I can do -C data/. Now this is giving us that same output. I'm going to kill it by pressing Ctrl-C, because this isn't really what I want to do. What I do want to do here, though, is show you that we were successful. If I do ls, I now see that I don't have those ghcnd files in my project root directory. If I do ls data, however, I now see that I've got my ghcnd_all directory here. And if I do ls data/ghcnd_all, I see all those .dly files.

Now, I don't want all 122,000 files, because I'm trying to build up a script to read in these individual files and put them together, and that's going to take some time with 122,000 files. What I'd rather do is extract, say, three files and create a data/ghcnd_all directory that has just those three files. So again, I'm going to go ahead and remove that data/ghcnd_all, and I'll show you how we can pull out three individual files. We come back to this tar command that we ran earlier, and I can then do ghcnd_all/ and grab a couple of these file names; again, ghcnd_all is the directory they're in within the archive, and I'm grabbing really any three of these .dly files. And we'll grab one more from over here, say. So now what we've done is given tar three different file names that I want to extract from that archive; again, there are 122,000 files in the archive, and I only want three. And I want them in the data directory, so it's going to be data/ghcnd_all and then these three files once it's completed running through this. Great. So again, that takes a minute, but now if we do ls data/ghcnd_all, I see the directory with those three files.

Let's go over to data and see what one of these files looks like. I see that there's no header, so we're going to have to figure out what these different columns represent, but we can kind of get the sense that maybe there is some tab separation or space separation to the data. In this one there are 54 different rows and a number of different columns.

So let's go into the code directory and open up a new script, and I'll call this read_dly_files.R. The nice thing about Visual Studio Code is that it guesses that this is going to be an R script: it puts the nice R icon next to it, and it does things like, if I type library, it pops up information from the help about the library function. It also knows the different types of words that we're going to be using, so it gives us help there. So I'll put in tidyverse, and I can run that down here in my terminal, loading all the great tools from the tidyverse. Great.

What I would typically do, without much thought, would be to go ahead and do read_tsv and give it the path to the file I want, so data/ghcnd_all/ followed by one of the three station file names. Clearly, I would not want to be typing in all those file names for 122,000 files, so we'll come back to that, too. What we see is that read_tsv really struggled with this file. There is no header, no column names, so it took the first row and made that the column names. Also, it didn't find any tabs, so it basically put all the data into one column. There's no error message, but it didn't really do what we wanted it to do. An alternative to read_tsv might be read_table.
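If you want to peek at one of those raw files from inside R rather than the shell, readr's read_lines is handy. A quick sketch, where the path is simply whichever of the three .dly files you extracted:

```r
library(readr)

# Grab the path to one of the extracted station files
dly_file <- list.files("data/ghcnd_all", full.names = TRUE)[1]

# Look at the first few raw lines to confirm there's no header row
read_lines(dly_file, n_max = 3)
```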
And so read_table will parse a file based on white space. We do this, and we find we do a little bit better, in that we get multiple columns, right? What we find, though, is that a number of the rows have different numbers of columns. That's a problem, because we're expecting each row of the data to have the same number of columns, and it doesn't. We still haven't solved the problem with our column names, and what we find is that a number of the columns have the same names after this parsing.

So what we need to do is what we should have done initially, and that's to look at the README file. If we look at the README for the daily Global Historical Climatology Network, the GHCN daily, which is up at the archive where we downloaded the data from, we can scroll through here looking for some description of these .dly files, which populate the archive. Scrolling down, we find in section three, format of data files, the .dly files. Each .dly file contains data for one station: we have 122,000 files, so we have 122,000 different stations, and each file name corresponds to the station name. Then what we find is that each record in the file contains one month of data, so each row is a different month. The variables on each line include the ID, the year, the month, the element, and then what we get are these quadruplets of columns that go together. Each quadruplet corresponds to a different day of the month, and we see that this goes down to day 31. So this will probably cause problems for months like February that have 28 and occasionally 29 days: it's basically going to plug in NA values for days 30 and 31 of February, and usually for day 29.

What we find is that this is what's called a fixed-width formatted file. It's not a TSV, a tab-separated file, and it's not a CSV, a comma-separated file. It's formatted based on position on the line. We find that the ID appears in columns 1 through 11, the year between columns 12 and 15, the month in 16 and 17, and so forth. So instead of using a character to delimit the data, it's using position. I'm going to go ahead and grab this table because I think it will be helpful, and I'm going to plop it into my R script and comment it out so that we don't have to worry about accidentally running it. Let's get us some more real estate here. I'll also grab the URL for the README so that it's always handy and accessible.

Instead of read_table or read_tsv or read_csv, what we'll use is read_fwf. read_fwf, again, will parse the data based on fixed widths. The default approach is to basically do what read_table does: it looks for spaces to figure out how to create the columns, but it does it in a rectangular format so that every row has the same number of columns. We get through this and we don't see any error messages, which is great. We still don't know what these different columns are, though, and one of the problems is that X1 actually contains multiple fields of data. If I do select on X1, I see that I've got the station ID, which is the first 11 characters, the year, this 1929, the 07 for the month, and then PRCP for the element that it's measuring. So the default behavior of read_fwf parses on white space rather than on actual positions in the data. Let's go ahead and look at the help for read_fwf, minimizing that.
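The character positions in that README table map directly onto readr's fwf_positions helper. As a quick sanity check, here's a sketch that pulls out just the first four fields using the documented positions (the day quadruplets would continue out to character 269):

```r
library(readr)

dly_file <- list.files("data/ghcnd_all", full.names = TRUE)[1]

# Positions straight from the README:
# ID 1-11, YEAR 12-15, MONTH 16-17, ELEMENT 18-21
read_fwf(dly_file,
         col_positions = fwf_positions(start = c(1, 12, 16, 18),
                                       end   = c(11, 15, 17, 21),
                                       col_names = c("ID", "YEAR", "MONTH", "ELEMENT")))
```

Characters past position 21 are simply ignored here, which makes fwf_positions a nice way to spot-check a layout before committing to the full 128 columns.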
And what we find is that there are four different approaches that read_fwf can use to parse apart the data. Again, the default was fwf_empty. There's fwf_widths, where we give it the widths of the different columns and can also give it column names; then fwf_positions, which is another way to give the same type of information; and fwf_cols, again the same type of information. I'm going to use fwf_widths, and I'd encourage you to go back and see if you can't figure out how to do it with those two other helper functions.

So again, I'm going to remove that pipe, and what we can then do is fwf_widths. This takes two arguments, widths and column names, so I need to create two vectors, of widths and of column names, and we'll do that based on the information here in this table. If I do widths, I'll create that vector. And remember, the column width is going to be, for ID, 11 minus 1 plus 1, so it'll be 11 characters wide; year is 15 minus 12, which is 3, plus 1 is 4. So we'll do 11, 4. Then month will be 2 and element will be 4. And then we have VALUE, MFLAG, QFLAG, and SFLAG for each of the 31 days of a month. I don't want to write that quadruplet out 31 times, so what we can use instead is the rep function. We'll do rep on a c vector, in which VALUE is going to be 5 characters wide and MFLAG, QFLAG, and SFLAG are each going to be 1 long, and I'm going to repeat this 31 times. So now when I run widths, I get 128 different values, which I can see by doing length on widths. And then I can sum the widths and see that each line is 269 characters wide, which is exactly what the README tells me it should be.

So we have our widths, and now we need our headers: we need to create the names for our different columns. Again, we'll do that with a c function. The first was ID, and I'm going to leave these in all caps to replicate what the README says. Then we have YEAR, MONTH, and ELEMENT. And then we need to create those quadruplets, and again, I don't want to have to type those out. So what I'm going to do is create a function that will produce the quadruplet, and I can then take that function and iterate it over the values of 1 through 31 using the map function.

Okay, so let's build this function out a little bit organically, and we will use the glue function to do this. I'll come back up to the top here and do library(glue) to get that loaded. We will then do glue, and we will say "VALUE", and then in curly braces I'll put x. x, for now, I'm going to make 29, and then I get the output being VALUE29. I can create a vector of these using a c function again, and we'll go ahead and copy and paste these out, using MFLAG, QFLAG, and SFLAG. I don't really care about these actual names, because I'm immediately going to get rid of them; I just like to have it be the way it's supposed to be. So that's an example of the quadruplet. We can then use quadruplet as the name of our function, and it can take x as the argument. We'll tab that over, get some nice white space in here, and I've got a closing brace there. Now I can create the quadruplet function, and let's give it, say, 20. I now get the four values for 20, so I'll go ahead and get rid of that x of 29, and I've got quadruplet loaded.
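Put together, here's roughly what that widths vector and the quadruplet function look like; a reconstruction of what's on screen, so details may differ slightly from the video:

```r
library(glue)

# Field widths from the README: ID (11), YEAR (4), MONTH (2), ELEMENT (4),
# then VALUE (5) plus MFLAG/QFLAG/SFLAG (1 each) repeated for days 1-31
widths <- c(11, 4, 2, 4, rep(c(5, 1, 1, 1), times = 31))

length(widths) #> 128 columns
sum(widths)    #> 269 characters per line, matching the README

# Build the four column names that go with a given day of the month
quadruplet <- function(x) {
  c(glue("VALUE{x}"), glue("MFLAG{x}"), glue("QFLAG{x}"), glue("SFLAG{x}"))
}

quadruplet(20)
#> "VALUE20" "MFLAG20" "QFLAG20" "SFLAG20"
```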
And again, what I can do is map over 1:31. 1:31 creates all the integer values from 1 to 31, and then I give that to quadruplet. What it produces is a list where each value in the list is a vector corresponding to one of the days. I want a vector, though, so what I can do is wrap map inside of unlist; unlist will flatten my list into a vector, giving me 124 different values from my quadruplet function. So I'll grab that and paste it into the headers vector. I think I ran it when I still had something on the line there, so I'll go ahead and rerun that. And again, if I look at headers, I now see that I've got 128 different values, and we're good to go. Great.

So now we're ready to come back to fwf_widths. I can put in widths and headers, and we can run this. What we get is the ID, the year, the month, the element, the different values; that's all looking good. These NAs are reminding me that I thought I saw something in the README about missing data. If I scroll down, yes: VALUE1 is the value on the first day of the month, and missing is indicated by the value -9999. So we need to add that as a possible NA value. In any of the read functions from the readr package, we can use an argument called na, where we give it a vector of values, so I'll set na to include -9999.

This then gives me a warning message that there are one or more parsing issues, run problems() for details. So I'll do that, and what we find is that on row 1, column 6, it expected a logical but it got a character. And again, if we come back to row 1 and count over to column 6, 1, 2, 3, 4, 5, 6, this was guessed as a logical and probably really should be a character. One of the things you can do with any of the read functions from readr is specify the column types. So I could do col_types = cols(), and then I could say something like MONTH = col_double(). What that does is say that MONTH should be read in as a double. Now when I run this, I'm still getting that warning message, but MONTH at least is being read in as a double. That works for a lot of situations where you have a small number of columns, but we have 128 columns, and I'm not going to do that for all 128 of them. What you can do instead is give a default type of data to read in. We can do .default = col_character(), and what this does is read every column in as a character, getting us over the problem we have with some of these flags, where they're guessed as logicals and it really feels like they should be character. We run that, and great, the warning message goes away and we're in good shape.

As I mentioned, we have 128 columns, and I'm not interested in most of them. I can use the read functions from readr to select out the columns I actually want, with an argument called col_select. To col_select I can give a vector of the column names that I want to keep, so I can do ID, YEAR, MONTH, ELEMENT. And I want all of the columns that start with VALUE; well, if you've used select before, you know there are these helper functions, so there's starts_with, and we can give that "VALUE". Now when we run these lines... I must be forgetting a closing parenthesis in here. Yep, for the read_fwf. I'll add that. Now what we see is that we don't have any of those flag columns.
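Assembled, the read call looks something like this; a sketch using the widths vector and quadruplet function from above, with a placeholder file path:

```r
library(tidyverse)

# 4 metadata names plus 31 quadruplets = 128 column names
headers <- c("ID", "YEAR", "MONTH", "ELEMENT", unlist(map(1:31, quadruplet)))

read_fwf(dly_file,  # placeholder for the path to one .dly file
         col_positions = fwf_widths(widths, col_names = headers),
         na = c("", "-9999"),  # treat blanks and the -9999 sentinel as missing
         col_types = cols(.default = col_character()),
         col_select = c(ID, YEAR, MONTH, ELEMENT, starts_with("VALUE")))
```

Reading everything in as character looks crude, but it sidesteps the type-guessing warnings, and we'll convert the one column we actually compute on back to a number later.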
And we went from 128 or so different columns down to 35. So we have the columns that we are interested in. Recall that we actually have three files in ghcnd_all, and, as I keep mentioning, there are 122,000 total files. Well, what we can actually do is give read_fwf a vector of file names. It will then read them all in using these same arguments and concatenate them together. In the past, I would create a function that has all this information and then map it over all of the file names. That's not necessary, I've learned.

So let's create a vector with the .dly files that we're interested in. I'll do list.files and give it the path, data/ghcnd_all. Running that, I get these three .dly file names, and I can get the full, long names by adding full.names = TRUE. This gives me the path to those files, and I'll call this vector dly_files. dly_files has the full names, so I can then replace that single file with dly_files. Now when I run that, I get a table that again has my 35 columns but now has 1,043 rows. And if I pipe this to a count on ID, I see how many rows I have for each of my three different weather stations. That's really slick: instead of having to run read_fwf three times, or map it over a list of file names, I can give that vector directly as the file argument to any of these read functions, read_fwf, read_csv, read_tsv, and it does it for me. Cool. Again, the only thing in here that's unique to read_fwf is this fwf_widths argument.

So let's see where we're at. We again have ID, year, month, element, and then 31 columns for the precipitation, or whatever type of data is in here. Let's do a count on element to see what types of data are in here. We've got all sorts of different things, and there is this PRCP. So there are a few things that I want to do to clean this up before we write it out to a storage file that's a lot more compact than those 122,000 files.

The first thing I want to do is get rid of these dreadful column names that are all caps. To apply a single function to all column names, we can do rename_all, and the function I'll give it is tolower. tolower takes uppercase letters and turns them into lowercase letters, and now you'll see that all of my column names are lowercase, which I find makes things so much easier to work with. The other thing I want to do at this point is go into the element column and return only those rows that have precipitation data. So I'll do a filter on element == "PRCP"; again, "PRCP" is the value in element, which is still all caps. We go from 1,043 rows down to 884 rows because, as we saw with that count, there were some other element types in there that we're not interested in. Cool. I'll go ahead and do a select to remove the element column.

What I'd like to do next is pivot this longer, so that I have a column for the day and a column for the precipitation rather than these 31 different columns. To do that, we'll do pivot_longer. I'll give it cols, and I'll use starts_with("value") to take those columns that start with value and pivot them longer. Now what we get is the id, the year, the month, and columns called name and value; name and value are the default names for the old column names and the values they held.
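Here's a sketch of the pipeline at this stage, from the multi-file read through the element filter; again a reconstruction that reuses the widths and headers vectors from above (the pivot_longer step comes next):

```r
# read_fwf accepts a vector of paths and row-binds the results
dly_files <- list.files("data/ghcnd_all", full.names = TRUE)

dly_data <- read_fwf(dly_files,
                     col_positions = fwf_widths(widths, col_names = headers),
                     na = c("", "-9999"),
                     col_types = cols(.default = col_character()),
                     col_select = c(ID, YEAR, MONTH, ELEMENT,
                                    starts_with("VALUE"))) %>%
  rename_all(tolower) %>%           # lowercase all column names
  filter(element == "PRCP") %>%     # keep only precipitation records
  select(-element)

dly_data %>% count(id)              # rows per weather station
```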
So I'll do names_to = "day" and values_to = "prcp", and maybe I'll put this over two lines so it doesn't go off the side of the screen. Running this, we now get id, year, month, day, and prcp. Cool.

So I want to do some modification to the date information as well as to the prcp column. Let's start with the date information. I want to create a single date column from these three columns, but first I need to get the word value out of day so that it's an actual number; even if it stays a character, I don't want "value" in there. So I'll do a mutate on day, using str_replace on day to take "value" and replace it with nothing. Running this, we see that we've gotten rid of "value" from the day column.

I will then create a column called date, and to do that I'll use the ymd function that comes to us from lubridate, so I'll do library(lubridate) and get that loaded; it is, of course, installed with the tidyverse package. I will use glue to build out the date, and it'll be in year-month-day notation. So, in quotes, year in curly braces, a hyphen, then another set of curly braces with the month, not the day, the month, and then a hyphen and day in curly braces. And actually, this all needs to be in glue, so we glue all that stuff together. Now when we run this, we get a date column that's of type date, but I find that there are 497 rows that failed to parse. My thought is that that's probably happening in months like February, where there are generally only 28 or 29 days. To deal with that, I'm going to insert, after the pivot_longer but before the mutate, a drop_na, which will get rid of those dates that don't exist, because they're represented by NA values. One thing to note, though, is that when we do this, if a precipitation value was NA for a day that actually existed, then we're basically going to be assuming that it was a zero value. Okay, I'll run that, and we find that the parsing issue goes away.

The next thing I want to do is take on that prcp column, which, again, was read in as a character by default. prcp needs to be numerical, because before too long we're going to want to do some data manipulation that assumes it's a number. To make it a number, we can take prcp and do as.numeric on it. Now when I run all this, I find that prcp is of type double, so that's numerical. I'm also not sure what the unit is, so, returning to our README, I do a search for PRCP and find that precipitation is measured in tenths of millimeters. So if I want it in millimeters, I need to divide the value by 10; for centimeters, by 100. I think I'll do it in terms of centimeters, so I'll divide this by 100, and I'll leave myself a note to say prcp is now in cm. Okay. So again, we run all this: no error messages, no warnings, and we now have precipitation in centimeters. Cool.

The final things I want to do are to get rid of the columns I don't want and to write it out to a file. I'll insert a pipe here, to the left of my comment, and do a select on id, date, and prcp; I don't need the month, day, or year columns. We run that, and now we've got our data frame with just those three columns. I can now do write_tsv, writing it out to data/composite_dly.tsv.
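The back half of the pipeline, roughly as built on screen; a reconstruction, so the exact file name and column order may differ from the video:

```r
library(tidyverse)
library(glue)
library(lubridate)

dly_data %>%
  pivot_longer(cols = starts_with("value"),
               names_to = "day",
               values_to = "prcp") %>%
  drop_na() %>%                                  # removes days that don't exist (e.g., Feb 30/31)
  mutate(day = str_replace(day, "value", ""),    # "value17" -> "17"
         date = ymd(glue("{year}-{month}-{day}")),
         prcp = as.numeric(prcp) / 100) %>%      # prcp now in cm
  select(id, date, prcp) %>%
  write_tsv("data/composite_dly.tsv")
```

Note that the single drop_na also drops real days whose precipitation value was missing, which is the zero-assumption caveat mentioned above.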
And I can go ahead and run all this, and now when I look over in my data directory, I've got composite_dly.tsv with my three columns. As I've repeatedly mentioned, we have 122,000 files. It's gigantic, and it's going to take a long time to run. So I'm not ready to decompress that full archive and run this on it, because I know that, given the engineering constraints of working with GitHub Actions, I'm not going to be able to fully decompress the archive anyway. So in the next episode, we will use some special tools from the archive package in R to pull individual files out of an archive by giving it the file name. This will allow us to build our composite file without first decompressing the entire archive. I'll put a link to that episode over here to the side; be sure you watch it so you can keep up to date with what's going on in this project. We'll keep practicing with all this stuff. I hope you learned a lot about the different readr functions, and I'll see you next time for another episode of Code Club.
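As a small teaser for that episode: the archive package can hand readr a connection to a single member of a compressed archive. A sketch only, with a placeholder member name; we'll work through this properly next time:

```r
library(archive)
library(readr)

# Open a read connection to one member of the tarball without
# extracting the other ~122,000 files; the member name below is
# a placeholder for a real station file
con <- archive_read("data/ghcnd_all.tar.gz", file = "ghcnd_all/some_station.dly")
read_lines(con, n_max = 3)
```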