I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data. In the last episode of Code Club, we downloaded the files that we'll be using to determine to what degree inter- and intra-genomic variation limit our ability to interpret amplicon sequence variants, or ASVs. We did this in the context of learning about GitHub flow. One problem with how we did it was that we needed to use a browser to navigate to a website and then download the files. That's fine if we only need to do it once, but if we had to do it for multiple files, it could be a real pain. Imagine that there's a directory of 100 files you want to download; downloading all those files manually would just be horrendous. Our approach from last week would also be pretty challenging if we were working on a high-performance computer that doesn't have a browser window. For today's Code Club, we'll use GitHub flow again to see how we can download the files we got last week using tools directly from the command line. In a future episode, we'll gain a greater appreciation for this approach, because it becomes easier to automate the process if the files are updated or if ours ever get corrupted. Please take the time to watch today's episode all the way through, follow along on your own computer, and attempt the exercises. Don't worry if you're not sure how to solve them; at the end, I'll provide solutions. In the notes below this video is a link to a blog post for today's episode that includes installation instructions, reference notes, and other helpful links. It's meant to supplement the material in this video.
Finally, don't forget to like and subscribe to the Riffomonas channel, and click on the bell icon when you subscribe so that you're notified when the next episode is released. As I mentioned in the introduction, what we're going to work on today is using command line tools to download the files that we downloaded last week with the browser. Being able to do it from the command line allows us to automate this process. It also allows us to do these downloads on a high-performance computer where we don't have a browser, say. If you're using Windows, everything is already installed; again, you're going to want to use the Ubuntu Bash system within Windows 10. If you're using a Mac, this is the one time where you have to install something special. Look at the blog post that accompanies this video; there are instructions on how to get a tool called Homebrew and use Homebrew to install wget. A Mac is like Linux, but it doesn't have everything that Linux has installed, and Homebrew is a useful package manager for getting some of the tools that come with Linux but, for whatever reason, don't come with a Mac. So as we get started today, we're going to revisit what we did last week, but we're going to automate that download so we don't have to use a browser to get these files. I'm going to start with a new issue. Coming back to this issue list, we see that we still have two issues here that we haven't resolved. Issue 4 is to align our FASTA file from the rrnDB to the SILVA seed alignment. We'll do this soon, probably in next Monday's episode, but we're getting there. Also, we're going to continue to generate a resource list, and so issue 5 might be an issue that we never really close until we're ready to write the paper or presentation or whatever we do with this, right? I don't want to jinx us by saying we're going to write a paper, but I'd like to. Anyway, I'm going to open up a new issue.
And so this is going to be issue 6, right? If our numbering is right. What I'm going to do is title it "Download reference files using command line tools" and write that I'd like to download the SILVA seed and rrnDB files using command line tools rather than a browser, so that if they're ever updated I don't have to repeat the manual downloads. Great. In the README files for those different directories, I had the links, so I'm going to say: see the links in the READMEs for data/raw and data/references to get the location of the files we want to download. The backticks around data/raw and data/references are a bit of formatting syntax called Markdown, which we'll talk about on Thursday. All you need to know right now is that if I use those backticks and preview it, the text is put into a special monospace font that makes it look more like it's been generated for a computer terminal, right? All right, so I'll go ahead and submit this new issue. This is issue 6. Now I'm going to go back to my command line interface and navigate to my project directory, and hopefully this has become old hat for you at this point. So I'll cd into Documents and then the Schloss rrnDB analysis XXXX 2020 directory, and do an ls just to remind myself of where this has been; it's been a few days since I've looked at this. I can do git status: we're up to date on the condition of our project. Let me show you another git command that I find helpful, again in the spirit of ls and git status for reminding ourselves where we were in a project, and that is git log. If I type git log, I get a listing of all the previous commit messages that I have made in my project. I see that I downloaded the rrnDB, I downloaded the SILVA reference files, and I installed mothur to the code directory. So again, the git log command is really nice and really handy for reminding us what we had done previously.
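The commit-diary idea above can be sketched in a throwaway sandbox. This is a minimal illustration, not the actual project repo: the path, file, and commit messages are made up, and the local user.name/user.email settings are just there so the commits succeed on a fresh machine.

```shell
# Throwaway sandbox showing git log as a project diary.
mkdir -p /tmp/gitlog_demo && cd /tmp/gitlog_demo
git init -q .
git config user.email "demo@example.com"   # local identity for the sandbox only
git config user.name  "Demo"
echo "notes" > README.md
git add README.md
git commit -q -m "install mothur to code directory"
echo "links" >> README.md
git commit -q -am "download rrnDB reference files"
# --oneline condenses each commit to a single line, newest first
git log --oneline
```

The --oneline flag is handy when the full git log output, with authors and dates, is more than you need for a quick "where was I?" check.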
So I'm ready to work on issue 6. Again, we've talked about GitHub flow before, so I'm going to get started with today's activities by checking out a new branch. I'll do git branch issue_6, then check the different branches I have, and I see that I have every issue as its own branch, and I still have all those branches. I could go through and delete them, but for now I'm going to leave them in here. So I'll check out that branch by doing git checkout issue_6. I'm on that branch now; again, I can do git status to prove to myself that I'm on branch issue_6. Nothing to commit, the working tree is clean. So let's go ahead and look at the README for the raw data directory. If I do nano data/raw/README.md, I see that I obtained the files from this link. I'm going to copy it and come back to my browser, and I see that for version 5.6 there are actually four files that I'd like to get. I'm going to right-click on one of those version 5.6 files to get the address; Copy Link will get me the address for that file. And if I come back to my terminal and right-click to paste, or do Command-V, you see the whole link is there. Now it would be nice to be able to hit Enter and get the file, but that's not the way things work, right? So instead we're going to use the wget command. I'll type wget and paste in the URL for the file that ends in rrnDB-5.6.tsv.zip. If I hit Enter, some output comes by; it looks like there are no error messages and everything worked. There's a progress bar here, so if it were a bigger file, it might take longer to go across. And if I now do ls, I see that I have rrnDB-5.6.tsv.zip, and it's in my project root directory, okay?
So that's great that we got it downloaded, but I'd prefer for it to be in my data/raw directory, right? We can use another flag with wget, so we can say wget --directory-prefix and then put data/raw/. If I hit Enter... I forgot to give it the URL, so I paste in the URL. I'm realizing that I perhaps forget how to use wget. The way I can get help with wget would be wget --help, which outputs to the screen all the different things you can do with wget. There's a lot here, and we're only going to cover two things. So let me see if I can find this directory prefix, the directory I want to output things to. Ah, here it is in the directory section: it says -P, or you can use --directory-prefix=PREFIX, for the directory I want to send it to. So I need to come back here and put an equals sign between --directory-prefix and data/raw/, and I wonder if I'm missing a space as well. Yeah, I was missing a space between the data/raw/ and the https URL. So that's good now, okay? If you get this error message that says missing URL and you think, well, I've got the URL right here, make sure that you do have a space between the https and whatever preceded it. Okay, so now we see that data/raw/rrnDB-5.6.tsv.zip is saved. If I do ls data/raw, I see that I have my rrnDB-5.6.tsv.zip file in here. So that's great. As we saw in the help document, we could use --directory-prefix, or we could use -P, and it does the same thing. Now, what I just noticed running this twice is an unfortunate behavior: it appears to be saving the file to data/raw, but it's concatenating a .1 onto the end. And you know what, I bet if I ran it again, I'd get a .2. So it's not writing over the file I already have, but it's appending a 1, 2, or however many times I've downloaded it.
So if I do ls data/raw, I see that I've got the .zip, a .zip.1, and a .zip.2. This is not good, okay? So I'm going to go ahead and remove those extra zip files. Hopefully you remember how we can use the rm command from the command line to delete files. So I'll do rm and then include those files... and I forgot my path, right? It's important that we give it the path to these files to delete them, so I'll prepend data/raw/. Now if I do ls data/raw, those are gone, okay? So the problem, again, is that I download this file and everything works great, but if I happened to run the command again, it's going to create the same file with a .1 tacked on, and if I do it again, a .2, and so forth, right? So how do I set up the command so that this doesn't happen? I can give wget another option, --no-clobber. Very descriptive. If I include the --no-clobber flag, wget is not going to download something if it's already found in my directory. So again, previously, when we ran this without --no-clobber, we would get a .1 tacked onto the file name. Now, with --no-clobber, what we get as output is exciting: our file is already there, not retrieving, right? So again, I can run this as many times as I want; the zip file is already there, and it's not going to retrieve it. If I remove the zip file from data/raw and run it with --no-clobber again, it downloads it just like we'd like. And again, if I keep running it, nothing happens. Now, writing out --no-clobber is descriptive, but it's kind of verbose. An alternative to writing out --no-clobber is -nc, and that again gets us the same functionality. Excellent. So with this we have downloaded our file without going to the browser. Sure, we used the browser to get the address, but we've downloaded it without having to right-click or move files around.
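The wget incantation we've built up can be sketched as a short two-flag recipe. This is a sketch only, not something to run as-is: it needs network access, and the URL here is a placeholder, not the real rrnDB address.

```shell
# Sketch only: requires network access, and the URL is a placeholder.
mkdir -p data/raw
# -P / --directory-prefix : save into data/raw/ instead of the project root
# -nc / --no-clobber      : skip the download (no .1/.2 copies) if the file exists
wget -nc -P data/raw/ "https://example.org/rrnDB-5.6.tsv.zip"
```

Run a second time, this should report something like "File ... already there; not retrieving." rather than leaving a duplicate .1 copy behind.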
It's downloaded directly to the directory we want it in. So I'm going to add this line to my README file: nano data/raw/README.md, and I'll paste that in. Sometimes nano doesn't do a great job of wrapping characters to the next line, so when I pasted this in, what I noticed is I have a dollar sign at the start of the line and zip at the end. If I go to the next line, I see the full syntax I had written out, with a dollar sign at the end, which tells me that there's more to the right. There are settings you can fiddle with in nano to get it to wrap a line for you, but don't worry about it. And I'm going to include some text here to say that we automated downloading the TSV file with wget, okay. Excellent. Control-O saves it, Control-X brings us out. Very good. So that's wget. That's how we download files from the internet without having to use a browser. It's very, very helpful. Okay, but we don't want to use a zip file; we want the actual TSV file. So how do we decompress it? That file name ends in .zip, which tells us that the file was compressed using a program called zip. So if we want to unzip it from the command line, the tool we'll use is called unzip, okay. We can imagine that we might want to do unzip data/raw/ and then the name of our zip file, right. And if we run unzip on this, yep, it tells us that the archive was located in data/raw and it's inflating rrnDB-5.6.tsv, okay. But if I then do ls... ah, I notice that rrnDB-5.6.tsv is actually in my project root. It's not saving it to my data/raw directory. I do have one in data/raw, but that's from last episode, so I'm going to go ahead and remove that. I'm also going to remove the one in my project root, and look at it, everything looks good. So how can we output the decompressed files to a desired directory? Well, again, we can do unzip --help.
When I look at the output from running unzip's help, I see a lot of really great flags that I could use to get unzip to do different things. But what I want is to be able to extract files into some directory that I define from the command line. As I look at help pages like these, I'm looking for words that match what I want to do, like "extract files to," right? Or something about a directory; maybe a d, for directory, something like that. Again, as I look at these different flags, there are a lot of interesting things here that I'd love to play with that I could see really helping me use this command better, but the one my eyes gravitate to is -d, extract files into exdir. So I'm going to try that out and we'll see what happens, right? One of the nice things about version control is that I can experiment, and as long as I'm backed up to my latest version, if everything goes bad, well, I can go back to a previous version of my repository. We'll see how to do that here shortly. So I'll do unzip data/raw/ and my zip file, and actually, I want to put in -d data/raw, okay? I'm going to unzip the contents of this zip file into data/raw, so let's give that a shot. It says the archive was the same as we saw before, inflating data/raw/rrnDB-5.6.tsv. If I look at data/raw, I now see that although we had previously deleted it, it's now there, right? And if I look at my project root directory, I don't have that rrnDB file there. Now, one thing to be aware of with a zip file is that it can contain multiple files that it has compressed. The zip file we're working with here only contains one file, but it might have had its own README file in it, right? And when we unpacked it, that could have created a new README file that would then write over my README file here, right?
And that's not very desirable behavior. So we can add another argument that we see up here: -n, never overwrite existing files. So I come back to this unzip command and add the option -n to not write over existing files. And again, nothing's changed; it all looks good. I'm going to go ahead and copy this unzip line and paste it into my data/raw README file, and these are the two lines now that get me rrnDB-5.6.tsv. There are three other files from the rrnDB that we need to get, and I'll leave those for you to work on through the exercises. I'm going to add here that we automated downloading and extracting the TSV file with wget and unzip. So if you wanted to make a zip file, what do you think you would use? Well, probably zip, right? So there's zip and there's unzip, okay? Great, we'll go ahead and write that to the file, save it, and exit back out. And again, like I said, we'll get those other three files later in today's exercises. The next thing we want to get is that SILVA seed alignment file. So I'm going to nano data/references/README.md, and, ah, I included the link here. Like we did for the rrnDB zip file, we're going to use wget with the same arguments that we used before. So we'll do wget -nc -P and the directory we want it to go to. With the rrnDB, we put that in data/raw; we're going to put this into data/references, right? Then I paste in the URL, and it downloads it for us. This one took a little bit longer, three seconds. We see that we now should have data/references/silva.seed_v138.tgz, and ls data/references shows that, sure enough, we have our tgz file there. Great, okay. So we've talked about how we can use wget to download a variety of files. wget is really versatile; you could download an entire website using wget. So it's really great. Now, this SILVA tgz file looks a little bit different than the zip file that we saw before.
It's got this extension, .tgz. What does that mean? Well, let me unpack this a little bit for you; it's a little bit of a pun. The gz is short for gzipped, okay? gzip is a lot like zip, but it has a couple of differences. zip is much more common for people using Windows; gzip is much more common for people using Macs or Linux. And gzip will only compress one file at a time, whereas zip will compress a lot of files together, okay? The other element of this name is the t, and that is short for tar. So this is a tarred, gzipped file. And what does tar mean? Going back to that idea that zip will allow you to compress multiple files together: tar is sticky. If you've ever touched it, or if you've ever driven over fresh asphalt and had it stick to your car, you know it sticks stuff together. So you can take a bunch of files, stick them together into what's called a tarball, and then gzip will compress that tarball to make it more efficient to store. Again, when we're downloading things from the internet, we don't want to be downloading monstrosities of files that are way too big; we want them compressed so that they're more efficient to download. zip, gzip, and tar.gz are all examples of doing that. All right, so I've got that downloaded, and what we're going to do to unpack this compressed tarball is run tar. To extract the tarball and decompress it, we're going to use a set of options, x, v, z, and f, and we'll then put in data/references/silva.seed_v138.tgz. Let me explain what's going on here: x is short for extract, v is short for verbose, z means use gzip (gunzip) to decompress it, and f means what follows is the file name that we want to decompress. So if I hit Enter now, we can see that it's extracting silva.seed_v138.tax, silva.seed_v138.align, and README.md.
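The x/v/z/f flags can be exercised safely in a sandbox without touching any real reference files. This is a throwaway sketch with invented file names; it also uses tar's -C flag to pick the output directory, which is the same trick we lean on later to keep extracted files out of the project root.

```shell
# Throwaway demo: build a small gzipped tarball, then extract it.
mkdir -p /tmp/tar_demo/src /tmp/tar_demo/out
cd /tmp/tar_demo
echo "alignment" > src/demo.align
echo "taxonomy"  > src/demo.tax
# c = create, z = compress with gzip, f = name of the archive to write
tar czf demo.tgz -C src demo.align demo.tax
# x = extract, v = verbose, z = run gunzip, f = the archive to read;
# -C sends the extracted files into out/ instead of the current directory
tar xvzf demo.tgz -C out
ls out
```

After this runs, out/ holds demo.align and demo.tax, and nothing lands in the directory you ran tar from.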
Now I'm a little bit worried, based on what we did with zip, that these aren't going to the right place. So let me see: ls, and sure enough, these got extracted to my project root, and I'm instantly concerned, because I extracted a README file and I had a README file in my project root directory already, and oh no, did I just screw things up? But of course, even if we screwed it up, we've got version control. Let me do nano README.md, and sure enough, it's the README for the SILVA files, not the README for my project. So how do we undo this problem? We're going to come back shortly to figuring out how to output these files to the correct directory, but what we just had happen is what's called a tar bomb, where we've extracted our tarball and it has wreaked havoc on our existing files. If I do git status, I see that I've modified README.md in my project root. I've also modified data/raw/README.md, and I know that's because I put in those lines of code for getting that TSV file from the rrnDB. My problem, though, is that I've modified the README in my project root directory. If I look at the suggestions here in the status output, I see that there's git restore README.md, and that will hopefully get me my file back. Another way, which works for older versions of git and which I use much more frequently, would be git checkout -- README.md. So again, don't freak out if you've written over something. It's good to be mindful and aware of what's going on as you go through, but also know that git is being very helpful in giving you options for how you want to commit things or restore things. I don't have to remember git restore; it's right there in the output. Again, if you've got an older version of git, git checkout -- README.md will bring it back. And to prove that to ourselves, if I now do nano README.md, I see I've got my original file.
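The recover-a-clobbered-file move can be sketched in a sandbox repository. Everything here is invented for illustration: a throwaway repo, a committed README, a simulated tar-bomb overwrite, and then git checkout -- to pull the committed copy back (on newer gits, git restore README.md does the same thing).

```shell
# Sandbox repo with a throwaway README to clobber and recover.
mkdir -p /tmp/restore_demo && cd /tmp/restore_demo
git init -q .
git config user.email "demo@example.com"   # local identity for the sandbox only
git config user.name  "Demo"
echo "my project README" > README.md
git add README.md
git commit -q -m "add project README"
# Simulate the tar bomb writing over our file
echo "README from inside the tarball" > README.md
# Bring back the committed copy (newer gits also offer: git restore README.md)
git checkout -- README.md
cat README.md
```

The final cat shows the original "my project README" line again; the accidental overwrite is gone.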
Great, so let's clean this up. I'm going to go ahead and remove silva.seed_v138.align and silva.seed_v138.tax from the project root. And now everything is good, okay? Again, if I do git status, I see I've only got that one modified file. Remember, we're not tracking the large files in data/raw. Okay, so how do we avoid this problem? Because this tarball contains a README file, if I extract it to my project root directory, or even if I extract it to data/references, which has a README file as well, I risk clobbering something. The best practice for dealing with tgz files, regardless of whether you think there's a risk of a tar bomb going off, is to create an empty directory to extract into. We can do this using the argument -C. So I can do tar xvzf data/references/silva.seed_v138.tgz -C data/references/silva_seed/, okay? I thought that even though this directory doesn't exist, -C would output the extracted files into it, making the directory if it has to. But no, it won't make the directory if it has to; I was mistaken on that. So what we need to do is make the directory for it: mkdir data/references/silva_seed, okay? mkdir makes a directory. So now if I do ls data/references, I see that I have a directory, sure enough, called silva_seed. And if I rerun my tar command, it extracts, and if I look at data/references/silva_seed, sure enough, those files are now there. Now I want to get rid of those silva.seed files sitting directly in data/references, because they're just cluttering up the directory, so I'll do rm data/references/silva.seed_v138.align data/references/silva.seed_v138.tax. And everything is good there. So I need to update my README, and I'm going to do this in a couple of steps. First I need to come back and grab this wget command.
So I'll put in my wget line, and then the mkdir data/references/silva_seed. I'm going to save this and come back, because I forget the exact syntax I used for tar... there it is; I'll put that down here. And then I'll say: downloaded the SILVA seed alignment and taxonomy files from there; we used wget, mkdir, and tar to download and extract the SILVA seed files to data/references/silva_seed. So again, I'll save this and come back, and everything looks good. I see I've got these two modified README files that we've been tracking, and my project root directory is clean; my references directory is also clean. I now have three exercises for you to work on, where I'm going to have you pause the video, work on the exercises, and then we'll come back and discuss them. First: what arguments would you give wget to supply a username and password? Alternatively, what argument would you give if you would rather have wget ask you for your password than have you type the password on the command line? I've shown you how you can find this; I haven't talked about how to supply a username or password, but I've shown you how to find information on these different commands from the command line. Second: how could you list the contents of the rrnDB zip file without decompressing it? And how about the contents of the SILVA seed tgz file without decompressing it, okay? Again, that information is stored in the same type of place where you would find out about supplying the username or password. And finally, I'd like you to finish issue 6 and push it to your version of the repository. Remember that there are three other rrnDB files that need to be downloaded, decompressed, and put into the correct location. So go ahead, pause the video now, work through these three exercises, and when you're done, press the play button, and I will show you how I would approach answering these exercise problems. The first question was: what arguments would you give wget to supply a username and password?
So my go-to on this, because I'm too lazy to open the browser and go to Google, would be wget --help, and what I'm going to be looking for, again, is something like username or password. There's a lot of text here, so I might do one scan of it. I see that I could use --http-user and --http-password; that's one option. I also see up here, under the download section, that there are --user and --password options, which are shorter than the ones we saw just a minute ago. So I could do --user=pschloss --password=abcd, and, you know, that's not my password, right? And what we see right below it is that if we'd rather have it prompt us for the password, we could use --ask-password. So we'd give it the username and then --ask-password to get it to work. The second question asks: how could you list the contents of rrnDB-5.6.tsv.zip without decompressing it? Again, for these types of questions, I'm going to do unzip --help. Looking at the different arguments, the first one that I see is -l, list files. There's another one that looks like it, -v, list verbosely, but that gives a verbose listing with extra detail we don't need; we just want it to list the files. So if I do unzip -l data/raw/rrnDB-5.6.tsv.zip, we see that it's got one file, this is the file name, and it does this without decompressing it for us. Again, this would be helpful for diagnosing problems like a tar bomb, right? If we looked at this and saw a README file, and we knew we had a README file in that directory, we'd know to be careful with that. Alternatively, what about the contents of silva.seed_v138.tgz? I'm going to do tar --help, and I see this allows us to manipulate files; let's see: create, add, replace, list, update, extract. So c is create, and when we extracted, we used x, right, with that xvzf.
So for listing, instead, I'm going to use t. I'll do tvf, and I don't even need the z here, since tar figures out the compression when it's just reading. So I will do tar tvf data/references/silva.seed_v138.tgz, and this lists out the contents of that archive without actually extracting it. I can prove that to myself by doing ls data/references and seeing that I don't have the decompressed versions there, and also by doing ls of where I am and seeing that I don't have anything going on there either. So -l for unzip will list out the contents of a zip archive, and tvf will list out the contents of the SILVA seed tgz file for us. Now we need to finish our work downloading the files from the rrnDB so we can finish issue 6. So again, I'm going to go back and open up the rrnDB... oh, no, I can't just retype over data/raw/README.md, because it has my instructions in it, right? So I'm going to come back to this URL. I already got the TSV, so I'm basically going to repeat these commands for the three other files, right? I'll go ahead and right-click on the next one, and then I'll use my wget and my unzip. So I'll do wget -nc -P data/raw/ and then the URL that I'm going to download from. Very good. And now I'm going to unzip that file: unzip -n -d data/raw and then the rrnDB-5.6 fasta.zip file I want to decompress; I'm pulling the name from here, right? I'm using -d so it'll output there, and I'm including that -n so that I'm not overwriting things, just to stay in the practice of doing that. Very good. So I'm going to copy these two commands into my README file: the wget and then the unzip. All right, and now we have two more files to go, and I'll try to do those quickly: right-click to get the link, wget -nc -P data/raw/ and then the URL, and then unzip -n -d data/raw and then the file I want to decompress.
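Backing up to that listing trick: it can be sketched with a throwaway archive, swapping t in for x so nothing actually gets extracted. The file names here are invented for illustration; the unzip analogue is the unzip -l command covered above.

```shell
# Throwaway demo: list a tarball's contents without extracting anything.
mkdir -p /tmp/list_demo && cd /tmp/list_demo
echo "seed" > seed.align
tar czf seed.tgz seed.align
rm seed.align            # so the only copy now lives inside the archive
# t = list, v = verbose, f = archive; modern tars detect the gzip
# compression on their own when reading, so z is optional here
tar tvf seed.tgz
ls                       # seed.align is gone; only seed.tgz remains
```

The tvf output names seed.align, yet ls confirms nothing came back out of the archive.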
And let me just give a quick check to make sure I'm getting everything. So we're good: what we see is that we've got our zip files and our TSV files. So let me go ahead and copy all of this. One of the nice things in nano is that you can do Control-K to delete lines that you don't want. And again, we're going to do the same thing. What we notice is that on these commands where I'm using -n, it's not actually decompressing them again, because I already had those files in there from when I downloaded them before. All right, so I'll go ahead and copy these into my data/raw README, and I'll use that great Command-K... I guess it's not Command, it's Control, Control-K, to remove entire lines. In the next Code Club, we'll be using a fancy text editor called Atom, which makes a lot of this just easier. So save that and come out. Now, I think we've gotten everything we need to get. I'll do git status: we've updated these two README files. So I'm going to go ahead and make my commit. I'll do git add data/raw/README.md data/references/README.md. Again, because we're telling git to ignore anything in data, it's asking us to use -f, because we do want to track these README files. git status shows they've been modified and staged. Again, if we wanted to undo this, kind of like we saw before with that README file, we could look at the git restore command. I can now do git commit -m, and my message is going to be "download files using command line to improve automation"... well, we haven't really set it up for automation yet, that's going to come later, so maybe I'll just leave it at "download files from command line." Now git status shows nothing to commit, everything is clean, so we're good. So we're going to go ahead and git checkout master, and then I'll do git merge issue_6. And I'm realizing as I type this that I forgot to put "closes #6" in my commit message.
I do that a lot. So what we're going to do is take the hash from the commit: when I did the commit, it has that issue 6 message and then there's this string, and I'm going to copy that. Now, when I do git push... I'll do a pull first just to be sure. And this is some weird output; I'm not sure what's going on, but I think everything's fine. I will then do git push. I can now close my rrnDB browser tab, and if I go to my code and look at data/raw, for example, I see I've got "download files using the command line." So it went through, and this happened just a minute ago. So I'm ready to close my issue. I'll go ahead here and paste that string from the commit message, that 02410cc hash from up here, so that it closes out the issue, and then I'll comment and close the issue. Very good. Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. Can you think of data that you're downloading from the internet that you could be accessing with wget? I'd love to see how you're adapting what I've covered in this and other Code Club episodes in your own work. Also, let me know what types of data analysis questions you have, and I'll do my best to answer them in a future Code Club; go ahead and put those in the comments below this video. Be sure to tell your friends about Code Club, to like this video, subscribe to the channel, and click on the bell so you know when the next Code Club video drops on Thursday. Keep practicing, and we'll see you next time for another episode of Code Club.