I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data. In the last two episodes, we created a few bash scripts and a Makefile to help automate running those scripts while keeping track of all the dependencies. You may also recall that we aren't tracking changes to our large data files with version control because they're just too big. But with our bash scripts and make, we can regenerate them if they somehow get deleted or we need to start over. Of course, you should also be able to use my files to reproduce what I've already been able to do, which is a win for reproducibility. Because these tools are so powerful, we want to spend a little bit more time with them. In today's episode, we're going to clean up our scripts and Makefile a bit. If you look at our Makefile and the bash script for downloading the rrnDB files, you'll see that we pass in the name of the file we want to download, but we didn't include the path indicating where it should go once it's extracted. Instead, we hard-coded that it should be output to our data/raw directory. If I ended up wanting to change where I put the files, I'd have to change the bash script rather than the Makefile rule. Today, we'll learn a command-line tool, sed, S-E-D, that will allow us to give the script the file's name with its path. I admit this is a relatively small improvement for our overall project. But along the way, we'll learn a few other bash tools, including the pipe and how to capture output to a variable. We'll also see how we can use variables that are built into make to simplify our Makefiles. Because these improvements will make our analysis easier to develop, they'll also make it more reproducible.
Even if you haven't been following along with the past episodes, I'm sure you'll get a lot out of today's video, so please stick around. Also, take the time to follow along on your own computer and attempt the exercises. Don't worry if you're not sure how to solve the exercises. At the end of the video, I'll provide solutions. If you haven't been following along but would like to, welcome. Please be sure to subscribe to the channel and click on the bell icon so you know when the next episode is released. Please check out the blog post that accompanies this video, where you'll find instructions on catching up, reference notes, and links to supplemental material. The link to the blog post for today's video is below in the show notes. I've gone ahead and created an issue for today's Code Club episode, where I want to refactor my code/get_rrndb_files.sh bash script to parse out the path and file names. So this is going to be issue 11. If I go to my terminal, I'm already in my project directory. Again, I have this in my Documents directory, Schloss rrnAnalysis XXX 2020. It doesn't matter where you put it, as long as you know where it is. All right, ls, and I see my files here. Just to clean things up a little bit, I'm going to go ahead and get rid of those mothur logfiles that, as I told you before, store the output of running the align.seqs command that was in one of the shell scripts we created. I'll do git status and see that everything is up to date with origin/master. That's GitHub. There's nothing to commit; my working tree is clean. I want to go ahead and create a branch for today's issue, issue 11. So I'll do git branch issue_11, then git checkout issue_11, and it's switched to branch issue_11. git status, we're all good. I'll go ahead and open up Atom from my terminal. And let me open up my Makefile, as well as my get rrnDB files script.
So what you'll notice in the Makefile rules that I have here for getting those files, because there are four rrnDB files, is that I call the shell script and then I give it the file that I want to be sure to download and extract. And you'll see that instead of giving the full path (there's no data/raw here), it starts with rrnDB and goes through the rest of the file name. What I'd like to be able to do is give it the full path, something like this, right? I would rather not have to worry about whether or not to include the path. I'm going to go ahead and save that for now because that'll be a good test case for what we're trying to do. And if I go to my get rrnDB files script, we see that we read the argument in as archive, we then use wget to download the archive to data/raw, and we then unzip that archive. That was in data/raw with a .zip, and we extract it back to data/raw. We talked about this in previous episodes where we talked about wget, zip, and tar. We then touch it because the files that were extracted have a very old timestamp from when they were created on whatever computer generated them, versus the timestamp for when I'm actually downloading and extracting them. So what I'd like to be able to do, again, is to take in the target name with the full path and pull out the path as well as the file name, so I could perhaps put a path variable here and here, put in the variable name here, and then perhaps extract the file name like we have here for archive and put it there. And then I could again replace that with the full path. To illustrate a few concepts with sed and the other bash commands we're going to work through today, I'm going to create a variable, target, which will be data/raw/rrnDB-5.6_16S_rRNA.fasta, and again I can see the value of target by doing echo $target. Next is the general idiom, or the general way of writing a sed command. sed is the stream editor, which allows us to do operations like find and replace.
So if you've ever used Microsoft Word and you've gone through and found one word and replaced it with something else, at the command line in bash you can do that with sed. Basically every programming language, including bash, has some way to do this operation, and I find it really powerful as a way to work with file names. The general way to do this is echo $variable, then a vertical line (I'll come back and tell you what everything does here), then sed "s/find/replace/". What this does is output the value of my variable, right? Kind of like we did here with echo $target, so it outputs that. That vertical line is on the backslash key, so it's Shift-backslash, and the backslash key is right above the Return or Enter key on your keyboard. That vertical line is a pipe. It's nice that it's a straight line. It looks like a pipe. So we pipe the output from echo into sed, and again, you can think of a pipe as carrying flowing water. It goes into sed, and then with sed you have the quote marks and you have s, and s is short for substitute, and then you have these forward slashes separating out the find and replace values. What we want to find we put where find is, and what we want to replace it with we put where replace is, and then we end it with that final forward slash and the closing quote. As written, this won't do anything useful. It doesn't give me an error, but there's no variable called variable. So if we think about target, I can do echo $target | sed "s/raw/ and then (I've talked about this a few times, and I won't actually do it here) maybe we'd rather have the raw directory named rrnDB, so rrnDB, then a forward slash and a closing quote, giving sed "s/raw/rrnDB/", and run that. And we see that our path name has gone from data/raw/rrnDB... to data/rrnDB/rrnDB.... So we took the word raw and we replaced it with rrnDB. What if I wanted to change the data/raw part?
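As a minimal sketch of the echo, pipe, and sed idiom just described: the file name below is reconstructed from the spoken transcript, so treat its exact spelling as an assumption.

```shell
# Example value for illustration; the exact file name is an assumption
# reconstructed from the spoken transcript.
target="data/raw/rrnDB-5.6_16S_rRNA.fasta"

# Show the value stored in the variable
echo "$target"

# General idiom: echo "$variable" | sed "s/find/replace/"
# Here the first occurrence of "raw" is replaced with "rrnDB".
echo "$target" | sed "s/raw/rrnDB/"
# prints: data/rrnDB/rrnDB-5.6_16S_rRNA.fasta
```

The variable itself is not changed by this; sed only edits the text that flows through the pipe.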
The data/raw part has a forward slash in it, which could be kind of confusing. So echo $target | sed "s/data/raw/rrnDB/". This will throw an error because sed doesn't know what to make of it, since I've got a forward slash in my search term. So what I'm going to do is put a backslash before that forward slash. That's called an escape character, and it means actually match a forward slash: sed "s/data\/raw/rrnDB/". So we run that, and everything is good. sed doesn't really care if you use forward slashes or not, as long as you use some character that you have three of. So we could do echo $target | sed "s_data/raw_rrnDB_". We used forward slashes in the previous examples. Here we're using underscores, and that does the same thing. I prefer to just stick with the forward slashes and to use the escape characters if I'm writing out paths, but you do what works for you, and know that you don't necessarily have to use those forward slashes. You can use any character as long as you've got that set of three. We have the s, then a forward slash or some other character, the search term, that character again, the replace term, and that character again. Again, let's think about what we're trying to do here with parsing out the path name as well as the file name separate from that, and how we can take what we've already talked about with sed and apply it to our target variable to get those two different elements. What we'll end up doing is two different sed commands on separate lines of code. But we want to think about how we can adapt what we've already done here to generate those values. So I'll do echo $target | sed "s/data\/raw\/ with a backslash and forward slash at the end, because I want to get that final forward slash so that my file name doesn't start with a slash. So that's my search term, right?
So my search term is data/raw/ and I'm going to replace it with nothing, okay? The two forward slashes in a row mean replace it with nothing, giving sed "s/data\/raw\///", and so we run that. And sure enough, we get back our file name. Alternatively, we could do echo $target | sed "s/rrnDB-5.6_16S_rRNA.fasta//". At first I forgot my s and my forward slash, so I add those, and then I replace the match with nothing, okay? Close quote, run that, and then we get our path. One of the problems with these last two sed calls is that they're very specific, right? What I'm trying to do here is be more general. What we can do with this first one is replace the search term with what are called metacharacters. So for my search, the period character represents any character. A space, a letter, a number, a hash sign: that period will match any symbol. If I put a star after it, it will match zero or more of any character, right? So period star, .*, will match a string of any length of any characters that you throw at it. And then sed "s/.*//" will replace it with nothing, right? So the output there is nothing. What I'd like to do is cut off that search term by putting in a forward slash: sed "s/.*\///". That star is what's called greedy, so it will match as many characters as it can while still being followed by a forward slash. If I run that, it goes to the last forward slash, because it's greedy, and returns my file name. So that's awesome. For now, I'm going to go ahead and copy this into my script. I'm not done with it, but I'm going to save that so I don't forget it. So we've got the file name. Now we want to figure out how we get the path, right? What I'd be tempted to do might be echo $target | sed "s/\/.*//", a forward slash followed by period star, hoping that would replace everything after the final forward slash, okay? But that doesn't work.
That only returns data, the first part of the path, because again, that star operator is greedy. It's going to try to match as many characters as it possibly can while still matching the full search term. So we need to do something a little bit different. I will do echo $target | sed, and I'm going to rewrite what I had for extracting the file name. So I'll do .*\/, period star, backslash, forward slash. That's what we had before for extracting the file name. I'm going to go ahead and add a pattern for the file name, so another .* to represent the file name, and then my two forward slashes: sed "s/.*\/.*//". What I can do is put parentheses around the part of the search pattern that I actually want to keep, and I can then output whatever matched inside those parentheses. But I have to use backslashes. So I can do backslash open parenthesis at the beginning of the search, and backslash close parenthesis after that final forward slash, and I can output what's in the parentheses by using \1: sed "s/\(.*\/\).*/\1/". And that gets me data/raw/, okay? So again, the parentheses wrap around whatever I want to save and then output in that replacement slot. Now, perhaps you might forget to put in the .* to represent the file name. That will output the whole target unchanged, because the file name isn't part of the match and so isn't replaced. But if I put in the .*, it's going to replace the full thing, right? The path and the file name get replaced with the \1, which is the path. But the backslashes on the parentheses get kind of annoying, right? There's a special flag we can use so that we don't have to put backslashes on the parentheses. We can use -E, and that E is short for extended regular expressions. This pattern with the dots and stars and parentheses and everything is called a regular expression. It's a fancy term for the search pattern, and -E allows us to do more sophisticated things in those regular expressions. So again, if I run sed -E "s/(.*\/).*/\1/", I get back data/raw/. Very good.
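The sed patterns worked through so far can be collected into one sketch. The file name is again reconstructed from the spoken transcript and should be read as an assumption; the output noted beside each command follows from the pattern itself.

```shell
# Example value for illustration; exact file name is an assumption
target="data/raw/rrnDB-5.6_16S_rRNA.fasta"

# Escape a forward slash in the search term with a backslash
echo "$target" | sed "s/data\/raw/rrnDB/"   # rrnDB/rrnDB-5.6_16S_rRNA.fasta

# Any delimiter works as long as it appears three times
echo "$target" | sed "s_data/raw_rrnDB_"    # rrnDB/rrnDB-5.6_16S_rRNA.fasta

# Greedy .* runs to the LAST slash; replacing with nothing leaves the file name
echo "$target" | sed "s/.*\///"             # rrnDB-5.6_16S_rRNA.fasta

# Tempting but wrong for the path: matches from the FIRST slash to the end
echo "$target" | sed "s/\/.*//"             # data

# Capture the path in a group and emit it with \1 (basic syntax: \( and \))
echo "$target" | sed "s/\(.*\/\).*/\1/"     # data/raw/

# Same thing with extended regular expressions: no backslashes on parentheses
echo "$target" | sed -E "s/(.*\/).*/\1/"    # data/raw/
```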
Again, I'm going to copy this and put it here in my script, and I should write down what they are, right? So this one is path, and this one is filename equals that. And I've changed archive. What I'm working with is really the target, right? So I'll say target=$1. It's the target that's being fed from the make rule into the recipe. Let's go into bash and see how we can assign this output to a variable. Again, this is my path. So I might be tempted to do path=echo $target piped to sed. That doesn't work. What I need to do is put backticks around the bash code, and then its output will be assigned to the variable. So I could do something like files=`ls`. What that'll do is run the ls command and assign the output from ls to files. So if I do echo $files (files, not file), I get a variable that contains all the files in this directory. We want to do the same thing here, where we do path=`echo $target | sed -E "s/(.*\/).*/\1/"`, with an opening backtick before echo and a closing backtick at the end. And then if I do echo $path, I get my path, which is awesome. So I can take this over to my script now and put in a backtick, and sometimes Atom will put in an extra backtick. It's trying to be helpful. What you can do is highlight the whole thing you want in backticks, hit the backtick key, and it gives you the opening and closing backticks. Wonderful. To test this, I'm going to go ahead and comment out these lines, and I'll do echo $filename and echo $path, and save that. In my Makefile, I've already put the target where the file name had been. So I can do make, space, my target, and it outputs my file name as well as my path. So it worked inside of my bash script. I'm very happy. Now, I actually want to use these things and I don't want to echo them. So I'll go ahead and uncomment these lines, and now I need to replace the values I have in here with my file name and my path.
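Capturing the sed output with backticks, as described above, might look like the following sketch. The set -- line only simulates the argument that make would hand to the script as $1, and the file name is an assumption reconstructed from the spoken transcript.

```shell
# Simulate the argument make would pass to the script as $1
# (the exact file name is an assumption)
set -- "data/raw/rrnDB-5.6_16S_rRNA.fasta"

target=$1

# Backticks run the enclosed command and capture its output;
# $(...) is an equivalent, more modern spelling of the same thing.
path=`echo "$target" | sed -E "s/(.*\/).*/\1/"`
filename=`echo "$target" | sed -E "s/.*\///"`

# Temporary checks, as in the video, before the values are actually used
echo "$filename"   # rrnDB-5.6_16S_rRNA.fasta
echo "$path"       # data/raw/
```

Quoting "$target" is a small precaution in case a path ever contains a space.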
So this is going to be $path. Wherever I see data/raw on its own, I'm going to replace it with $path. I'm going to leave these here because this is my target, right? So that's $target, and then this is going to be $target.zip, and as we talked about before, it's probably best to wrap these in quotes in case there happens to be a space in my target file name. I know there isn't, but it pays to be careful. And then this $archive I'm going to replace with $filename, and I think I've updated everything. So I'll go ahead and save that. I'll come back to my terminal, and I'm going to remove that target and then make it again to make sure everything works. Very good. And we see that it was just updated and just processed. So that worked wonderfully. Looking at our Makefile now, one thing that stands out to me is that I've got a little bit of repeated code. This is not a big deal, but make has some things built into it called automatic variables. I'm going to talk to you about one automatic variable now, and in the exercises you'll learn about two others. Automatic variables are nice, but as we've already seen, you can get a decent ways with make without having to worry about them. So if automatic variables are confusing to you, don't worry about it. They're there to make your life easier, not to confuse you. There's an automatic variable called $@, dollar sign at, okay? $@ is the same thing as the target of the rule. So I can use $@ wherever I have the target in the recipe, right? And because I've updated the script that all of these rules call, I can update all of them to use $@, and wherever make sees $@, it will use the target name in its place. I'll save that and I'll go ahead and generate this other target, and it should run because I've updated the script that it uses as a prerequisite, and it runs that.
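To see what $@ expands to without downloading anything, one option is a throwaway project and a dry run. This is only a sketch: the script and file names are assumptions reconstructed from the spoken transcript, the script here is an empty stand-in, and printf is used to write the Makefile so that the tab make requires before the recipe is explicit.

```shell
# Build a scratch project in a temporary directory
workdir=$(mktemp -d)
cd "$workdir"
mkdir -p code

# Empty stand-in for the real download script (name is an assumption)
touch code/get_rrndb_files.sh

# One rule: the recipe passes the target name to the script via $@.
# \t writes the tab character that must start a recipe line.
printf 'data/raw/rrnDB-5.6_16S_rRNA.fasta : code/get_rrndb_files.sh\n\tcode/get_rrndb_files.sh $@\n' > Makefile

# -n prints the recipe make would run, with $@ already expanded
make -n data/raw/rrnDB-5.6_16S_rRNA.fasta
# prints: code/get_rrndb_files.sh data/raw/rrnDB-5.6_16S_rRNA.fasta
```

Because the target name appears only once, on the left of the colon, renaming the output file means editing one place in the rule.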
And if I again do ls -lth data/raw, I see that, yep, sure enough, that file's there as well. I'll go ahead and generate these other targets, and the last target to regenerate is my alignment file. And we all know that takes a while. So I'll go ahead and run that while you all work on the exercises. As always, I have three exercises for you to go through to practice honing your skills with what we've talked about in today's episode. In the first exercise, I'm asking you to explain what happens if we leave off the echo command in our script. I presented echo as outputting the name of the file, but what actually happens if we don't include the echo? What kind of error message do we get? And do you remember where we've seen that error message before? Next, what I'd like you to do is write the command to extract the variable region from the path data/v4/rrnDB.align. This is something that we'll probably work on in the next episode. How would I get that v4 to be its own variable? I'd like you to save it to a variable named region. Try to make your regular expression as general as you can. The starting place is to be specific, right? To say let's match data, remove data, match rrnDB.align, remove that. But see if you can make it as general as possible. In the third exercise, what I'd like you to do is to look more at these automatic variables. We've already seen that $@ represents the target name, and make uses that in the recipe for that rule. You can also use a dollar sign caret, $^, where the caret is the upward-pointing character above the 6 on your keyboard, to represent the names of all of the prerequisites. You can also use the dollar sign less-than sign, $<, to represent the name of the first prerequisite for your rule.
Think about how you could edit the rule that generates the rrnDB FASTA file in data/raw, as well as the rule for the alignment file, so that the recipe for each of those rules only includes these automatic variables. And I'd like you to consider the pros and cons of using these automatic variables. Go ahead and work through these exercises, pause the video to give yourself enough time, and then once you're done with them, go ahead and press the play button and I'll show you how I work through the solutions. Well, I hope you found those exercises engaging and that they helped you stretch your brain muscles a little bit to hone your skills on the new material from this episode. For the first exercise, again using our target variable, I asked you to run this bash command and assign the output to the path variable. You'll recall that previously we had echo here before $target. And so the question is, what does this error mean? What does permission denied mean? Well, we saw this when we were talking about bash commands and bash scripts two episodes ago. What bash is trying to do is run the command represented by $target. So it's trying to run this file as an executable. But if we do ls -lth on that file, we see that we only have read and write permissions for it. It's not executable, okay? So it can't execute it. And again, what it's trying to do is execute that file and then, with the output of running that file, run sed on it. But because it can't execute it, it complains and crashes out. Now, if somebody was running this code and running into this problem, one thing I could foresee is that they're trying to run sed on the contents of a file. That's not something we talked about, and it's not something I do a whole lot, but what we could do would be, say, sed -E "s/(.*\/).*/\1/", with the parentheses and everything. So this is basically what we saw before, but there it was receiving the output from echo.
What you could do then is add a less-than sign and the name of the file that you want, or $target, right? The name of the file. What this does is take the contents of the file named in target and funnel that into sed. That less-than sign I kind of think of as an arrow operator, right? So you run that, and I forgot my closing quote, so it's going to complain. I'll put my closing quote there and get rid of this other closing quote, and that should work. Again, it's going to output everything to the screen, and it's going to run that sed command on every line of the file. I can hit Control-C, but it finished. You can hit Control-C if you're sick of the output. Again, what we saw in this line is that it takes the contents of the file named in target, shoves it through sed, and runs sed on each line of the file. But what we've been doing is echo $target, and that echoes the value of target. It repeats the value of target and then runs that output, that string, that file name, through sed, so that if we do echo $path, we again get our path. In the second exercise, we're given a new path, so I'll say path=data/v4/rrnDB.align. What I'd like to do is pull out that v4. What we could do is echo $path and then sed "s///" with the closing quote, right? And then we go back in and fill in the find and replace. So I could do sed "s/data\///" to replace data/ with nothing, then pipe that to sed "s/\/rrnDB.align//". I forgot an extra forward slash at first, and with that added it returns v4, right? But again, that's very specific. We'd like to have a simpler way that's perhaps more generalizable to other paths we might get.
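The specific, two-step approach to the second exercise could be sketched as follows, with the result saved to a variable named region as the exercise asks. The spelling of the path is taken from the spoken transcript and is an assumption.

```shell
# Path from the exercise (exact spelling is an assumption)
path="data/v4/rrnDB.align"

# Step 1 strips the leading "data/"; step 2 strips the trailing "/rrnDB.align"
echo "$path" | sed "s/data\///" | sed "s/\/rrnDB.align//"   # v4

# Capture the same pipeline's output in a variable with backticks
region=`echo "$path" | sed "s/data\///" | sed "s/\/rrnDB.align//"`
echo "$region"                                              # v4
```

This works only for this exact directory and file name, which is why a more general pattern is worth looking for.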
So I could do echo $path and use sed, and I could have the .* to match the data part, then that forward slash, and then .* to represent the v4 part, and then another forward slash. So I need a backslash before that as well, plus a .* to match the file name, and then replace all of that. What I'd like to do is wrap the middle part, the part between the data directory and my file name, in parentheses. I don't want the forward slashes inside the parentheses, and then I can output that group with \1, giving sed "s/.*\/(.*)\/.*/\1/". When I run this, I get an error because, you'll recall, if I don't use -E then I need backslashes on the parentheses in my search term. Again, what I can do is add -E, and that works, and the output then is v4. So hopefully you found that a little bit challenging as practice in using sed to parse paths and get useful information out of them. We'll definitely see that again in a future episode, because what we're going to want to do is perhaps look at how amplicon sequence variants perform for different variable regions of the 16S rRNA gene, and so maybe I'll put alignments for different regions into different directories and I'll want to be able to pull out that region to know what region we're looking at. Finally, I asked you to look at using these automatic variables to simplify the text a little bit. In the example that we've been working with for the rrnDB FASTA file, the first prerequisite is the only prerequisite, so we can replace its name here with $<, dollar sign less-than sign. You'll notice that I don't have the period forward slash, ./, and it actually turns out that we don't need the ./ when we're calling our shell scripts, if they're executable and if they're in a different directory. So if they're in my current working directory, I need the ./, but if they're in a different directory, I can just put in code/, right?
So it clearly works if you leave it in, and it also works if you leave it off. I'll save this, and then I'll remove that target and do make -n to see what it looks like, and we see that, sure enough, it puts in the shell script as well as the FASTA file, and if I then run it, everything works swimmingly. Another example where this comes up, and it's a little bit harder, is the rule for generating the alignment. You'll notice that the first prerequisite is the reference alignment, but my third prerequisite is the script that's actually running this. So if I want the script to be $< (I get the two confused all the time), then what I need to do is reorder my prerequisites, which is perfectly acceptable. So I'll cut that, paste it up here, put the backslash on, tab this over to make it look pretty, then remove that final backslash, get rid of this extra line, and save it. Then I can do make -n on that, and it tells me it's going to run code/align_sequences.sh. Again, I'm not totally convinced that I want to use all these automatic variables. I don't think this is super readable. I don't think this tells me a lot about what's going on. I do like having the name of the target as $@, but again, it doesn't really matter. It's meant to speed up your coding time. I suppose if you were to change the name of this shell script, then you would also have to change it here, and so using the variable is handy, but the cost is that it's not as readable. So I'm going to go ahead and not use those extra automatic variables, and I'll leave it as this. So I'm ready to close out my issue. git status: I've changed my Makefile as well as my get rrnDB files script. I'm going to go ahead and do git add Makefile code/get_rrndb_files.sh, then git status, and everything is staged and ready to commit. I'll do git commit -m with the message "Refactor code to parse path and file name from target, closes #11". That's two episodes in a row now.
I remembered to put the closes in the commit message. Very good. I'll then do git checkout master. Now I'm on my master branch. git merge issue_11, very good. And we can git push, and we see that the issue has been closed. Thanks again for joining me for today's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. In the blog post for today's episode is a link to a great tutorial on using sed. Whenever I Google something about sed, I invariably return to that page. I'd really encourage you to work through that tutorial if you have the time. I'd love to see how you are adapting what I've covered in this and other Code Club episodes into your own work. Also, please let me know what types of data analysis questions you have and I'll do my best to answer them in a future episode. Be sure to tell your friends about Code Club, to like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club video drops. Keep practicing and I'll see you next time for another episode of Code Club.