I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data. For today's episode, we're going to press the pause button on our data analysis exploring the sensitivity and specificity of amplicon sequence variants in microbiome research. To this point, nearly everything we've been doing has been through the command line. But before we start using R more for our analysis, I'm going to give you a broader perspective on why I find the command line interface to be such an indispensable part of my reproducible research workflow.

When I fire up my terminal app, people who don't know me well often look on in disgust, no doubt asking themselves what century I was born in. Surely, they think, this is 2020. We have beautiful screens, beautiful web apps, touch screens, amazing software with graphical interfaces. Yep, they're fun. I use them. But if your goal is to conduct a reproducible analysis, then you really need to step away from those tools and dig into the command line. Another reaction I get is, "Don't you know about this great R package that'll do that?" I love R, but it's just not always the right tool for the job. Sometimes using R is like using a sledgehammer to pound in a nail. Or worse, sometimes it's like using a sledgehammer to pound in a screw. I could, but why? Doing the same thing at the command line with bash commands would be so much easier for many of these tasks.

In today's episode, I'm going to give you my top 10 reasons why I think you should learn more about the command line interface. You should definitely learn a programming language like R or Python. But after you've gotten your feet wet with one of those, you should really strengthen your command line skills. Today's video will tell you why.

Reason one: reproducibility. Yes, you can give written instructions to someone for how to get data out of software that uses a graphical user interface, or GUI. It is tedious, and the instructions are likely to change whenever the interface changes. Think about how the instructions for building a chart in Microsoft Excel change depending on whether you're on Windows or Mac OS X, or whether you have an older or newer version of the software. It's a train wreck. Added to that, most of the glitzy software is proprietary or operating system dependent. But if you have a computer, then you can get a command line environment for free, and it's the same environment regardless of whether you have a Mac, Linux, or Windows computer. If you have a Linux-type environment, then you should be able to reproduce my bash code. Of course, my code will be open and transparent, so you'll see exactly what I'm doing.

Reason two: building a data analysis pipeline. Regardless of whether I'm using make, Snakemake, or a bash script, it's all done through the command line interface. When I work on a project, my guiding ideal is that I should be able to run make clean to delete all the derived data files, then run make manuscript.pdf, walk away, and do something else. When I return, voila, everything from downloading the files to rendering a PDF version of my document is done without me touching anything. If I can do that on my computer and on yours, I consider the work reproducible. To make that happen, everything has to be scripted at the command line.
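To make that idea concrete, here's a minimal sketch of the bash-script version of such a pipeline. The file names, URL, and helper scripts are hypothetical stand-ins, not the actual pipeline for this project; using make or Snakemake instead would add dependency tracking so that only out-of-date steps get rerun.

```bash
#!/usr/bin/env bash
# Hypothetical end-to-end pipeline: every step from downloading the raw
# data to rendering the manuscript is scripted, so anyone can rerun it.
set -euo pipefail

# "clean" step: delete all derived files so the analysis starts from scratch
rm -rf data/processed results figures manuscript.pdf

# download and unpack the raw data (placeholder URL)
mkdir -p data/raw data/processed results figures
curl -L -o data/raw/raw_data.tar.gz "https://example.org/raw_data.tar.gz"
tar -xzf data/raw/raw_data.tar.gz -C data/raw

# run the analysis and render the manuscript without manual intervention
Rscript code/analysis.R
Rscript -e 'rmarkdown::render("manuscript.Rmd", output_file = "manuscript.pdf")'
```

The set -euo pipefail line makes the script stop at the first failed step rather than plowing ahead and producing partial results.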
Reason three: running bioinformatics and data analysis software. A lot of the software you'll run is research software, meaning it was developed by someone like me to do some piece of analysis. We don't have the funds to develop a beautiful interface, so you get a command line interface, because that's easier and cheaper to develop and maintain. Also, you do know that when you run RStudio, it's running on top of R, right? Odds are good that if you want to run R on your cluster, you're going to have to figure out how to run it from the command line and not in RStudio.

Reason four: doing things at scale. At a simple level, I might have a project where I need to rename a bunch of files to replace a hyphen with an underscore or to make capital letters lowercase. This is relatively straightforward to do with a for loop in bash, as the sketch below shows. At a more complicated level, I often do large benchmarking analyses, or I have a bunch of parameters I want to test, or I might need to run an analysis 100 times with different seeds for a random number generator. Sure, I could do these analyses completely within R or Python; I'm sure there's a package for that. But it's often easier to fire off each permutation of parameters or seeds independently of the others from the command line. When doing this on my university's high-performance computing cluster, I can exploit the strengths of our scheduler software to run the jobs far faster than I could using those parallelization packages in R or Python.
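Here's a minimal sketch of that kind of renaming loop; the .fastq file pattern is a hypothetical example.

```bash
# Rename every .fastq file in the current directory, replacing hyphens
# with underscores and making the names lowercase.
for old in *.fastq
do
    [ -e "$old" ] || continue  # skip the literal pattern if nothing matches
    new=$(echo "$old" | tr '-' '_' | tr '[:upper:]' '[:lower:]')
    [ "$old" != "$new" ] && mv "$old" "$new"
done
```

Quoting "$old" and "$new" keeps the loop safe even when spaces have already crept into the file names.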
Reason five: organizing your project. The commands built into your command line interface come pre-installed to create and remove directories and files. Using these tools enforces a hierarchical structure on your project that makes it easier for others to see what you are doing and where the inputs and outputs come from. Another benefit of using command line tools to organize your project is that they discourage the use of spaces in directory and file names and encourage you to give your files and directories consistent and meaningful names. A benefit of having well-named things and a good organizational structure is that you can then use command line tools to operate on specific files or directories. For example, you can remove every file in a directory that contains the stub temp anywhere in its name. Doing that in Finder or Windows Explorer is not as straightforward, and I often find it very tedious. Organization is an undervalued part of making an analysis reproducible, and command line tools make it easier.

Reason six: getting and posting data. There's a treasure trove of data available for free that is best accessed using command line tools. If you're working with sequence data from the NCBI, you're going to need to use their toolkit or navigate their FTP server, and all of that will likely be done from the command line. And when you're ready to post your data, you will make your data public, right? You'll use command line tools to post it, especially if you're working from a high-performance computing cluster. These tools are great because you can also direct where the data should go when you download it and where it should come from when you upload it. Combined with the compression and archiving tools we saw in an earlier episode, like gzip, bzip2, and tar, your ability to use remote data expands. But if you use GUIs for these tasks, things start to get messy quickly. I'll also tell you that once you figure out how to use these tools to download and scrape web pages, your world really opens up.

Reason seven: performing simple exploratory data analysis. Say I have a file with DNA sequences in it. How many sequences are there? Sure, you could find a special R or Python package that reads FASTA-formatted files, install the package, learn to use it, apply it to read in the file, and then count the number of sequences. That's a lot of work compared to using grep -c to look for the greater-than sign that starts each header (there's a sketch after reason eight). Although it isn't easier for everything, you can do a fair amount of data analysis using command line tools without the overhead of digging into a programming language like R or Python. I often find the command line tools most useful when there's a problem reading a file into R and I need to troubleshoot what's causing the formatting problem.

Reason eight: mass editing of gigantic files. If I encounter a problem reading a file into R, I need to figure out how to fix it. Often that problem is a formatting issue, and I can use sed to programmatically fix it in the file. We've seen this before, like when we needed to replace spaces with underscores in the sequence headers of our FASTA files. Perhaps we only want two or three columns of a large CSV file that has 20 columns; a simple cut command, which we saw in a previous episode, will do that task for us. I once inherited a project where someone had written a roughly 20-line Python script that did something like concatenating a bunch of files together. I didn't know Python, and neither did the original author. I was able to replace that script with a single one-line bash command.
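Here are sketches of the kinds of one-liners reasons seven and eight describe; all of the file names are hypothetical.

```bash
# count the sequences in a FASTA file by counting the header lines
grep -c "^>" sequences.fasta

# replace spaces with underscores, but only in the header lines
sed '/^>/ s/ /_/g' sequences.fasta > sequences_fixed.fasta

# pull two columns out of a 20-column CSV file
cut -d ',' -f 1,3 big_table.csv > small_table.csv

# concatenate a bunch of files -- the sort of one-liner that can replace
# a 20-line script
cat chunk_*.csv > combined.csv
```

Because each of these tools streams through the file rather than loading it into memory, they stay fast even on files too large to open comfortably anywhere else.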
Reason nine: having fun. There are a lot of great commands at the command line. Many are built into the default tools that come with Linux, but others you can install using your package management software, such as apt or Homebrew. There are serious ones, like bc, which behaves like a calculator, and gnuplot for generating plots. But then there's also fun stuff, like cal for showing you a calendar and fortune for getting a random quote. You can even pipe the output from fortune to the say command to have your computer talk to you, or to cowsay to have a cow say it to you. I forget why, but part of the testing software for the package we developed had the line say "Joe smells", in honor of my first PhD student. Whenever we ran the tests, we would hear "Joe smells." We still laugh about that. Joe smells.

Reason ten: you don't have a choice. Much of what I do is done on a high-performance computing cluster, and if you're doing an analysis with any amount of heavy lifting, yours will be too. It might be on your institution's high-performance computing cluster, or it might be on Amazon's. Regardless, you are unlikely to be able to use a graphical user interface in those environments. If you're like me, you'll also be forced to use a scheduler to get access to the compute nodes, and that will require using the command line to execute the scripts that run your analysis. Please don't tell me about moving files to and from your cluster so that you can use your nice GUIs and minimize your time on the cluster. I've seen that system break down so often that there's little chance those analyses are going to be reproducible. You definitely need to develop your command line skills. You don't have a choice.

Every tool has its strengths and weaknesses and contexts where it makes sense to use it. By learning tools, you're making an investment in your future self. If I were building a deck, I definitely wouldn't use a hammer to pound in screws. I probably could, but I shudder to think about what the outcome would be. That's a lot like cobbling together tools with those fancy graphical user interfaces and hoping the analysis will be reproducible. It might be, but eh. Instead, I might buy a corded electric drill and some nice drill bits to do the job. That's probably where you're at if you're watching this video: you have a good tool that does the job, but you still have this blasted cord slowing you down and tripping you up whenever you turn around. If you were building a big deck or could see other DIY projects in your future where you'd use a drill, you would probably invest in a nice cordless drill, perhaps an impact driver with a good battery. Similarly, as you learn more about data analysis, I'd encourage you to add more tools to your toolkit, being mindful of where it's appropriate to use them and of when you're really just trying to bang in a screw.

Thanks again for joining me for this week's episode of Code Club. Be sure to tell your friends about Code Club, like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next episode drops. Keep practicing, and we'll see you next time for another episode of Code Club.