I believe you can answer your own data analysis questions. Do you? If you do, stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you to grow in confidence to ask and answer questions about the world around us using data. I love starting new projects. Everything feels so simple, so obvious. Then weeks, months, years after starting, things have gotten complicated, confusing, and tedious. We're ready for it to be done already. With each new project, I tell myself this is going to be different. Well, today is the day we're going to start a new project. We have the goal of answering a question that is important to those of us interested in analyzing microbial communities. To what degree do inter- and intragenomic variation limit the interpretation of amplicon sequence variants, also called ASVs? Along the way, we'll learn different elements of what it takes to make an analysis reproducible and how to use a variety of tools that will help us out. We'll get started in this session with using the command line to set up our project directory. Even if you don't find the problem we're studying interesting, you'll hopefully find the approach generalizable to a variety of problems that do interest you. I'm a microbiologist, and one of the things I spend my days and my career doing really is sequencing a gene called the 16S ribosomal RNA gene. This is a gene that's found in all bacteria and archaea. There's a gene called the 18S ribosomal RNA gene that humans and other eukaryotes have in their genomes. This gene is very popular because it has a signal with it that tells us something about who the gene came from, and so it's useful for doing taxonomy of bacteria and other organisms. Over the past 30 years, this gene has been used extensively to better understand the types of bacteria that are in different communities.
If you've heard about the Human Microbiome Project or anything related to the microbiome, that's largely been driven by sequencing the 16S gene. Now we have a few different approaches that people have taken to analyzing these data. The first is to compare the sequence to a database. We have a database of sequences that come from perhaps genomes or cultured bacteria, and so we can take a sequence that we've perhaps never seen before and compare it to the database to figure out, to an approximation, who it's related to. This is what I call phylotyping. Other people maybe call it closed reference, but this is what I call phylotyping, where we're trying to figure out, so to speak, a name to put on that sequence. So the second approach that we've been using for about 20 years is to take our collection of sequences, and for a given sample these days, we might have five, 10, 20,000 sequences in a sample, and because we know the databases are limited, what we'll do with this approach is we'll look at the similarity of the sequences to each other within our own sample, and we'll bin the sequences together based on how similar they are to each other. This is what's called an operational taxonomic unit, or OTU, and generally we're happy if we have a collection of sequences that are not more than 3% different from each other, or you could say a collection of sequences that are more than 97% similar to each other. There's a long back history to this. People kind of think it relates to a species definition, but that's pretty wobbly. What's important to know is that it's an operational definition, and the other thing that's important to know is that we're assigning sequences to a bin, this OTU, based on how the sequences relate to each other, not on how they relate to a database.
The third approach, which is more recent, say in the last 10 years, and has really picked up steam in the last couple of years, is what's called amplicon sequence variants, or exact sequence variants, or oligotypes. They go by their abbreviations of ASV and ESV, or by oligotype, and the idea here is that we're binning sequences, again based on how similar they are to each other, but instead of having some room for variation within an OTU, these do not allow any variation. The name exact sequence variant means that the sequences within that bin are identical to each other. There's no variation. Now, that's a little bit overstated, and so they've kind of relaxed the definition, and so the algorithms people use generally allow for a couple of differences in the sequences. So these sequences that we generally use are about 250 nucleotides, up to about 1500 nucleotides, and so a couple of nucleotides is still going to be less than 1% different. So what we'll talk about are called amplicon sequence variants as kind of a catch-all. I think that's kind of the term that's in use these days, but know that if you've heard of terms like oligotype, amplicon sequence variant, or ASV, or exact sequence variant, ESV, they all kind of refer to the same idea. And so there's been a lot of bluster, I think, and kind of very strident positions put forth that ASVs should replace OTUs as the level of choice that we should use, right? And so keep in mind that phylotypes, OTUs, and ASVs, they're all kind of a form of an OTU, and so our language is a little bit sloppy, but an ASV is an OTU with a much more constrained level of variation. And so again, this kind of strident take that ASVs should replace OTUs, like I said, I feel like it's a little bit over the top. And I've written a few reviews of manuscripts people have posted to bioRxiv, and I've put these reviews online to make them public.
And there's a number of reasons that I worry about this kind of strident adoption of ASVs and getting rid of OTUs altogether. In my group, we use all different levels. And again, I could go into that, but that's not really what I'm interested in here. What I'm interested in here is a particular concern I have with ASVs. And that's that one of the challenges with using 16S ribosomal RNA genes is that an individual genome might have 21 copies. It might have one copy; things like E. coli have seven copies. And so it varies, again, across the phylogenetic tree. There is some conservation within each group. And so that would only cause problems perhaps for quantifying the relative abundance of different bacteria. But the added problem is that these different copies in a genome are not identical to each other. With E. coli's seven copies, if we look across the full length of the gene, there's six different sequences across those seven copies of the gene. And so when we talk about looking at amplicon sequence variants or exact sequence variants, so to speak, we run the risk of splitting E. coli into six different bins, six different OTUs. Is that good or bad? I'm not convinced that it's so good. Alternatively, there's also the risk that if we look across genomes, say, at different species of the same genus, there might be ASVs that are shared across species. And so the ASVs aren't as fine scale as we might think they are. So that's the question that I have, and that we're going to spend the next series of Code Club episodes working to answer. Now, we could pick any question. That's a question I work on professionally, this area of microbiome research, and it's one of the questions that I've had over the past few years as more of these methods come out, as we see these methods being applied, and as I hear people getting comments from reviewers saying OTUs are dead, long live ASVs.
So I think it's worth checking out this type of question. And as I said, as we go through, we're going to highlight different concepts and practices related to doing reproducible computational research. To answer this question, we're going to leverage the great resource that's the rrn database, the rrnDB, that's housed at the University of Michigan and actually maintained by my colleague Tom Schmidt. He and I share a lab space and our offices are pretty close to each other. This database is quite large. The last time I'm aware that somebody did some type of analysis like this, looking at inter- versus intragenomic variation of the 16S gene, was in 2013. And those researchers only had a couple thousand sequences. The rrnDB currently has about 16,000 sequences. About 15,000 of those are for bacteria distributed among about, I think, 4,500 different bacterial species. And there are also archaeal sequences from a few hundred archaeal species. So the database is much larger right now and will really allow us to look at these unique questions as they relate to ASVs. Even that 2013 study wasn't really thinking about ASVs when they set up their analysis. So for today, what we're going to talk about is something that might seem very boring, but is the area of project organization. I've had the problem where I open up a directory from myself or one of the people in my research group, and I just see this pile of a thousand files and there's no organization. So how do we organize our data to make it easier to navigate and work with our project data? One of the papers that's really motivated my thinking on reproducible research in general and project organization in particular is a commentary that was written in PLOS Computational Biology by William Stafford Noble. In this, he talks about a variety of factors. But again, what I want to focus on today is project organization.
I would really encourage you to go into the notes for today's Code Club episode and to dig up his paper and give it a read. It's really great. So you'll notice that he's a computational biologist and what he's recommending is based on his experiences and the types of projects he's worked on. So I've adapted his system of organization based on my experiences and the types of bioinformatics and microbiome projects that I work on. And so that's what we're going to talk about today. As you proceed with your career, you might find other ways of organizing data and organizing projects that are more effective for the type of work you're doing. And so you might learn from Noble, you might learn from Schloss to make your own improvements that again work for you. But there's a few principles that I hope you keep in mind as you do this and that we've tried to bake into how we organize a project. So the first is that anybody that looks at your project directory should have a pretty good sense of what's going on. They should know where things are. If I want to find the raw data, where would that be? If I want to find the code, where would that be? If I want to find any figures, where would those be? And so the answers to those types of questions should be obvious from looking at the directory structure. It's a form of documentation. Keep in mind that the person looking back at your directory structure to make sense of things might be you many months from now. It could also be your PI who's trying to figure things out long after you've left the lab. A few other principles that help you to get to this goal include things like making sure that all the files and data for your project live in one self-contained directory. Imagine giving me that directory for me to put on my computer. Would the project hold up? Would it be cohesive? If you're a phylogeneticist, you might think of this as being monophyletic.
Another thing to keep in mind is that the different types of information should be kept separate. So your data should be separate from your code. Your raw data should be separate from your processed data. Your references should also be separate from that. Next, we should also assume that any code that's going to be run on our project is run from the top of that directory. We might call that the project root or the project home. But any operation is going to be done from the root of your project, not from within some subdirectory. And the reason that's important is because if you run your analysis and you of course run into a bug, as everyone does, and that crashes out, well, how do you know where to restart from? So if you're moving directories as you do the analysis, then it gets kind of tedious to try to figure out where you are. Meanwhile, if you keep everything as if you're running it from the root of the project, then you never really have to worry about where you are. Another thing that I like to do for my own consistency and kind of my own style is that for the most part, any directories, any files in my project should be in lowercase and not have spaces. So I don't really care about how you capitalize things, but do not put spaces in your file names or in your directory names. They only cause problems for bioinformatics analysis. Now, if you need a space or feel like you need a space, instead of using a space, use an underscore or, if you have to, a hyphen. With these ideas in mind, let's take a look at a generic project that we'll use to form our own project directory. What I'm showing here is a generic organization structure for a project that we might do. You'll notice across the top that the directory that we're in is the project directory that we'll call lastname underscore project underscore journal underscore year. So lastname would be my last name or that of whoever's running the project.
So it might be Schloss in this case. Project would be like a one word or maybe two word blurb describing the project. So I might call this rrn analysis, all kind of concatenated together as one word. Journal, well, that's a little bit aspirational at this point; we're just starting. I hope this could be publishable, but who knows. So at this point, I might just put a couple of X's in there because I don't really know what journal I want it to go to. And then finally, the year would be the year that we're in. Again, the hope is that this does get published or whatever project you're working on gets published. And so if you've got a series of projects, we could look at the last name, the project, the journal and the year to easily find the project that relates to the paper. You could argue with me that perhaps the year should go first. But whatever; what's most important is to have some type of structure to how you name your projects. This again is something that I have found to be useful. Within this project directory, we need a directory called code. This is obviously where all our code is going to live. We'll need a directory called data, again, where data is going to live. Within that, I generally have maybe four or so different subdirectories. So one is going to contain my reference files. One's going to contain my raw sequence files. Perhaps one is going to contain the files generated by the mothur software package. And one might contain processed files or output files that are perhaps more consumable by someone that wants to get access to our data. Okay. We also have an exploratory directory. This might be a directory where I put notes related to different types of analyses I've done, kind of like you might use a laboratory research notebook. And finally, again, hoping that this might someday be published, a directory called submission, which is where I put everything related to writing a manuscript that I would ultimately submit to a journal.
And finally, there's two files at the root of this project, one being a license, which indicates how other people can interact with the data. And then the final one is a README document, where, again, we'll put notes about the status of the project. And we'll come back to all of these later. The README file will be very useful down the road, as we start to put more information in about the project, different dependencies, different versions of software we're using, and perhaps how we do different things. I could generate this structure using Finder on a Mac or using Windows Explorer on Windows. If you're using Linux, there's something very similar as well; I'm spacing on what it's actually called. But to help develop our skills, we'll work from the command line. What I'm going to go over today is how we're going to set up this structure from the command line for our own particular project. Again, being familiar with some of these command line tools will be very helpful in quickly manipulating files, quickly setting up analyses, and then down the road, running analyses from the command line. When you start your command line interface, you'll get an interface that looks a lot like this. The actual syntax that's here on your command line interface might be a little bit different. Mine has my user ID for my computer, the name of my computer, this tilde, and then the dollar sign. The dollar sign tells me that I'm at a prompt, and that the command line interface is ready for me to give it a command, ready for me to tell it to do something. What I can do from here is I can type ls and this will list out the different directories that I have within my current directory. You'll have different directories, but what you should notice in here are things like applications, desktop, documents, downloads, dropbox. I don't know why I have so many dropbox directories.
And again, if you're on a Mac versus Windows, you're going to have different directories, but you should get the gist of this. Also, if you look at your Finder or your Windows Explorer window, you should see content of this directory that look a lot like what you see in your command line interface. So for now, this graphical interface to look at the directory structure is a helpful tool, but hopefully we'll kind of develop our strengths with using the command line interface so that we don't have to rely so heavily on using this graphical interface. Okay, so ls lists out the contents of the directory. What I'd like to do then is to change directories that I'm in. So if I do cd documents, you'll see that at least on my computer the prompt changes from this tilde to documents. Now that tilde indicates the home directory. So if I type cd tilde at any time, I'll go right back to my home directory. But for now, I want to be in documents. Okay, so if I want to see what's in my documents directory, I can type ls. If I do the same operation using my Finder window, I go to my home directory, documents, I can see the different directories and files that are in my documents directory. And I can see that those are here as well. One other thing to point out is I can use the command pwd. And this says print the working directory pwd print working directory. And this will then print out the directory I'm currently in. You'll notice that it has forward slash users forward slash pschloss forward slash documents. This tells me where I am right I'm in my documents directory off of my home and my home was users pschloss. 
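The three navigation commands covered so far, ls, cd, and pwd, can be tried out together. This is a minimal sketch, not what was typed on screen: it assumes a scratch directory (the hypothetical variable fake_home) standing in for the real home directory so that it can be run anywhere without touching your own files.

```shell
# Scratch directory standing in for a home directory. This is an assumption
# to keep the sketch safe to run; in the episode this is the real home (~).
fake_home=$(mktemp -d)
mkdir "$fake_home/desktop" "$fake_home/documents" "$fake_home/downloads"

cd "$fake_home"   # start in the stand-in home directory
ls                # list the contents of the current directory
cd documents      # change into the documents directory
pwd               # print the absolute path of the working directory
```

On your own machine, cd ~ at any point takes you back to your real home directory.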
The other thing to note is that this path, as it's called, starts with a forward slash. That initial forward slash is the root of my directory structure, the root of my computer. You can think of your computer as being a lot like a tree where it's got a root at the very bottom where there's one directory, the root, and then within the root there's multiple directories, and as you kind of go out, the tree grows. Within my documents directory, I would next like to create the directory that my project is going to live in. To create directories, I can use the mkdir function. It's actually a program. Anything we type here is a program, but they seem so simple that we just want to call them functions, or I do at least. So I'm going to make a directory that I'll call schloss underscore rrn analysis underscore, I don't know what journal, underscore 2020. Okay, so this is going to be the name of my project directory. And so it exists. If I look at my Finder window in the documents directory, I now see that I have this schloss rrn analysis project directory. To move into this directory, we're going to change directories. And so we'll use the program cd. So cd schloss. And this can be kind of long. So if I hit the tab key, it will automatically complete the rest of the directory name for me. So again, I could do cd schloss, and then hit tab and it'll complete it for you. You could also type it all out. But as you'll find, this is very sensitive to typos. And it's very easy to introduce typos when you're doing any type of data analysis. So tab is your friend. So now we're in our project directory. And if I want to see what's there, again, I can type ls. It's empty. I'd like to start filling this with different directories. So I'm going to go ahead and use mkdir to create these directories. Now I could do mkdir code, enter, and then all the others, or I could put all the directories on one line.
And so I'm not going to do all of them this way because I want to show you how to do something else down the road here in today's episode. There are four directories that I want to have at the top of my project. I could create each of those in separate mkdir function calls, or I could put them all on one line, and so that's what I'll do here. So I'll do code, exploratory, submission, and data. And now if I do ls, I see my four directories here in my project. I can also create files from the command line. These won't have anything within them, but this will at least create an empty file that holds a place in my project directory. And so the function that we'll use to do that is called touch. So I just think of it as touching the file and touching it into existence, kind of like a fairy godmother or something, right? So the two files I want in the project root are license.md and readme.md. Okay. So if I type ls now, I see my four directories, and I see my two text files, license.md and readme.md. Now, when you look at this, it might be a little bit hard to know what's a file and what's a directory. So if you want to make that clearer, we can give ls a flag, as it's called, hyphen capital F. And what that does is put a forward slash at the end of our directories, but it doesn't change our files. And so it's much easier in this output to see which things in our project directory are directories and which are simple files. As I mentioned earlier, pwd gives us the path to where we currently are, to our working directory. It prints our working directory, pwd. And so this is an absolute path, because it goes all the way back to the root of our computer.
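The steps just described can be sketched as one short session. A few things here are assumptions rather than what was shown on screen: the scratch parent directory (the episode works inside the documents directory), the exact spelling of the project directory name, which is only described verbally, and the lowercase file names, which follow the transcript.

```shell
# Scratch parent directory. This is an assumption so the sketch can be run
# anywhere; the episode creates the project inside documents.
parent=$(mktemp -d)
cd "$parent"

# Project directory following lastname_project_journal_year. The spelling is
# assumed from the spoken description; XXXX is a placeholder for the journal.
mkdir schloss_rrnanalysis_XXXX_2020
cd schloss_rrnanalysis_XXXX_2020

# Several directories can be created with a single mkdir call.
mkdir code exploratory submission data

# touch creates empty files when they do not already exist.
touch license.md readme.md

# -F appends a trailing / to directory names so they stand out from files.
ls -F
```

After typing the first few letters of the project name, the tab key will complete the rest, which avoids typos in a name this long.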
Now, I've been calling this Schloss directory our project root. And so what we want to do is do everything from within the project root and relative to that project root. So the path that's relative to the project root is going to be called a relative path. And what we really want to be doing is working with relative paths. We want to describe everything relative to the project root. And so as an example of that, to make the raw data directory, I could change directories into data and then make raw, but I don't want to do that. I want to stay at my project root. Again, because I don't want to change directories, because I'm lazy, maybe, but more so to kind of emphasize how we work with relative paths. So we do mkdir data forward slash raw. And if we then do ls, we see ls generates what we've been seeing. But if we do ls data forward slash, we see that it now has a raw directory within that. I guess I can make that more clear using the dash F, right? So I also have other directories that I want to put in data. So I could do mkdir data forward slash references, data forward slash mothur, data forward slash processed. And again, if I do ls dash F data, I see the directories that are within data. And this is all done without leaving my project root directory. If we wanted to get a sense of the overall structure of our project, we could use ls dash R, which is short for recursive. And this will then output the contents of all the directories. So we see at the top of the project, the project root, we have our two files and our four directories. Code has nothing in it. Data has four directories in it. Each of those four files, I'm sorry, each of those four directories is also empty. And then exploratory and submission are also empty. Something you'll notice is that before code and data and exploratory and submission, we have a period forward slash.
So that period forward slash indicates that we're in the project root, or doing it relative to the directory we're currently in, if that makes sense. So again, we could do ls, or we could do ls period forward slash, and we get the same output. And so that's what the period forward slash means in that output using the dash R flag. We can also combine flags; we could say ls dash FR. And this will add the forward slash to the end of our directory names, as well as give us this recursive output of what our directory structure looks like. So the final thing I want to tell you about, and I worry about giving you this power, is the power to delete things, right? And so we have the ability to delete directories as well as files. So again, with great power comes great responsibility. I've probably even done this myself, and probably everybody has run into this problem, where you might accidentally delete an entire project, or your thesis, or very important documents. When you delete something in the command line, it doesn't go to the trash; it is gone. And so there's really no coming back from using these commands. So you want to be really careful. In the next Code Club episode, we'll talk about using version control. And one of the benefits of tracking your project with version control is that if you do it right, you'll always have a backup of where you currently are. So if I want to remove a directory, I can do rmdir, and perhaps I'll do data forward slash raw. So I'm not going to have any raw data. I don't know what that would be. And so what that then does is remove data forward slash raw. And if I do ls -FR, I see now that my data directory no longer has that raw directory. But I want that back. So I'll do mkdir data forward slash raw. And it's back. Okay. So rmdir will delete a directory. Now the thing that perhaps is a saving grace is that if you use rmdir the way I've described, the directory has to be empty for you to remove it.
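Working with relative paths from the project root, listing recursively, and the safety check built into rmdir can be sketched together. This is an illustration under assumptions: it uses a scratch directory as the project root, and placeholder.txt is a hypothetical file added only to show rmdir refusing to remove a directory that is not empty.

```shell
# Scratch project root. This is an assumption so the sketch can be run
# anywhere; in the episode the project directory already exists.
project_root=$(mktemp -d)
cd "$project_root"
mkdir data

# Create subdirectories via relative paths, without leaving the project root.
mkdir data/raw data/references data/mothur data/processed

ls -F data   # contents of data, with directories marked by a trailing /
ls -FR       # recursive listing of the whole project; note the leading ./

# rmdir removes an empty directory, and mkdir brings it back.
rmdir data/raw
mkdir data/raw

# rmdir refuses to remove a directory that still has something in it.
touch data/raw/placeholder.txt   # hypothetical file, for illustration only
if rmdir data/raw 2>/dev/null; then
  echo "data/raw was removed"
else
  echo "rmdir refused: data/raw is not empty"
fi
```

Because every path is written relative to the project root, the same commands work no matter where on the computer the project directory lives.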
So to empty a directory, to remove a file, we can use rm, short for remove. And I could say, let's remove the license. And again, if I do ls -F, I see that I no longer have the license file in my project root. Now, of course, I want that there. So I'll do touch license dot md. And again, in a future Code Club episode, we'll fill in the contents of that license. Okay. So again, if I do ls -FR, I can see where I'm currently at. And I see that even though I deleted them, I brought them back and we're good to go. All right. So now I've got three exercises as usual for you to engage with. Go ahead and pause the video. And what I'd like you to do is work through these exercises. And once you've finished, go ahead and release the pause button. And I'll come back and show you how I did these different steps. So the first exercise is to go ahead and quit out of the command line interface that you've been working in. You can do this by typing exit at the prompt. And that should close your window out. Restart your command line interface and go back to your project root. Prove to yourself that you can get back to where you need to be to be doing all this great analysis that we have in store. The second is I'm asking you to put a readme.md file in every subdirectory of the project. Use ls with the right flags to confirm that you got all of the readme files where they need to go, and do not use the cd function. For the third, I'd like you to think ahead about what questions you might want to answer with the data we're going to download from the rrnDB. With this question of inter- and intragenomic variation, what would you want to know? What would you want to ask, specifically as it relates to our thinking about the use of amplicon sequence variants?
For the first exercise, I want to be able to get back to my project root directory. I've reopened the terminal. I'm at my home directory. I know it's my home directory because I have that tilde sign. It's ready for me to give it instructions because I'm at the dollar sign prompt. I remember that it was in my documents directory. I can do cd documents. Perhaps I forget what it was called. I can type ls and look through all my different directories here, and I see that, sure enough, there's a directory called schloss rrn analysis. I'll do cd schloss. Again, if I hit tab after just starting to type the first few letters, it will auto-complete the rest of the directory name for me. Hit enter and ls and voila, I'm right back to where I want to be. For the second question, I asked you all to put a readme.md file in all of the directories of the project. So if I do ls -FR to get a full sense of the directory structure, I can see that basically none of these directories have a readme file. So what I want to do is touch code forward slash readme.md. And I'm going to repeat this; what I like to do is hit the up arrow to get back to my previous command and edit it for directories like raw and references. I can also do touch exploratory forward slash readme.md and touch submission forward slash readme.md. So again, now if I do ls -FR, I should see a readme file in each of my directories. And so again, I could have written this as one touch command where I just space out the different file names, the different directories where I'm going to put a readme. But this works also to kind of keep things separated. But again, we've achieved the goal of putting our readme file, and it's an empty readme file, in each of the directories of our project. I guess I don't have a readme file in data. So I'm going to go ahead and do that. So touch data forward slash readme.md. And so now if I do ls -FR, yeah, so I do have a readme file here in my data directory.
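The single-command alternative mentioned for the second exercise might look like the following. It is a sketch under assumptions: the project skeleton is rebuilt in a scratch directory so that the block runs on its own, and the lowercase file names follow the transcript.

```shell
# Rebuild the skeleton in a scratch directory. This is an assumption so the
# sketch is self-contained; in the episode the project already exists.
project_root=$(mktemp -d)
cd "$project_root"
mkdir code data exploratory submission
mkdir data/raw data/references data/mothur data/processed
touch license.md readme.md

# One touch call, many relative paths, and no cd.
touch code/readme.md data/readme.md \
      data/raw/readme.md data/references/readme.md \
      data/mothur/readme.md data/processed/readme.md \
      exploratory/readme.md submission/readme.md

# Confirm that every directory now holds a readme file.
ls -FR
```

Listing the paths explicitly makes it easy to spot an omission, such as the data directory that was initially missed in the walkthrough above.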
For the third exercise, I asked you to think ahead about the types of questions we might want to answer with the data from the rrnDB. And so a general question I might ask is that if we assume that the data in the rrnDB represents some fictitious microbial community, how many different taxa would I see if I looked at the kingdom, phylum, class, order, family, genus, species, and ASV levels? How many are there, and how do ASVs correlate, if you will, with those different taxonomic levels? Are there many more ASVs than strains or species? Or are they comparable? The second question that I might want to ask is whether or not ASVs are specific to a strain. So if I see a set of ASVs, are they particular to a strain of bacteria? Or are they particular to a species of bacteria? Or are they particular to a genus of bacteria or higher up, right? So kind of getting at the resolution and asking what is the resolution of an ASV, and where do we run into problems of lumping versus splitting one genome into multiple ASVs or one species into multiple ASVs? And then also for both of these questions, I might ask how does the answer change if I'm looking at a full-length 16S sequence or a subregion of the 16S gene, like we typically do when we're doing Illumina MiSeq sequencing? Similarly, I might ask, well, what OTU threshold would I use if I want to limit splitting of a strain or species into different OTUs? And we can again look at different taxonomic levels there. And I might also ask, because we've got archaeal data here, how the copy number varies between archaea and bacteria. Do archaea have as many copies as bacteria do? Or do they have far fewer? And perhaps we don't need to worry about this so much if we're thinking about archaea over bacteria. It may not feel like it, but we've actually done quite a bit today. We introduced a new project.
We talked about project organization and the different principles you want to keep in mind as you go through the kind of keeping track of your project and keeping it organized. Organization is not something we do merely on the first day, but it's something we're going to be developing as we go through our project. Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. I'd love to see what you did. Please feel free to drop a line in the comments below to tell us how you organize your projects and what you like about your approach. Also, please let me know what types of data analysis questions you have and I'll do my best to answer them in a future Code Club. Be sure to tell your friends about Code Club and to like this video, subscribe to the channel, and click on the bell so you know when the next Code Club video drops. Keep practicing and we'll see you next time for another episode of Code Club.