I believe you can answer your own data analysis questions. Do you? You should. Stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you grow in confidence to ask and answer questions about the world around us using data.

Former US President and Army General Dwight Eisenhower once said, "Plans are worthless, but planning is everything." That is especially true for research projects. If we don't have a plan for getting somewhere, then we're destined to get nowhere. I hate working in the midst of chaos, can you tell? We've spent the last two episodes getting our project organized, and we haven't touched any data or code yet. If you have ever taken over a project from a former member of your research group, then you'll know the frustration of trying to find things and make sense of what they did. Your first step would be to get the project organized, right?

Well, today we'll talk a bit more about organization and how we can use issues in GitHub to give ourselves a to-do list and use branches in Git to systematically work through those issues. This is loosely similar to what software engineers call Git flow or GitHub flow. We'll use GitHub flow to finally download some of the data that we'll be using to determine to what degree inter- and intra-genomic variation limit the ability to interpret amplicon sequence variants, also called ASVs, as the biologically coherent entities that the proponents of ASVs claim them to be. As we go along, we'll see the command line and Git commands that we saw in the last two episodes, so we can continue to practice those skills while learning new Git tools.

Please take the time to watch today's episode. Follow along at your computer and attempt the exercises. Don't worry if you're not sure how to solve the exercises. At the end, I'll provide solutions. In the notes below is a link to a blog post that accompanies this video.
That blog post includes reference notes and links that are intended to be a supplement to this video.

[inaudible] XXXXX underscore 2020, where the XXXXX, as I recall, is the future journal title. Who knows where this is going to go. I'd like to think that we'd get it published at some point, but we'll see. So hopefully this looks familiar from our last Code Club episode, where, if you went through all the exercises, we wound up getting everything that was in our project directory pushed up to GitHub. What we're going to work on today is something that's loosely called GitHub flow. Git flow is another name for it, and I think that's kind of a distinction without a difference. Again, these are ideas that are stolen from software engineering and that we're adapting for our purposes with data analysis.

If you look at the top of our repository here on GitHub, you'll see that the second tab is called Issues. If I click on Issues, this opens a window that contains what's called an issue tracker. Issue trackers are used on projects to keep track of different things that come up. Somebody might write in an issue and say, it'd be really cool if you added this feature, or, I'm running into this bug when I do this, how can we fix it? We're going to use issues as a to-do list of things that we want to do as we go along in our project. After we get some issues populated into the issue tracker, we'll go back to the command line and see how we can systematically work through them with this process of GitHub flow, using Git as well as some of the tools we've already talked about so far in working on this project.

Okay. So we don't have any issues. The first thing we want to do to open an issue is to click on this green button for New issue. There's a window here for a title and for us to write in a comment. And so these kind of create a series of threads.
Each issue becomes a thread, so I can put in my initial ideas for an issue. My collaborator might comment on it. I might come back and comment on the thread, adding different bits of information as we go along. The first issue that I want to add in here is get files from the rrnDB. Okay, so that's going to be the title of my issue.

Now, I'm going to go back and Google. Well, I have it here in my history because we were looking at it in a previous episode. I will come back here and I will say the rrnDB has downloads. If I go over to downloads, there's a tab for download. Maybe this would actually be the better link to give. So I'm going to go ahead and put in: the rrnDB has downloads that we can use for our project. I don't know what all of the files are, but let's get all of them and put them into `data/references`. These backticks that I'm using are a special way to format this as code, or as a directory in this case.

There are these four files, and I'm going to go ahead and copy the links to all four of them here. The RDP... like I said, I'm not totally sure what those two are. I think it's some information about the taxonomy. It might be the number of copies of the ribosomal RNA operon that each genome has. This third file is the FASTA file. And then this fourth file, I'm not really sure what that is. I think it might be kind of a database of sorts. So these are the files. We're going to go ahead and put them, like I said, into data/references.

I can click on this Preview tab, and this is what it looks like. This is some formatting using a language called Markdown. I don't know if it's exactly a language. It's a way of marking up text in a pretty simple way. Another thing that we can do is turn this word, downloads, into a link by putting it in open and close square brackets, followed by the URL in round parentheses. We can preview it here and see what it looks like.
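
As a rough sketch of the formatting described here, the comment could be drafted in a scratch file before pasting it into GitHub. The file name `issue_1.md` and the `example.org` address are placeholders for illustration, not the real rrnDB link; the shell wrapper is only there so the text can be saved and inspected.

```shell
# Draft the text of the first issue in a scratch file (issue_1.md is a hypothetical name).
# The URL is a placeholder; paste the real rrnDB downloads link in its place.
cat > issue_1.md <<'EOF'
The rrnDB has [downloads](https://example.org/downloads) that we can use
for our project. Let's get all of them and put them into `data/references`.
EOF

# Look over the draft before pasting it into the comment box on GitHub
cat issue_1.md
```

The square brackets hold the link text, the parentheses hold the address, and the backticks mark `data/references` as code.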
We can also make a list with a star at the beginning of each line, and if we preview that, you see it's a bulleted list. I think that's enough information for right now, so I'm going to go ahead and save this as a new issue. Now we have a first issue, get files from rrnDB, number one. If I click on Issues, I see I've got a table that will be populated by all the other issues I generate.

The next issue will be to get the SILVA reference files. We're going to need the SILVA reference alignment to align our sequences from the rrnDB so that we can make sure we're looking at a consistent region within the gene. The 16S gene has nine variable regions. We could certainly look over the full length of the gene, but unless you're doing something pretty special, like some PacBio sequencing approaches, which don't generate a ton of data, you're probably not looking at the full-length thing. You're probably looking at something like V4, V3-V4, or V4-V5, which are different variable regions within the 16S gene, and those are about 250 nucleotides. If we align the different operons from these genomes, we can more easily extract those variable regions and then look at how redundant or how unique they are within and between different genomes.

Okay. We can get this from the mothur wiki. Let me go to mothur.org, and then, let's see, I think it's going to be forward slash wiki, forward slash SILVA reference files. What we want is this recreated seed database from release 138. So I'm going to go ahead and put another URL in here for the mothur wiki. We would like the recreated seed database from release 138. And that's going to be, let's get this link. Not that one. It's going to be this one. So I'll copy that link, put it in here in my issue tracker, and look at it just to make sure it looks good. Okay.
Where we want to put this is data/references, and I'm going to submit the new issue. You know, now that I think about it, I think I may have said I wanted to put the rrnDB files into data/references as well. That's not where I want them. I want those to go into data/raw. So I'll say: on second thought, let's put these files into `data/raw`. They're not really raw files. We could have created a directory like data/rrnDB, but I think data/raw would be much better than data/references. These are going to be the raw files that we're going to work with. We're going to do some manipulation on them and then do other things with them, right? So I think data/raw is probably a good directory to put them in. I'll go ahead and add that as a comment. Now this thread, as you can see, has two comments on it. They're both from me, so these are notes to myself. Having these links here is really nice, because then I don't have to go back through all the different websites. Say I come back to this in a couple of weeks. I'd have to go back and find those other websites, but I've got the links right here, which makes life much easier.

Okay, let's create another issue. I want to install mothur into my project. So: we want to get mothur for aligning sequences and perhaps other functions throughout the project. We should install it into code, sorry, code/mothur. I want it to be a directory within code. Also, we don't want to track this with version control, so we should ignore it. There's really no reason to track it, because it's something that we're installing. We're not going to be manipulating it or anything. Because of this, though, we should leave installation instructions in code/README.md, and we should indicate the version of mothur in a list of dependencies in the main README file.
So we can get version 1.44.1 at, or let's just say, the latest version of mothur is 1.44.1. We'll go ahead and make this a link also. mothur is on GitHub, so we'll go to github.com/mothur/mothur. You'll see that there are releases down here, on the bottom right of my window at least. If I click on this latest one, this is the release we want. And you'll see down here that there are different versions of mothur that can be installed. So I'm going to go ahead and copy a link to this page and put that in here in my spot for the link. I'll preview it, and that looks good, so we'll submit the new issue.

Great. So now we have three issues. I'm going to create a fourth issue. Don't worry, we won't go through all of these issues today. The fourth issue is going to be to align the rrnDB FASTA sequence file to the SILVA seed reference. To do this, we'll use mothur's align.seqs function to align the sequences to the SILVA reference alignment. Okay, very good. I think that's it for now. We might come back and add other things to these issues later on.

Very good. If we look at our issues, we now see that we have four issues in our issue tracker. Again, these are very helpful, because they help me to organize what's going on. If I walk away from this project for a few weeks, I can come back and see what the issues were. I can remind myself of the conversations that we had. The first issue that I'm going to work on is issue number one. Now, we wouldn't necessarily always go in numerical order, but it's the beginning of the project and I kind of have a good idea of where things are going, so I want to start with issue one. Again, what we want to do is download these four files, decompress them, and put them into data/raw. Okay. And so this is issue one.
Again, what we're in the midst of doing is something called GitHub flow. I'm in my project working directory. I can do git status and see that I'm on branch master. Everything is up to date with the GitHub-hosted version of the repository. Nothing to commit, the working tree is clean. We're good to go for starting our day.

To do GitHub flow, the first thing we're going to want to do is create a branch. You can think of the master branch that we're on as the big trunk of the project, and the things that bifurcate off of that are different branches, right? And I guess most trees don't do this, but eventually that branch will feed back into the main trunk of the tree. We can look at the branches that we already have by saying git branch. We see that there's only one branch, it's master, it's green in this case, and there's a star which tells us we're on the master branch. I can create another branch, also with the git branch command, by doing git branch, and I'll call it issue_1. Now I can do git branch, and I see that I have two branches, but I'm still on the master branch, right? So now what I can do is git checkout issue_1. Okay, so I'm going to check out the issue_1 branch. It tells me I switched to branch issue_1. I can confirm this by doing git branch. I can also do git status and see that both of these tell me I'm on the branch issue_1. I like to name my branches for the issue I'm working on. That way it's easy for me to go back and forth to the issue tracker.

What we want to do, again, is to get those files from the rrnDB. Now that we're on this branch for issue one, we want to do all the work for issue one. We're going to commit those changes, and then we're going to merge them back into the master branch. We'll see how this is done. The next thing we need to do is go back to the rrnDB, and I need to download these four files to my desktop. So I'm going to go ahead and click on these.
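
The branch commands from this part of the episode can be tried safely in a throwaway repository. This is a minimal sketch, assuming Git is installed; the temporary directory, the placeholder identity, and the one-line README are stand-ins for the real project.

```shell
# Build a scratch repository so the real project is not touched
demo=$(mktemp -d)
cd "$demo"
git init -q
git symbolic-ref HEAD refs/heads/master    # name the first branch master, as in the episode
git config user.name "Demo User"           # placeholder identity, for this sandbox only
git config user.email "demo@example.com"
echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"

git branch               # list the branches; the star marks the current one
git branch issue_1       # create the branch, but stay on master
git checkout issue_1     # switch to the new branch
git branch               # the star has moved to issue_1
git status               # also reports the branch we are on
```

Naming the branch after the issue number keeps the link between the issue tracker and the local work obvious.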
My browser will automatically download these for me. It's also trying to open these for me. It's very kind that way. That FASTA file, it'll download. And this TSV, it'll download. These are stored on the website as zipped files. It'll download them, and at least on a Mac it'll automatically decompress them. On a Windows computer you might need to unzip them yourself. If I look at my desktop here in my Finder, I see these four files called rrnDB. What I want to do is move those into my project directory. So I can drag these over to Documents, then to my Schloss rrn project directory there, and then data, and this is going to go into raw. I can double-check that everything's where it's supposed to be by doing ls data/raw, and I see that I now have these four rrnDB files there. That's version 5.6, I see.

What I will do now is update my README file. I will say nano data/raw/README.md, and what I'm going to put in here is: obtained files from the rrnDB located at, and then I'm going to come back to my downloads page for the link. These are files from version 5.6, released in 2019. Again, this is just a little bit of documentation, so that when I come back later and wonder, you know, where did these files come from, it's easy to figure that out. I'll save it with Ctrl-O, hit Enter to save it, and Ctrl-X to bop out.

If I do git status, I see now that I have modified my data/raw README file, but I also have these four untracked files. These four files are quite large. If I look at them in my Finder window, I see that at least one of them, the FASTA file, is 129 megabytes. That's larger than what you can actually store in a GitHub-hosted repository. They limit you to 100 megabytes per file and recommend keeping the whole repository under about one gigabyte. So that's too big. What we're going to do is tell Git to ignore the files in data/raw, and we can do that using a file called .gitignore.
We mentioned this in the last episode: ls -a shows you the hidden files, and so we see we have a directory called .git. This .DS_Store is something that the Mac throws in, and we'll want to ignore that as well, although it seems that Git isn't noticing it anyway. So we'll do nano, space, .gitignore. The first character of the file name is a period, and then it's gitignore. In this directory, I'm sorry, in this file, we tell Git which files to ignore. So I'm going to say data/raw. And you know what, actually, I'm just going to say data. I want to ignore everything that goes into data, because these files are going to be big. There's a way to make exceptions, or to force Git to track certain files like the README files, which we'll want to update. But for now, we're going to ignore everything in the data directory, so we don't have to worry about accidentally committing really large files to our repository. So Ctrl-O, Ctrl-X.

If I do git status, now I see it's no longer saying that those files in data/raw are untracked, but it also tells us we have an untracked file called .gitignore. So I'm going to go ahead and add .gitignore to my repository, as well as the changes to the data/raw README. I can do git add .gitignore. I see that it's now a new file that's staged to be committed. I'm going to go ahead and stage the changes to my data/raw README file. And it says the following paths are ignored by one of your .gitignore files: data. Use -f if you really want to add them. I really do want to change this README file. I really want to add the changes there. So I'll do git add -f data/raw/README.md. No more warnings or error messages. Git status. I now see that I've staged the new .gitignore file as well as the changes to the data/raw README. And I can do git commit -m, and I can then say download rrnDB files.
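
Here is a minimal sketch of the ignore-then-force-add step in a scratch repository. The file `big_file.fasta` is a tiny hypothetical stand-in for the large rrnDB downloads, and the identity settings are placeholders.

```shell
# Scratch repository with a tracked README inside data/raw
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"           # placeholder identity, for this sandbox only
git config user.email "demo@example.com"
mkdir -p data/raw
echo "Raw data notes" > data/raw/README.md
git add data/raw/README.md
git commit -q -m "Initial commit"

# Stand-in for a large download, plus an edit to the README
echo ">seq1" > data/raw/big_file.fasta
echo "Obtained files from the rrnDB" >> data/raw/README.md

echo "data" > .gitignore          # ignore everything under data/
git status --short                # big_file.fasta is not reported as untracked

git add .gitignore
git add -f data/raw/README.md     # -f forces the add past the ignore rule
git status --short                # .gitignore and the README are staged
```

Ignoring the whole `data` directory trades a little convenience (the `-f` on README updates) for protection against committing very large files by accident.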
Now, I can do something kind of clever here. There's a really nice tie-in between Git and GitHub: if GitHub sees a message in one of my commits that says something like closes #1, then it will automatically close that issue for me. So let's do that. We can add closes #1. And so we see our working tree is clean. On branch issue_1, we're in good shape.

Now what we want to do is merge this change back into our master branch. We previously used git checkout issue_1. Well, now we're going to do git checkout master. And so now I'm switched to branch master. If I do git branch, I can see my two branches. I can now merge issue_1 into master using git merge. So I'll do git merge issue_1. We see that it brings in the .gitignore file as well as the changes to my data/raw README file. And so now this is in my master branch. I could do nano data/raw/README.md. I'm on the master branch and it's in here. Very good. So let's push this up to GitHub, up to our repository. We can then do git push. It pushes it up. I will now look at my issue tracker for issue one, and I see already that it says it's been closed.

So this has taken us through the process of Git flow. We have an issue. We claim the issue by creating a branch in our local repository. Again, my preference is to name the branch after the issue I'm working on. I can use the issue tracker to record the different bits of information I want to have. I then move to my branch by using git checkout issue_1, or whatever the issue is. I make my changes to the code in that branch, and then I commit the changes on that branch. I then go back to my master branch and merge the issue branch into master. I can then push that up to GitHub, and we're in good shape, closing out the issue. Now, this might seem like a lot of extra work, and in some cases it might be a little bit of overkill.
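
The full loop just summarized (branch, commit with a closing keyword, merge, push) can be sketched end to end in a sandbox. This assumes Git is installed; a local bare repository named `remote.git` stands in for the GitHub repository, so nothing here actually closes an issue.

```shell
work=$(mktemp -d)
cd "$work"
git init -q --bare remote.git              # hypothetical stand-in for the GitHub repository
git init -q project
cd project
git symbolic-ref HEAD refs/heads/master    # name the first branch master, as in the episode
git config user.name "Demo User"           # placeholder identity, for this sandbox only
git config user.email "demo@example.com"
git remote add origin "$work/remote.git"
echo "# demo project" > README.md
git add README.md
git commit -q -m "Initial commit"
git push -q -u origin master

# Claim the issue by working on a branch named after it
git branch issue_1
git checkout -q issue_1
echo "Obtained files from the rrnDB" >> README.md
git add README.md
# On GitHub, "closes #1" closes issue 1 once the commit reaches the default branch
git commit -q -m "Download rrnDB files, closes #1"

# Fold the finished work back into master and publish it
git checkout -q master
git merge issue_1
git push -q
```

After the push, the local `master` and the remote `master` point at the same commit, which is the state in which GitHub would act on the closing keyword.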
But the nice thing about it is that if I keep these branches with my repository, I can go back and see what happened on each of those different branches. Some of the branches I might decide I don't want to fold into master, because it was a pathway of analysis that just wasn't worth pursuing. Alternatively, I might be working on one issue and decide, well, I need to put that to the side for right now so I can go work on this other issue. So I can work on multiple things at the same time without screwing up my master branch. That's one of the great benefits of doing this.

As always, I have a few exercises for you to work on now to practice the skills that we just went over in this episode of Code Club. The first exercise asks you to create a new issue. As we go through, it'd be nice to keep track of different articles, blog posts, tweets, or whatever resources describe the use of oligotypes, exact sequence variants, or amplicon sequence variants, so that if we go to write this up or to do some type of project report, we'll have all that information together. So I'd like to have an issue where we're keeping a thread of different resources. In the second exercise, I want you to resolve issue two using GitHub flow. And in exercise three, I want you to use GitHub flow to resolve issue three. So go ahead and pause the video now, work through these three exercises, and when you're done, come back and I'll share with you how I would go about doing them.

I hope you enjoyed working through these exercises and that they really helped you to learn the material better. In the first exercise, I asked you to start putting together a thread in an issue that will accumulate different reference materials related to the topics of oligotypes, amplicon sequence variants, and exact sequence variants. So I'm here in our issue tracker for the repository. I'll go ahead and click New issue, and I'll say generate resource list related to ASVs.
And so: need to create a bibliography of resources related to oligotypes, ASVs, ESVs, and other related topics. I'll get this started. In the previous Code Club episode, I had a number of links in here, so what I'm going to do is go ahead and grab a bunch of these links. I'll copy this link. This is Sue Huse talking about SLP. I'll get this other one from Murat Eren talking about oligotyping. Let's see. There's this article from Callahan that really strongly argued for ESVs or ASVs. And Noah Fierer has a couple of articles, blog posts, that I'll include. So this one I'll call Fierer heterogeneity. The next one is Fierer as well, and this is about lumping and splitting, so I'll just call it lumping, splitting, and so forth. And I think there was one more by Sun, if I recall, from 2013, that also looked at heterogeneity. So again, this is the start of a list that we can accumulate as we go through the project and as we come across different resources. I'll go ahead and submit that issue, and we can add to this as we go through. Great. So that's exercise one.

The next issue was to get the SILVA reference files. Let me look at this. This is the link to the file that we want to get, and we're going to go ahead and put this in data/references. I'll go ahead and download this. If I come to my desktop, I see it downloaded as silva.seed_v138.tar. It's not compressed anymore, but it's still tarred together, bound together. So I can open that up. Looking at the contents here, I'm not going to take the README with me. I'll go ahead and copy these two files, silva.seed_v138.align and silva.seed_v138.tax, and I'm going to take them into my Documents, into my project directory, into data, and then into references. Okay. And so now if I look in references, I see my silva.seed_v138.align and silva.seed_v138.tax.
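
The episode unpacks the archive by double-clicking in Finder. A command-line equivalent might look like the sketch below, assuming the archive and member names heard in the video (`silva.seed_v138.tar`, `silva.seed_v138.align`, `silva.seed_v138.tax`). To keep the example runnable anywhere, it first builds a tiny stand-in archive with placeholder contents rather than downloading the real one.

```shell
work=$(mktemp -d)
cd "$work"

# Build a stand-in archive; the real tarball comes from the mothur wiki
mkdir stage
echo "placeholder alignment" > stage/silva.seed_v138.align
echo "placeholder taxonomy" > stage/silva.seed_v138.tax
echo "placeholder readme" > stage/README.md
tar -cf silva.seed_v138.tar -C stage silva.seed_v138.align silva.seed_v138.tax README.md

mkdir -p data/references
tar -tf silva.seed_v138.tar       # list the contents before unpacking
# Extract only the two files we want, leaving the archive's README behind
tar -xf silva.seed_v138.tar -C data/references silva.seed_v138.align silva.seed_v138.tax
ls data/references
```

Naming the members on the `tar -xf` line is what keeps the archive's own README from overwriting the project's `data/references` README.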
What I'd like to do now is make sure they're there in my repository. So we'll do data/references... actually, I'm not ready to do that. I need to create my branch first. So I will do git branch, just to make sure I'm on master. Then I'll do git branch issue_2. I will then do git checkout issue_2. So I've switched to branch issue_2. Again, I can double-check that I'm on issue_2. If I do ls data/references, I see that I've got my two SILVA seed files here: a FASTA-formatted alignment file and the taxonomy file.

I'm going to go ahead and do nano data/references/README.md, and I can say: downloaded SILVA v138 seed files for alignment and taxonomy from, and, looking back at my issue tracker, I'm going to copy this link into here. That's good. I'll go ahead and save that and exit out. If I do git status, I see that I've modified the data/references README. Because I previously ignored data in the .gitignore file, it's not seeing those two files. But to track these changes, I'm going to have to use that git add -f. So git add -f data/references/README.md. And so this change is ready to be committed. I can do git commit -m, download SILVA recreated seed reference, closes #2. Again, by saying closes and then the number of the issue, when I go ahead and push that up to GitHub, it'll automatically close the issue for me.

So I've been on my issue branch. I now need to move over to my master branch (git checkout master) and merge my issue branch back in with git merge issue_2. We now see the changes here. Git status, and we're ahead of our origin/master branch, the remote repository, by one commit. We can go ahead and do git push, which pushes our local repository up to GitHub. I'll refresh this and notice that this issue is now closed. If I go back to my main issues page, I see that I now have three remaining issues, and install mothur is the next issue that we want to take on.
This third issue that we want to work on is install mothur. We want to get mothur so that we can align that FASTA file from the rrnDB against our SILVA seed alignment. We'll probably do that in the next episode or the episode after that. We don't want to track mothur with version control, so we're going to put it in a directory called code/mothur, and we won't track it there, because again, we're not going to be making changes to mothur. We'll be updating from there, but we do want to indicate where we got it from and list the version as a dependency in our main README file. So we're going to change a couple of files here.

In my repository, again, I'm on my master branch. Git branch issue_3, git checkout issue_3, and we're on branch issue_3. Going back to my issue tracker, I'm going to go ahead and click on this link to come to the GitHub page for mothur. I'm working on a Mac, and so I want this OSX-10.14 zip file. If you're using Windows, then you want the Windows version. If you're using Linux, you want the Linux version. I'll go ahead and download the Mac version of mothur by clicking on that link. It downloads, and again, because it's a Mac, it will automatically decompress it for me. If you're on Windows, you'll probably have to do that yourself.

If I go to my home directory, sorry, the desktop directory in my home, I see that I have this mothur directory, and in it is everything I want. So what I'm going to do is grab this mothur directory and drag it to Documents, into my Schloss project directory, and then into code. So now in my code directory, I have a directory for mothur. Okay. And that has everything that I need to run mothur in the future. If I come back now to my repository and do ls code/, I see that I have a mothur directory. And if I look in code/mothur, I see everything I had just seen. So that's great.
I need to update the README file for my code directory. So I'll do nano code/README.md, and I'll say: downloaded version 1.44.1 of mothur from, and I will grab the link here. Okay. I would also like to update my main README file. So I'll do nano README.md, and down here at the bottom, I'll say dependencies, and I'll say mothur version 1.44.1. That will be good, so I'll go ahead and save this.

Now if I do git status, I see that I've modified my main README and I've modified my code README file. But I have also put files into code/mothur, and I don't want to track that. So I'm going to ignore this directory in .gitignore. If I do nano .gitignore, below data I'll go ahead and add code/mothur. Save it, drop out, quit, git status. I now see that it's no longer looking at code/mothur as being untracked, but it now notes that .gitignore has been modified. So I'm going to add these three files to the staging area and then commit them. I'll do git add .gitignore README.md code/README.md. Those are staged, ready to be committed. Git commit -m, and I'll then say install mothur to code directory.

So again, I'm on my issue_3 branch. I now need to go back to my master branch with git checkout master. I can then do git merge issue_3. We see the changes that it's made to master, and I can then do git push. And you know what, I forgot to add to my commit message that it closed issue three. So I'll come back now to my issue tracker, to install mothur. Since I've forgotten to add closes #3, what I can do is leave a message in here indicating where I closed that issue. And I can figure that out. Back here in my terminal, when I ran git commit, it printed this goofy code in brackets next to issue_3, something like e05cfda. That's an identifier for the commit. So I'm going to go ahead and copy that, and then say closed in, followed by that commit identifier. And you'll see that it automatically linked in the commit message that I had.
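
If the terminal output has scrolled away, the commit identifier can be recovered from the history. A minimal sketch in a scratch repository, with a placeholder identity and a stand-in commit:

```shell
demo=$(mktemp -d)
cd "$demo"
git init -q
git config user.name "Demo User"           # placeholder identity, for this sandbox only
git config user.email "demo@example.com"
echo "# demo project" > README.md
git add README.md
git commit -q -m "Install mothur to code directory"

git log -1 --oneline                 # short identifier followed by the commit message
sha=$(git rev-parse --short HEAD)    # the same short identifier, captured in a variable
echo "Closed in $sha"                # text to paste into the issue comment
```

Pasting that identifier into a comment on GitHub is what produces the automatic link back to the commit.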
So if you forget to put closes #3, and I do this all the time, that was not an intentional mistake, it happens regularly, you can go ahead and grab that commit identifier. It's called a SHA. We'll talk about that more later. You can say closed in, then that identifier. In the preview, you can see that it's hyperlinked back to the commit. Then I can click Close and comment, and it'll post the comment and close the issue.

Okay. So we've worked through three iterations of Git flow. We've filed issues. We've done quite a bit today, and we've even gotten data that we can use to press on with our project.

Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce the new skills. Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. I'd love to see how you are adapting what I have covered in this and other Code Club episodes into your own work. Also, please let me know what types of data analysis questions you have, and I'll do my best to answer them in a future episode of Code Club. Finally, please be sure to tell your friends about Code Club, and to like this video, subscribe to the Riffomonas channel, and click on the bell so you know when the next Code Club episode drops. Keep practicing, and we'll see you next time for another episode of Code Club.