I believe you can answer your own data analysis questions. Do you? You should. If you do, stick around for another episode of Code Club. I'm your host, Pat Schloss, and it's my goal to help you to grow in confidence to ask and answer questions about the world around us using data. In the last episode of Code Club, we started a new project. It was a very exciting day. The project we're going to be working on over the next couple of Code Club episodes seeks to determine how inter- and intra-genomic variation limits the interpretation of Amplicon Sequence Variants, or ASVs. As I mentioned, this is a newish approach to analyzing 16S rRNA gene sequence data that's all the rage, but I feel that not enough people are questioning the assumptions that are baked into the method. Over the course of the next Code Club episodes, we're going to test the assumption that ASVs represent a biologically coherent entity. Along the way, we'll learn different elements of what it takes to make an analysis reproducible and how to use a variety of tools that will help out. Even if you don't find the problem we're studying interesting, you'll hopefully find the approach generalizable to a variety of problems that do interest you. Last week, we got familiar with our command line interface using a UNIX-style environment and set up a basic organization structure for the project. For today's episode, we'll add version control to our project and learn to use Git and GitHub to track the history of our project. Git is one of those things that can be very confusing, but it doesn't need to be. We'll take it slow and we'll be using Git in nearly every episode that follows, so we'll get a lot of practice. Please take the time to watch today's episode all the way through. Be sure to follow along on your computer and attempt the exercises. Don't worry if you aren't sure how to solve the exercises. At the end, I'll provide the solutions.
In the notes to this video is a link to a blog post with reference notes and links to supplement the material in the video. Before we get going, it would be great if you could help me out by liking and subscribing to the Riffomonas channel and hitting the bell, so that you know when the next episode is released. If you're like me, you use the undo function in Microsoft Word or Excel countless times a day. Sometimes I need to undo a large number of things to get back to a point before I screwed things up. One problem with the undo function is that if I quit the program, the thread of changes is broken and I can't undo something from a previous session in the program. If we use it right, version control is a lot like the undo function. The big difference, however, is that version control won't lose the thread of changes when we quit an application. It will also keep track of my changes across many files in a project. Even better, it works well with text files like those we'll be using for our project. Here's a brief list of things that version control will help us with. Version control provides a special type of documentation for our project. This will tell us when, why, and how I changed the project. Version control is like a backup system. I can delete a file or even my entire project directory and get it back. Version control facilitates making your work public. I can use version control to share my code with others. Finally, version control enables collaboration. Other people can make suggestions to improve my project. To achieve these benefits, we'll use the most popular version control tool, called Git, G-I-T. We'll be running Git on our computer to maintain what we call our local repository, or local. Note that even if you're doing a project on a computer that's not in the same room as the computer you're typing on, I'll still call that your local repository. We'll also make use of GitHub to host our repository as a remote repository, or remote.
Let's go back to our project directory and see how we can use Git and GitHub to facilitate reproducibility in our data analysis. Go ahead and open up your command line interface. Again, if you're using a Mac, that's going to be the terminal application. If you're on Windows 10, hopefully you've installed the Ubuntu client so that you can use their command line interface. Whether you're on Windows or Mac, it shouldn't make a difference as long as you have those tools installed. The key is that you're going to have a terminal window like this. Again, some of the styling might be a little bit different. The font might be a different color. The background might be a different color. That really doesn't matter. Ideally, you would have this dollar sign as a prompt followed by this gray rectangle that is awaiting your input. Okay, so a way that we can test to make sure that we've got everything we need is to type Git hyphen hyphen version. So Git, G-I-T, is the version control software that we're going to use. And this flag, the dash dash version will tell us what version of Git we have. I happen to have 2.26.0. The version you have probably doesn't really matter. I don't know that I've ever felt like I was kind of left behind because I had an older version of Git at different times. I don't even know what the current version is. But the updates that are made don't seem super relevant for a lot of my practices and probably won't really matter that much for what you're doing, as long as you have a fairly modern computer. Okay, so again, to remind ourselves of what we did last time, we can type LS at the prompt to get an idea of what is in our current directory. We can use PWD to get our current working directory to print the working directory. And we're in our home directory that you can see here with this tilde sign at the prompt. And what we'd like to do before we get going too far is to go ahead and configure our Git environment on our computers. 
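As a small sketch of the checks just described, the three commands can be run one after another at the prompt. The version number and directory listing you see will differ from the ones in the video.

```shell
# Confirm that Git is installed and report its version.
git --version

# Print the current working directory, then list what is in it.
pwd
ls
```

If the first command reports that git is not found, Git is not installed or not on your path, which is worth sorting out before going further.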
So if you've already done this before, sit tight, maybe you'll learn something new or maybe you'll see that you might have something missing. So we're going to go ahead and type git config --global, and then we'll say user.name. And here we're going to give it our name. So this would be like your first name, last name. So I'm going to do Pat Schloss. You're not Pat Schloss, I don't think. Go ahead and put in your name. Then we'll do git config --global user.email. And so I'll put in my email. Again, you want to put in your email, not my email. Then we want to put in our GitHub user account. So hopefully you've already looked through the notes and you've set up a GitHub account. If you haven't done this already, I encourage you to look at the notes that are associated with this video to see how to set that up. So git config --global github.user, and mine is pschloss. And the next thing that we're going to set is how Git handles your line endings. Unfortunately, Mac and Windows use different line endings. So you'll do git config --global core.autocrlf. If you're on a Mac, you're going to put input. Now, if you're on Windows, instead of input, you'll use true. The next setting we want to set is to tell Git our text editor. The default text editor for Git is called vi. People that use vi absolutely love it. The rest of us find it a little bit frustrating. It's particularly frustrating to people that have never seen it before and aren't really comfortable with text editors. So we'll use a text editor called nano. To set the text editor to nano, we'll do git config --global, and then we will say core.editor and, in quotes, nano -w. This says the editor that we're going to use is nano. And we're going to give it the -w flag, which tells nano not to hard-wrap long lines as we type.
OK, so these five commands are going to set up the basics for our Git environment. The one thing that you may not have, again, is this GitHub username. It's important that you get that. So I would encourage you to pause the video right now. If you don't already have a GitHub user account, go figure that out and then come back and insert that here. Now, this git config and these different parameter names seem a bit confusing. They are to me. I do not have these memorized. Whenever I go through and set these, I always Google git config file and look for different settings to use. This is not something that needs to be taking up a lot of room in your brain. So one thing we can do is do git config --global --list. And so this will then output the contents of our Git config file. Now, I have other things in here because I've been using GitHub and Git for a long time. There are different things you can create, like things called aliases and other such things. Don't worry about those for right now. What's really important is that you have a user name and a user email, and that those are yours, not mine, and that you have your GitHub user. And, where is it, down here: your core editor is nano -w. Your core autocrlf is input if you're on a Mac and true if you're on Windows. And so if you don't have those values, then go ahead and rerun the commands. And if you missed them as I was going through here in the video, they are listed in the notes that accompany this video. So as we've been typing in these git config commands, you've noticed that we've used this --global. And so this will set our Git environment globally across our computer. And so theoretically, you could set settings for your individual repositories that we're going to create, your individual projects. I don't do that. I really only have this global set of settings. OK, so Git is set up on our computers.
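The five settings can be sketched as follows. To keep this safe to run as-is, the sketch writes to a scratch file with `--file` instead of your real global settings, and the name, email, and GitHub user are placeholders, not values from the video. To configure Git for real, swap `--file "$demo_config"` for `--global` and put in your own details.

```shell
# Scratch file standing in for the global configuration (an assumption of
# this sketch, so that running it does not touch your real settings).
demo_config="$(mktemp)"

git config --file "$demo_config" user.name "Your Name"           # placeholder
git config --file "$demo_config" user.email "you@example.com"    # placeholder
git config --file "$demo_config" github.user "your-github-name"  # placeholder
git config --file "$demo_config" core.autocrlf input             # use true on Windows
git config --file "$demo_config" core.editor "nano -w"

# Review what was written; the global equivalent is
# git config --global --list
git config --file "$demo_config" --list
```

The listing at the end should show each of the five keys with the value you gave it.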
Also, I guess I should say that if you didn't get Git to work, then that probably means something was installed wrong. And it probably means you're on Windows. And that probably means you either don't have Git Bash or you don't have the Ubuntu shell installed. I'd really encourage you to get the Ubuntu shell installed. I have a link in the notes from last week's Code Club episode that you can go back and consult for how to install it. All right, so let's go ahead and navigate to our project. And if you'll recall, we put that in our documents directory. So we can do cd documents. And then cd Schloss RRN analysis XXXX 2020. And we can look at what is in our directory by typing ls. And we see at this point, we have a license file, a README file, a code directory, a data directory, an exploratory directory, and a submission directory. So I remembered that those were directories. But of course, we could also do ls -F, which we talked about last week. And this then shows you what is a directory and what's not. We could also do ls -FR, with the R for recursive, to see that we have other directories within data. We have a mothur directory, a processed directory, and raw and references directories. And then within all of these directories, we also have a README.md file. And these are all blank. Okay, so the next thing that we'd like to do is to create a Git repository. So our repository is a collection of information about your project, about the contents of your directory. And so this is going to keep track of all the contents of your directory. It's going to keep track of any changes you make to that directory. It's going to keep track of when you made them. It's going to also keep track of any messages that you write down indicating what you changed. So to do this, I'm in my project directory. And I know that because if I do pwd, I see that my current working directory is the Schloss RRN analysis directory. So one of my favorite Git commands to run is git init, space, period.
And so again, I only run this a couple of times a year. But what this means is that I'm starting a new project. So I'm initializing a new Git repository. So Git init space, period. And so if I run that, it tells me that it initialized an empty Git repository in my current working directory. And you'll see that the directory it creates is .git. So if I type ls, I don't see that directory .git, right? But if I do ls-a for all, this will show my hidden directories. And so I'll see that .git is in here. So .git is what makes this project now, this directory a repository. If you delete that, it's no longer a repository. So we do not, absolutely do not want to delete .git unless you have a really good reason to do so, which I don't think we do. So we have a repository now. So what do we do with it? Well, the first thing that we can do is we can look at the status of our repository. And so there's probably five commands that I use whenever I'm using Git. So that's going to be Git status, Git add, Git commit, Git push, and Git pull. So status, add, commit, push, pull are the five. If you know those, you'll know 95% of what I know about Git. But really, status, add, commit are the three that I use most frequently. So we'll type Git status. The output gives us a few pieces of information. It tells us it's on branch master. Maybe you don't know what that means quite yet. We have no commits yet. We haven't done anything. There's a bunch of untracked files. There's license.md, readme.md, and then there's these directories, code, data, exploratory, and submission. So nothing added to commit, but untracked files present, use Git add to track. And so the way the repository works in Git is that we have the repository, which again is this database of files that it's keeping track of, the changes that have been made to it, who made the changes, when they were made, messages associated with those changes. 
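Here is a minimal sketch of initializing a repository and checking its status. It builds a throwaway directory rather than using the real project, and the file names and layout are stand-ins assumed for illustration.

```shell
# Throwaway project directory; the layout mimics, but is not, the real project.
project_dir="$(mktemp -d)"
cd "$project_dir"
touch README.md LICENSE.md
mkdir code data exploratory submission
touch code/README.md data/README.md exploratory/README.md submission/README.md

# Turn the current directory (the period) into a repository.
git init .

# A plain ls hides the .git directory; ls -a shows it.
ls
ls -a

# Nothing is tracked yet, so the files and directories are all listed as untracked.
git status
```

Deleting the `.git` directory would turn this back into an ordinary directory with no history, which is why it should be left alone.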
It will also have other files that are in your project but aren't yet in the repository. So what it's telling us is that all of these files are untracked. They are not in our repository yet. There is nothing in our repository. What we'll do is we'll add a file to our repository. And what that will do is stage the file, as it's called. And then we will commit the change, and the commit will move that change, in this case a new file, into the repository. First, what I want to do is use this nano program to look at what's in readme.md. We created it last time using the touch command. So we can do nano readme.md, and this opens up a blank window, right? This is still in our terminal window, and we see a variety of pieces of information, but nothing is here. Okay, so I'm going to leave it empty for now. And to get out of it, we can do Control-X. It will ask if we want to save, and so I'll say Y to save it, and it'll write it to readme.md, and then it brings me back out. Very good. So now I can do git status to see nothing has changed, and I will do git add readme.md. So this is going to add the readme file that's in the root of my project directory. All right, very good. No warnings or error messages. Let's type git status. We now see that there's a new section here called changes to be committed, and it tells us there's a new file called readme.md. Very good. So I'm ready to commit this. So I'll say git commit, and now I want to give it a message. Okay, so if I type git commit, what this will do is open up nano for me and allow me to type in a message. But I could also do dash m, and then in quotes, I can put a short message. There are a lot of resources out there on what makes for good commit messages. In general, it boils down to a declarative statement that tells somebody why a change was made, right? So you've got the code to tell you what was changed.
The commit message tells you why a change was made. And again, that somebody you're writing this to is most likely going to be you several months from now. So I will say add readme file to repository. Pretty simple. And now if I type git status, I see that file has moved out of changes to be committed, and it's no longer one of my untracked files. Okay, so let's go ahead and modify that file to see how this process plays out over a few iterations. So again, I'll use nano readme.md, and then I will say, what should we say? So I will put across the top here: Code Club project assessing whether intra and inter genomic variation hinder utility of ASVs. And that kind of ran off the screen. So maybe I'll just put in a line break here, like that. Okay, so that's a sentence. This is a readme file describing the project and what's going on. So now I can do Control-X. Do I want to save? Yes, write to that, and we're good. Okay, so now if I type git status, I see that where I had new file before, I now have a modified file. So modified readme.md. And so to commit the change I made, I will then do git commit dash m. And I'll say give title to project. And it's not happy with me. Oh, it's not happy with me because I didn't add it. I did not stage the change. I ran git status and then git commit. That's not going to work. So I want to do git add readme.md, git status. I see now that it's been modified and it's green. It says changes to be committed. And again, I can do git commit. I'm going to do my up arrow to get back to that message I had before. Hit enter. Git status. And I see I'm back to being good. Okay, so let's do this one more time. So we'll do nano readme.md and I will say, what should I say? Developed over a series of, I'll say, Code Club episodes led by Pat Schloss to answer an important question in microbiology and develop comfort using tools to develop reproducible research practices. All right. So again, I can save this.
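The status, add, status, commit cycle from the first two rounds can be sketched like this. It runs in a throwaway repository with a placeholder identity, and it uses echo in place of nano so that it can run without an editor; those are assumptions of the sketch, not what was typed in the video.

```shell
# Throwaway repository so the cycle can be tried safely.
demo_repo="$(mktemp -d)"
cd "$demo_repo"
git init .
git config user.name "Demo User"          # placeholder identity, this repository only
git config user.email "demo@example.com"  # placeholder

# Round one: stage the new, empty file and commit it.
touch README.md
git status
git add README.md
git status
git commit -m "Add README file to repository"

# Round two: modify the file (echo stands in for editing with nano).
echo "Code Club project assessing whether intra and inter genomic variation hinder utility of ASVs" >> README.md
git status

# Committing before staging is refused because nothing has been added.
git commit -m "Give title to project" || echo "Refused: stage the change with git add first"

# Stage, check, then commit.
git add README.md
git status
git commit -m "Give title to project"
git status
```

The refused commit in the middle is the same stumble as in the video: a modified file has to be staged with git add before git commit will pick it up.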
Another way to save it, besides doing Control-X, would be to do Control-O, and it says file name to write. Enter that. And so Control-O is this down here, Write Out. Again, that's another way of saying save. Then when you do Control-X, it brings us out. Again, I do git status. I see that these are changes not staged for commit, including readme. So I need to stage it. To stage it, I do git add readme.md, git status. It's now listed under changes to be committed. I will then do git commit dash m. And I will then do, what should I do? Give longer description of project. And I'll hit enter. And I see now that one file changed. There are three insertions, and I can do git status and see that I no longer have anything staged. There are no changes that are being watched and everything is good. So I have my local repository now. It doesn't have much in it. The only thing in my local repository is this readme file with a couple of lines added to it. I'd like to get that, and in the future other things, up onto GitHub to my remote repository. Again, what we've been working in is our local repository and we want to move that to our remote repository on GitHub. So to do that, we'll go ahead and open up our browser and make the window a bit bigger. And we'll go to github.com. And if you have an account, this will bring you into your account page. Again, my page is going to look a bit different than yours. I'm involved in a bunch of different projects. I use Git and GitHub a lot for everything I do. And so what you'll see over here next to repositories is a green button for new. So let's go ahead and click on that. This allows us then to create a new repository. And the owner is me. And the repository name is going to be Schloss underscore RRN analysis underscore XXXX underscore 2020. So I like to use a repository name on GitHub that matches my project name on my local computer. So let me just double check that I got that right. Schloss RRN analysis XXXX 2020.
Very good. So we can always change this down the road. That'd be good. We could put in a description. So Code Club project, analyzing utility of ASVs. I'm going to make this public. So depending on how you have your GitHub account set up, you may or may not be able to use private repositories. In the notes for today's video, there are links for how, if you're an academic, a student or an educator, you can get an academic use account with GitHub, which would give you unlimited private repositories. Otherwise, if you want private repositories, you have to pay for it. Public repositories are always free. So I'm going to make this public because I want people to see what I'm doing. So I'm going to skip this step, the one for if you're importing an existing repository, because I have a local repository that I'm going to use. So I'll hit this green button to create a repository and it's going to go ahead and make this repository. And very nicely, it gives me instructions on what to do next. And so I'm going to push an existing repository from the command line. I've already got a repository, right? And so what I want to do is push it from the command line. And they have these nice little icons on the right side. If you click it, it'll copy for you. And I can then come over to my terminal and paste. Hit Enter. And what this is doing is connecting my local repository to GitHub. And it's then pushing my local repository up to GitHub. So push allows us to push things up. Okay. So we've pushed it up. Now let's go back to GitHub and see what things look like. So I can come over here and refresh my screen. And I see, aha, we have a directory. It's a readme file. And in that is the text we wrote, right? This file, this readme, by the way, will always be here regardless of whether or not we add extra files. But I can click on the link here to see what the contents of the file look like. Okay. So I'm going to click on that to come back to the project root.
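The pasted instructions are not read out in the video, but connecting an existing repository to a remote and pushing it typically amounts to two commands: one to register the remote under the name origin, and one to push. The sketch below assumes that form and uses a local bare repository as a stand-in for GitHub so that it can run offline. The paths and identity are placeholders, and on GitHub the URL to use is the one shown on your new repository's page.

```shell
scratch="$(mktemp -d)"

# Local stand-in for the empty repository created on GitHub (an assumption
# of this sketch; with GitHub you would use the repository URL instead).
git init --bare "$scratch/remote.git"

# A local repository with one commit, as in the walkthrough.
mkdir "$scratch/project"
cd "$scratch/project"
git init .
git symbolic-ref HEAD refs/heads/master   # use the branch name seen in the video
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder
touch README.md
git add README.md
git commit -m "Add README file to repository"

# Connect the local repository to the remote and call it origin.
git remote add origin "$scratch/remote.git"

# Push the master branch and remember origin/master as its upstream,
# so that later a plain git pull or git push knows where to go.
git push -u origin master
```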
And we'll do a lot more in GitHub for the next Code Club as we start to brainstorm how we're going to work through this project. So let's come back to our local repository and let's make another change to our readme file. So I'll do nano readme.md and I will add in here author Pat Schloss. So you go ahead and add your name. It doesn't matter what you type, just type something to modify this file. And again, you can do Control-X and save, or Control-O and then Control-X. I seem to always do Control-O and then Control-X. And now we're going to go back through our steps. I do git status, git add, git status, git commit. That's the way I do it. I find that's the safest way to do it. There are ways to do it all in one step. I find that when people do it all in one step, they tend to commit things to the repository that they might not want to commit to the repository. In the future, we'll talk about how Git actually doesn't do very well with really big files. And so if you had a movie, you wouldn't want to put that into GitHub. If you have a really large data file, you don't want to put that into GitHub. So again, breaking this down, git status. We see we have changes not staged for commit, the modified readme file. I can then git add readme.md, git status. I see that it's now a change to be committed and I can do git commit -m add author name. And again, if I do git status, I see it's gone. I see everything is good to go. And as you notice, as we go through these steps, by doing it stepwise instead of all in one step, what you'll see is that it gives you opportunities to discard changes. So if I had done git restore readme.md, that would have discarded what I added. And when we got to this step, it said you can use this to unstage it and to undo it. And so again, if you take it piecemeal, step by step, that's the safest and least frustrating way to use Git, I have found. All right, so we've committed that. Our repository is up to date.
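As a sketch of those opportunities to back out, here is git restore used at both stages: once on an edit that has not been staged, and once on an edit that has. It assumes a Git version that includes git restore (2.23 or later) and uses a throwaway repository with a placeholder identity.

```shell
demo_repo="$(mktemp -d)"
cd "$demo_repo"
git init .
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder
echo "Code Club project" > README.md
git add README.md
git commit -m "Add README file to repository"

# An unwanted edit that has not been staged: discard it.
echo "Oops" >> README.md
git status
git restore README.md

# An unwanted edit that has already been staged: unstage it, then discard it.
echo "Oops again" >> README.md
git add README.md
git status
git restore --staged README.md
git restore README.md

# Back to a clean working directory that matches the last commit.
git status
```

Note that git restore throws the edit away for good, so it is only for changes you are sure you do not want.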
This tells us in the final git status that our branch, where we currently are, is ahead of origin/master, which is GitHub, by one commit. So we'd like to push that commit up to GitHub. So the safest thing to do is to first do git pull. And what this will do is take your remote repository, bring it down to your local computer, your local repository, and then merge any changes that have happened. And so that's the safest thing to do because perhaps you've forgotten changes that you made. You can actually make changes on GitHub, or other people could make changes and commit them to GitHub if you're working as a team. And so you want to pull those changes in before you push your changes up. So we can then do git push. We then see output that it has pushed up a variety of changes that we made. We can then come back to our repository, our remote. It tells us that we made a commit two minutes ago. If I then hit refresh, I see now that author Pat Schloss is in my readme file. So very good. So this is the process that we go through to add files, to make changes to files, and to keep track of that history. And those little messages we write are kind of like love notes that we write to future us. And as you can see, that shows up here as kind of a description of what was recently done to the file and when it was made. And so the GitHub interface is very nice for looking at what's going on over the history of a project. And it tells us we've made four commits. So it's very handy. And we'll see how to do some of these things in the terminal and the command line interface as we go through future Code Club episodes. At this point, what I'd like you to do is to pause the video after I describe three exercises that I'd like you to work through. So the first exercise that I'd like you to do is to add more files, and actually the rest of the organization structure, to our GitHub repository.
So as an exercise in the previous Code Club episode, you made readme files that went into all of the directories. So I'd like you to add all those readme files to the repository. You don't have to add any text to them at this point, but go ahead and add them to the GitHub repository. You can do this in one step or many steps, whatever you feel most comfortable with. The idea is to get practice with it. Then once you've committed the changes to your repository, go ahead and push the changes up to your GitHub account. Next, I'd like to use the MIT license for this project. Our license file is currently empty. And if a license isn't provided, then it's assumed that it has a very restrictive copyright. We want to be permissive. We want other people to use our code to build upon it, but we also want them to give us attribution. So the MIT license does that for code projects. And so I have a link here on the slide, opensource.org, forward slash licenses, forward slash MIT. And so I want you to copy the text of that license into your license.md file. Don't make any changes to the file. Save it and commit the changes to your repository. And for the third exercise, what I want you to do is go into that license file and indicate the year and your name as the copyright holder. I'm actually not sure who the copyright holder is if you're copying everything I do. Again, we won't worry about that. Go ahead and add those changes, commit the changes, and then push the changes up to your GitHub account. And then look at your GitHub account to see how the project is becoming fleshed out. Pause the video, and once you unpause it, I'll show you how I worked through those solutions. So what I'd like to do is add all those other readme files to my repository and then push that up to GitHub. So again, if I do git status, this will give me a sense of what's there. And so I can do git add code forward slash readme.
I can add multiple files on a single line. I could do data, mothur, or I guess I had a readme there, right? Data, readme. I had data, mothur, readme. Data, raw, readme. Data, processed, readme. Data, references, readme. And I run that. And so I have these files that are changes to be committed, they're new files, but I've forgotten to also add exploratory and submission. So I can add those in a second add command. So I can do git add exploratory, readme, and then submission, readme. All right, very good. So we've got all those readme files staged, and I can then do git commit and say add readme files across project organization. Very good. And I see that the only untracked file I have now is my license.md. My branch is ahead of origin/master by one commit. Nothing is staged. Nothing has been changed. We're good to go. I'll go ahead and do git pull, to be safe. Everything's up to date. Git push pushes these changes to the repository. If I look and refresh my GitHub page, I now see that the directory structure for code is there. If I look at data, I see I have subdirectories for that as well. And things are looking pretty good. Very good. So if you got that, good job. You have the basic workflow that again serves 95% of what we're going to be using with Git. So the next thing we want to do is add a license. And again, if we go to opensource.org, forward slash licenses, forward slash MIT, this brings us to this page. I'll make the font a little bit bigger. And so this text between the copyright and the word software is the license. So I'm going to go ahead and copy this. And then I'm going to open up my license file. So nano license.md. And I can then paste here. And the line wrapping here is a bit funky. Don't worry about that. We'll go ahead and save it and quit. Git status. We see that license.md is untracked. I will go ahead and git add license.md. We see that the changes are now ready to be committed. It's staged. And I will say put project under MIT license.
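The solution to the first exercise can be sketched as below: several paths staged with a single git add, a second git add for the ones that were forgotten, one commit, then a pull before the push. As an assumption of the sketch, a local bare repository stands in for GitHub, and the directory names, file name capitalization, and identity are placeholders.

```shell
scratch="$(mktemp -d)"
git init --bare "$scratch/remote.git"     # local stand-in for GitHub

# Starting point: a repository whose first commit is already pushed.
mkdir "$scratch/project"
cd "$scratch/project"
git init .
git symbolic-ref HEAD refs/heads/master   # use the branch name seen in the video
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder
touch README.md
git add README.md
git commit -m "Add README file to repository"
git remote add origin "$scratch/remote.git"
git push -u origin master

# Empty README files in each directory, as made in the previous episode.
mkdir -p code exploratory submission data/mothur data/raw data/processed data/references
touch code/README.md exploratory/README.md submission/README.md data/README.md \
  data/mothur/README.md data/raw/README.md data/processed/README.md data/references/README.md

# Several paths can be staged with one git add.
git add code/README.md data/README.md data/mothur/README.md data/raw/README.md \
  data/processed/README.md data/references/README.md
git status

# The two that were forgotten go in with a second git add.
git add exploratory/README.md submission/README.md
git status
git commit -m "Add README files across project organization"

# Pull first to be safe, then push.
git pull
git push
```

Staging the paths by name, instead of adding everything at once, is what keeps untracked files such as the license out of this commit.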
Git status. And something that occurred to me is that we didn't change the date or the name. That was exercise three. So I'll go back into nano license.md. And I'll put in 2020 Patrick D Schloss. That's me as the copyright holder. I'll save and quit. If I do git status, I see now that we have changes not staged for commit, a modified file. I can do git add license.md. Git status. It's ready to be committed. And then I will say add year and copyright holder, and git status. One of the nice things is that if you put in typos, Git does a pretty good job of asking, did you mean git status instead of what I typed? Yeah, git status. And so everything is good. We're two commits ahead of origin/master. So let's pull to double check. Everything is good. Everything's up to date. And then we push our changes up to GitHub. And we now go to GitHub, hit refresh. And we now see that our license file is included. And it's here as the MIT license. Okay. And also, one of the nice things that GitHub does for you is that it recognizes that this is a license file. And it tells you a little bit more about the license. Commercial use is allowed, modification is allowed, distribution is allowed, private use is allowed. The condition is that the license and copyright notice have to be included wherever this code is used. This is generally considered a very permissive license and is commonly used among people that are trying to foster open and reproducible research. Okay. So we're getting set up. In the next Code Club, as I mentioned earlier, we'll go through and we'll start doing some brainstorming on GitHub about where we want to go. And we'll start building out our project and doing our analysis. Thanks again for joining me for this week's episode of Code Club. Be sure that you spend time going through the exercises on your own to help reinforce your new skills.
Even better would be for you to take the ideas we've worked through today and think about how they relate to your current projects. I'd love to see what you did. Please feel free to drop a line in the comments below to tell me where your repository lives on GitHub. It'd be great to see how you're progressing through the project. Also, please let me know what types of data analysis questions you have and I'll do my best to answer them in a future Code Club. Finally, be sure to tell your friends about Code Club and to like this video, subscribe to the Riffomonas channel and click on the bell so you know when the next Code Club video drops on Thursday. Keep practicing and we'll see you next time for another episode of Code Club.