Welcome back to the Riffomonas Reproducible Research Tutorial Series. In today's tutorial, we'll be doing a deep dive into version control with a program called Git. We've already seen and used Git in the paper airplane exercise, where we used Git on GitHub, and in the last tutorial, when we created a new repository for the Kozich reanalysis, which we'll be using for the rest of this tutorial series. As we'll see, version control is useful for making our methods open to others, collaborating with others, and tracking the history of our project. These are all important for making our analyses more reproducible. I think version control also helps with reproducibility because it gives me another type of documentation for my analyses. Not only can I track the changes to my code, but I can also annotate those changes with the who, what, when, and why of a change being made. These annotations become little notes that I leave myself for the future to mark what I was thinking and where I was going.
I know I'm weird, but for me, pounding out git init at the command line to create a new repository is one of the best feelings I have in science. It means I'm starting a new project. Of course, I may only get to type that command a few times a year. Most of the time that I'm using Git, however, I'll only be using five or six commands to do everything I want. That's what we're going to cover today. It's easy to get flustered with Git and to get confused. Take it slow, practice, and try to understand what we're doing. As you go through the materials today, I'll try to emphasize how to safely use Git for your projects. Like the last tutorial, this one will have a lot going on in it. Again, feel free to take it in chunks. Use that pause button to stop my yammering and practice things for yourself. If you get stuck, feel free to use the comment box below or to send me an email.
Now, join me in opening the slides for today's tutorial, which you can find within the Reproducible Research Tutorial Series at the riffomonas.org website. Similar to the previous tutorials, before we get going on today's material, I'd like to have you do a little exercise. So let's assume that you're at the root of your project directory, and there are five files in your root directory that all end in .r. These are R scripts. So based on what we talked about in the last tutorial, what would be the commands that you would need to write? Two commands: one to create a directory called code, and then one to move those R script files into the directory. So go ahead and think about this or write them out longhand. Feel free to hit pause if you need some time to think about it.
Great. So hopefully you came up with a way to move those files into a directory called code. Now there are several ways to do this. And again, the key is that you get the files into the directory called code. How many lines it takes you to do it or how you do it doesn't really matter. So one option would be to say mkdir code, and then individually move file1.r to code, file2.r to code, and so forth for all five files. If you do this, you might find that, well, five might be the limit of the number of files that you would want to do this way, just because it gets kind of long and tedious and drawn out. Certainly if you had a hundred files, you wouldn't want to do this. Alternatively, you could do one mv command, one move, where you have the five R files all on the same line going into code. It's effectively the same thing that you had in the previous example, where you had the five mv commands, but we're doing this in one line. Okay, so this is still a lot of typing. It's still kind of tedious. A final option that's pretty succinct and straightforward is to, again, make your directory for code, and then you can do mv *.r code. That star tells bash to match anything that comes before a .r.
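As a rough sketch of that last option, here is how it might look if you try it in a scratch directory. The scratch directory and the placeholder file names file1.r through file5.r are assumptions for the demonstration, not part of the project.

```shell
# Work in a scratch directory so nothing real gets moved (assumption for this demo)
cd "$(mktemp -d)"

# Stand-ins for the five R scripts at the project root (hypothetical names)
touch file1.r file2.r file3.r file4.r file5.r

# Make the directory, then let the star match every file name ending in .r
mkdir code
mv *.r code/

# Confirm that the scripts now live in code/
ls code
```

The same two lines, mkdir and mv, would work unchanged whether there are five matching files or a hundred.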
So if we had a hundred files that ended in .r, this one line with that star would allow us to take all those R files and move them into code. Okay, so hopefully this was review, and these little bash skills are really helpful to practice as we think about working within the command line environment.
The learning goals for today are to define and justify the use of version control. We're going to talk about different types of approaches to version control. We're going to apply version control to two projects. We're going to think about documenting the changes and the differences between different versions of our repositories, and then we're going to look through the history of a project to see how a file has changed. And then we're going to start using Git with our big analysis, this Kozich reanalysis project that we will be working with for the rest of the tutorial series.
So to get started, I want you to think about whether or not you keep track of different versions of things. Do you back up your data or your code? So hopefully you have a backup of your data and your code. If you don't, maybe you should start. But where do you store this? How do you store it? Perhaps you're using Dropbox or Box. Perhaps you have things stored on Google Docs or something like that. Perhaps you know that the server you're working on or the computer you're working on has Time Machine, so there's a backup there. And so those are all ways to back up your code or your data. And have you ever needed to go back to a backup? Under what circumstances have you had to go back to something? Perhaps it's because you accidentally deleted something, say your thesis directory where you've got all that great writing you're doing.
Perhaps you were also doing some programming and you found that you had introduced a bug at some point, and you needed to go back to a previous version of the code from before the bug because you're not really sure where you introduced it. So these are all reasons why you might want to keep track of different versions of things. So if we have different versions of things, then we might run into problems. We might have conflicts. We might have difficulty knowing which is the correct version. So how do you keep track of these different versions of your project? Where are they kept? How do you compare the different versions? How do you know the differences between the versions? And how difficult would it be to go back to a previous version? So you can imagine that things get very complicated very fast when you have multiple versions.
And so one place that we see multiple versions being used a lot is when you're working on a manuscript. So if you're a trainee, you might be the person in the lab that's responsible for writing the first draft of the manuscript. And then you share it with other people in the lab and perhaps your PI or perhaps other collaborators. And very quickly you might have five or six different versions of that manuscript floating around. And so now you have this problem of how do you keep track of the different versions? Where are they when they come back to you? How do you organize them to know whose version was what? And so one of the things that we can perhaps all relate to is this comic from PhD Comics of somebody's data for a project that they're working on, where they start out with great intentions for this data from 2010, 05, 28, so May 28, 2010. But then they just go crazy because they can't keep track of what's going on or perhaps bugs they're introducing. And so finally to make it clear they say, use this one. But a couple days later they realize that they're starting over.
So this is something we can all relate to, where we think we've got the right version but alas, we don't. We've got to keep iterating to try to improve the code or the data analysis or the data generation as we go through. What I'd like you to think about is something that you're probably already familiar with: as a graduate student you were taught the proper methods for keeping a lab notebook. You were perhaps told to give each experiment a title, a rationale, and a hypothesis, and to describe your methods, data, results, and conclusions. And so now that you're in grad school, now that you're doing more bioinformatic analysis, perhaps you have a project that involves 10 million 16S rRNA gene sequences. So what are you going to do with that in a lab notebook? So what I'd ask you is, have you had this situation in the past? How have you dealt with it? I think a lot of people struggle with thinking about how to organize something like a laboratory notebook for keeping track of their data analysis. And so what are the strengths and weaknesses of the approach you've used? So perhaps you write everything longhand with pen and paper, or perhaps, as another option, you're copying and pasting things into a Word document. So the problem with those, of course, is that it's a pain in the butt to write out all the code for a project you're working on.
So here's an example of a notebook that I kept about 10 years ago in 2008. This is literally my handwriting. This is my research notebook from November of 2008, where I am writing out, as you can see on the left side of this notebook, bash commands. On the right side are the calls to Perl scripts that I had written, and a justification for what was going on. And this notebook I still have with me, and it has about 200 pages of this stuff in it. So I have a lab notebook, but it was painful to write. You can imagine with the various numbers that are in here that perhaps I'd transpose some numbers or I'd get it wrong.
Perhaps I want to go back and change a parameter. Then I need to notify myself on this page that, hey, if you go forward to page 250, we have new results. So it's there, but it's not really useful. And it's prone to errors. The other thing is that I can't easily copy and paste this into my bash shell. So if you had a Word document, that's nice too because it's now at least an electronic format, but the Word document isn't executable. I can't run that Word document to run my analysis. I'm always going to have to be copying and pasting and moving things around, and I don't trust myself very well to get that all right. If I needed to change a parameter or to add new data, again, it's going to be another page or two in my laboratory notebook or in my Word document to keep track of things. And I might be tempted to write over things in my Word document, losing kind of that history of what I had done previously.
There's also a big push now for electronic lab notebooks. I know here at the University of Michigan there are a lot of options for bench science. And again, these all seem to have kind of a proprietary feel to them, in that if you move institutions or if you want to share your notebook with someone else, they need to have access to the lab notebook system you have. The other problem, of course, is that if they're designed for wet lab work, then they might not be geared for working in computational science where, again, we're running commands or scripts from a command line on, say, a high-performance computer or on Amazon. And so it's a way to document things, but it's perhaps of limited use.
As a case study, we might return to this idea of drafting a manuscript. And so perhaps you've had the experience that this guy over on the right side here has: he finishes his manuscript and calls it final.doc. The manuscript is good to go, right? Give it to the PI. His name is like Professor Smith or something. Maybe it takes the PI a couple of weeks to get feedback to the student.
In the meantime, the student found some typos. So they make some changes. At the same time, the PI is making changes. And then it comes back together. And now there are conflicts between your versions, the PI's versions, perhaps your collaborators' versions. And so you've got this mess of text. And track changes is great, but it can only get you so far when you're merging differences between different versions of a manuscript. And so again, you have this naming problem that you see on the right side here of this slide. And it's a big problem.
And so as you might predict from the content or the title of this tutorial, one of the tools that we have that can help with this is version control. And so what we're going to talk about is a tool called Git, G-I-T. And Git is frequently a source of confusion and headaches for people. I think that's because it seems totally different from what people think they've experienced before. But the reality is, as we've already talked about in the previous slides, you've seen things like this already, right? You have some naming scheme for all those different files that you're trading with Dr. Smith. Perhaps you're using Dropbox, where each day you archive all your files into a zip file and you put those into Dropbox or something like that. Or you're using a laboratory notebook where you're commenting on what you're doing. And so Git and GitHub, which we've already seen, are useful tools for helping us to think about how we can maintain versions and something called branches, which we'll talk about in a future tutorial, and how we can look at these versions and integrate different versions so we don't have this big mess that we commonly experience when we're trading files back and forth. We've already talked about Git and GitHub when we did our paper airplane example and then when we started the Kozich reanalysis repository in the previous tutorial. GitHub is a web service that's built upon Git. So GitHub is not Git.
It uses Git, but it's not Git itself, and it does a great job of making Git more accessible. And we've already seen that when we were trying to push our repository from the Kozich reanalysis up to GitHub, it gave us really nice instructions on how to do that. So public repositories are free on GitHub. Public means that anybody can see it. Some people might be a little bit unsure about making their repositories public. If you're an academic and you contact GitHub, then you can get unlimited private repositories. I would encourage you to try to be public as much as possible. Again, one of the things we're hoping for is greater collaboration. If nothing else, when you go to submit that manuscript, be sure that the repository has been made public. I've reviewed papers in the past where they reference a GitHub repository and it's been private. We accidentally submitted a paper once where the repository was private, and that was a bit embarrassing. So be sure to add that to your checklist of things to do before you submit a paper: make sure that your repository has been made public or, if you're comfortable with it, make it public from the start.
You can also create a group account for your lab, and so my group has an account called SchlossLab. That's kind of a final resting place for all the lab projects and papers. So if a former student in my lab had a repository that was connected to his or her account, and they leave and then accidentally delete the repository or their account, well then it's lost. But if that repository had been stored within SchlossLab, then I keep it forever, so to speak. So I'd really encourage people to set up lab accounts that can then be the final resting place for all their lab projects and papers.
So again, Git is hard. I don't want to diminish that fact, but you can do it. It's not insurmountable. And I feel like Git and GitHub go out of their way to really try to make it easier.
And again, everything that I use in Git I'm going to be presenting in today's tutorial. So you may not realize it, but you're already a Git pro. As we've already talked about, we used Git and GitHub during the paper airplane activity. We also talked about it in the last tutorial in starting the Kozich reanalysis project. We added files to the repository. We committed them. We linked to a remote repository. So these are all aspects of Git that we've already used that you perhaps didn't appreciate at the time. So again, Git can be frustrating and hard, but it really doesn't need to be, and I'm probably making this a bigger deal than it really needs to be.
So we're going to return to the command line now and start to use Git and customize Git to work with a couple of projects. So first I need to log in to my AWS instance. So hopefully this is second nature to you now, but I'm going to quickly log in. Great. So like I said, hopefully this was familiar to you. As you saw me typing here, I made a few goofs myself. So again, you can go back to the notes from one of the previous tutorials on working with remote computers using AWS EC2 instances if this seems a bit foggy to you still.
Okay. So I'm going to type Ctrl-L to clear my screen. And we're going to start doing some customization of Git within our EC2 instance. And so to configure things for Git, we need to use a command called git config. And so what types of things do we want to configure? Well, we want to tell Git who we are, what our email address is, what type of text editor we'd like to use, and things like that. So we're going to enter a few commands here. So I'll do git config --global. And this tells Git that these configuration settings are going to be global. So regardless of what repository we're in, these configuration settings are going to hold. So enter your name and your email address. This next one is a set of color options. So different things will be lit up in different colors.
This sets our text editor. We use a program called nano, which we've already used. That -w flag is helpful for working with Git. So this credential helper, cache with a timeout of 3600, is a tool that tells Git that we need to reenter our password for GitHub every 3600 seconds. So every hour. Again, this next one is a setting for how to push. And so those are the various configuration options we're going to set. And we can look at the settings we have by doing git config --list. And so this tells us what our configuration settings are. And again, because we used the --global flag when we were running git config, these are the options that are going to hold across all of the repositories on this Amazon instance.
Now, with these configuration settings, we can also create aliases. So we might have a set of commands that we frequently use. And we can create an alias so that we can take some complex Git command and make it easier to type out. So one that we'll use in a bit is called git diff. And there's a flag called --word-diff. So we could do git diff --word-diff. And frequently I can't remember if it's word-diff or diff-word. And so what I'd like to do is to create an alias that I'm going to call diff-word. So we're going to do git config --global alias.diff-word "diff --word-diff". So again, we'll see this in a bit. But as your Git commands perhaps get more complicated, or as you try to do different things, know that you can create these aliases. And also, in the notes for this tutorial, there's a link to online materials from Git and GitHub with some other examples of aliases that you might consider trying. As with all these things, sometimes it's more difficult to remember the alias than the actual command. So your mileage may vary on some of these things.
All right, so we're going to create a silly project to emphasize some concepts using version control.
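Before moving on, here is a sketch of what the configuration commands described above might look like when typed out. The name and email are placeholders, and the color.ui and push.default values are assumptions, since the exact values are not spelled out in the narration.

```shell
# Identity that Git records with every commit (placeholder values; use your own)
git config --global user.name "Your Name"
git config --global user.email "you@example.com"

# Color the output of commands such as git status and git diff
# ("auto" is an assumed value)
git config --global color.ui "auto"

# Write commit messages in nano; -w asks nano not to hard-wrap long lines
git config --global core.editor "nano -w"

# Cache GitHub credentials in memory for 3600 seconds (one hour)
git config --global credential.helper "cache --timeout=3600"

# How git push picks what to push ("simple" is an assumed value)
git config --global push.default simple

# Review the settings that are now in effect
git config --list

# Alias so that "git diff-word" runs "git diff --word-diff"
git config --global alias.diff-word "diff --word-diff"
```

Because every line uses --global, these settings are written to the configuration file in your home directory and apply to every repository on the machine.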
So we're in our home directory, and we know we're in the home directory because we have the tilde to the left of that dollar sign. But we can get to the home directory again by doing cd ~. I'm going to make a directory called silly_repository and then I'm going to cd in there. So I'm going to now create a new file called README.md. So right now there's nothing in here. I'll do nano README.md. So I'm going to grab some random text from a cool website called hipsum.co. So there's a set of text called lorem ipsum, which is Latin placeholder text. And so some smart people created a more modern version called Hipster Ipsum, which is artisanal filler text for your product. So yeah, so we're going to do hipster, neat, no Latin, and we'll let it fly. And so you can see that we now have some random hipster text. Waistcoat PBR&B tofu banjo. So I think it's random, so you'll probably get something a bit different than I got. But I'm going to highlight a paragraph here, copy it, come back to nano and paste it in. And at the top here I'll add a second level heading, README, and I'm going to save that, Ctrl-O, and then Ctrl-X to quit out. And if I ls, I now see my README.md file.
So now we'd like to create a repository. We want to put this file under version control. Like I said, it's silly, right? So what did we do before? Do you remember? Right, we can write git init followed by a period. So we run that and we see Initialized empty Git repository in /home/ubuntu/silly_repository/.git/. If we type ls, it doesn't look like anything's changed. But if we type ls -a, we now see there's a directory in there called .git. And as we've talked about in previous tutorials, if there's a period before a file name or before a directory name, that means the file or directory is going to be hidden unless we use this -a option with ls. So let's type ls .git and we see there's a whole bunch of directories and files in here.
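The steps so far can be sketched as the following commands. The sketch assumes a scratch directory in place of the home directory, spells the directory name as silly_repository, and writes a short placeholder README.md with printf instead of pasting text into nano.

```shell
# Scratch location standing in for the home directory (assumption for this demo)
cd "$(mktemp -d)"

# Make the project directory and move into it
mkdir silly_repository
cd silly_repository

# Create README.md with a second level heading and some placeholder text
printf '## README\n\nPlaceholder hipster text goes here.\n' > README.md
ls

# Put the current directory under version control
git init .

# The plain listing looks unchanged; -a reveals the hidden .git directory
ls
ls -a

# Peek inside .git (look, but don't touch)
ls .git
```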
And what's going on is that when we create a repository by doing git init followed by a period, we're taking the current directory as the root of our repository and putting it under version control. And so this .git, then, is a directory that contains all the information that's necessary to make it a repository. You do not need to look in this directory. The only time I ever look in this directory is when I'm teaching people how to use Git. So you really don't want to touch this directory. Don't delete the directory, don't modify anything in the directory. Know it's there, be appreciative that it's there, forget about it, don't worry about it. If you want to remove your project from version control, the easiest way to do that would be to delete your .git directory, but we don't want to do that.
So some concepts to keep in mind about Git. One is that it's text-based. You can put .docx files or PDFs or JPEGs under version control, but that's really not what it's designed to work with. It's far better for working with text files, with markdown files, or R scripts, things like that. Git will keep track of the changes, who made the changes, and when they were made. So once something is put, or, we'll say, committed into the repository, it's actually pretty hard to remove it. So this is a good thing, because it's really hard to screw up then. It's really hard to permanently delete anything. It's bad, because if you accidentally commit a very large file, or sensitive information, like, say, your password, or patient health information, it can be hard to remove. And so we're going to talk about ways in the coming slides to avoid accidentally committing things that you don't really want in your repository.
So let's check the status of our repository. And so we're going to learn now one of the commands that I use frequently. So I'm going to do Ctrl-L to clear my screen. And if we do git status, we see the status of our repository. There's a lot of jargon here, right?
So we're on the master branch. We've got our initial commit. We have untracked files. README.md is in red. It tells us that there's nothing added to commit, but that untracked files are present, and to use git add to track them. So that's a lot of jargon, and I don't expect you to understand it right now, but hopefully by the end of this tutorial, what's going on will make a lot more sense.
So let's break this down a little bit. What does that git status output mean? So we have this document, README.md. Where in this schematic do you think it currently resides? So we can think of this gray rectangle, .git, as being our repository. That's right. The file to the left here that's outside of the repository is our README file. And so right now our repository, the silly_repository repository, is empty, and .git is empty. And so this README.md file is being seen by Git, but it's not actually part of the repository until it's added to the repository. We did get instructions on how to add README.md to .git. Right? So it tells us to use git add and the file name to include it in what will be committed, and that it's an untracked file.
So let's go ahead and do this now. And so I can do git add README.md, and then I'm going to run git status again. And so now we see that README.md has changed from being in red to being in green, and it says it's a new file. It also tells us that if we want to unstage this, then we can run git rm --cached and then the name of the file. But we don't want to do that. And so committing the file now will allow us to add that file to the repository. So again, what we've done with that git add is we've taken README.md from the left, outside of .git. We ran git add, and that moved README.md into the staging area. Now, if we run git commit, that will put the file from the staging area into the repository. And so we can commit it by doing git commit -m, and then in quotes we're going to leave a note to ourselves.
So I'm going to say, Add a README that any hipster would be proud of. And so it gives us the message, Add a README that any hipster would be proud of. One file changed, five insertions, create mode, blah, blah, blah, a bunch of jargon. So you might have a different number here, this 255875a. This is a marker that allows Git to keep track of the commits. And so you're going to have a different number here than I do. The actual number is much longer, but it's just showing us here the first seven characters of this number. It's called a hash. It's a mix of numbers and letters. Great, so we have now completed our first commit. This is similar to what we had done when we were working on the paper airplane example, at the bottom of the screen, where we added in a message about adding instructions to fold a paper airplane. So we did that on the website. Here we're doing it on our Amazon instance. Good job.
So when I use Git, 90% of the time, I'm using git status, git add, or git commit, just like I've shown you now. So if you can handle these three commands, you'll be in great shape for 90%, if not more, of what you need to use Git for.
Right, so let's be serious about this. This Hipster Ipsum text is a bit silly. Let's use our text editor, nano, and edit the README file to make it a little bit more sensible and save the file. So what I'd like you then to do is to run git status and think about where we're going to be in that diagram. All right, so I made some changes. It's not really critical what the changes were, but hopefully you were able to modify some text. So if we want to check the status of our repository, what do we do now? Right, git status. So again, this tells us that we're on our branch master. We don't really know what that means yet, and that's fine. We have a modified file called README.md, right? We modified the file to make the text a little bit more meaningful. But it also tells us that our changes are not staged for commit.
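Here is a minimal sketch of that status, add, commit cycle, followed by an edit to the tracked file. It assumes a fresh scratch repository, a placeholder identity set only for that repository, and printf standing in for the nano edits.

```shell
# Fresh scratch repository (setup for this sketch only)
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # placeholder; skip if set globally
git config user.email "user@example.com"   # placeholder; skip if set globally
printf '## README\n\nPlaceholder hipster text.\n' > README.md

# README.md starts out as an untracked file
git status

# Move it into the staging area, then check again
git add README.md
git status

# Move it from the staging area into the repository, with a note to ourselves
git commit -m "Add a README that any hipster would be proud of"
git status

# Edit the tracked file; Git now reports it as modified but not staged
printf '## README\n\nSlightly more sensible text.\n' > README.md
git status
```

The hash printed by git commit will differ from run to run, since it depends on the content, the author, and the time of the commit.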
And so this is a little bit different than what we had up above when we created the file, right? It said untracked files up above, and now that we've committed it previously, after we made a change it says modified README.md. So no changes added to commit. We need to do git add to add the file and then commit it. But before we do that, what I'd like us to do is git diff README.md. And so what you see here is the README text, and it will tell you in red what was removed and in green what was added. And so you can compare these lines to see what changed, right? So I added an "and" between PBR&B and tofu, and I added "make some killer music", but that's really difficult to read. It's difficult to figure out what's going on in there. And so we can go back to that alias I made, which was git diff-word README.md. And so this alias then allows us to more easily see the changes between versions of our README file.
All right, so again git status tells us that we need to add the README file. So I'll do git add README.md, git status. It's been modified. It's ready to be committed. git commit -m "Make the README content a bit more serious". And so it says one file changed. There are five insertions, five deletions. Voila. And again, if I do git status: on branch master, nothing to commit, working directory clean.
So we haven't gotten very far in this project, but we can type git log, and this will output to us what changes have been made over the course of the project. What are the commit messages? And so we've made two commits. The bottom commit was the original commit, and the one above it is the second commit. So you might remember that when we had that commit message earlier, it started out 255875a. Again, this is the hash, as it's called, that corresponds to this commit.
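As a sketch of this part of the workflow, the commands below compare a modified README.md against the committed version line by line and word by word, commit the change, and then list the history. The scratch repository, the placeholder identity, and the exact README text are assumptions; the sketch spells out git diff --word-diff so that it does not depend on the diff-word alias being defined.

```shell
# Scratch repository with one commit (setup for this sketch only)
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
printf '## README\n\nWaistcoat PBR&B tofu banjo.\n' > README.md
git add README.md
git commit -q -m "Add a README that any hipster would be proud of"

# Edit the file (nano in the tutorial; printf here so the sketch runs unattended)
printf '## README\n\nWaistcoat PBR&B and tofu banjo make some killer music.\n' > README.md

# Line-by-line comparison: removed lines and added lines
git diff README.md

# Word-by-word comparison, which is easier to read for prose
git diff --word-diff README.md

# Stage and commit the change, then confirm there is nothing left to commit
git add README.md
git commit -m "Make the README content a bit more serious"
git status

# List the commits, most recent first
git log
```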
You're going to have a different commit hash, but this is useful because it's a unique identifier for this commit. And so you can see that it's got my name and email address because I added those. It's got today's date as I'm recording this, and it's got my commit messages. One area of customization that people frequently like to work with is how this output is shown, and so you could do git log --help. And this then shows you the manual page for git log. And so, again, there's a whole bunch of different things in here, and if you Google around, you might find a lot of different options for how people change the coloring or how they change the output. But I'm going to stick with this because it's fairly easy, and the projects we're going to be working on are pretty straightforward and don't need a whole lot of sophistication in our log message format.
So git diff is a very powerful tool that allows you to see how your files have changed over time. So take a look at these and figure out what the following versions of the command are doing. For this last one, again, this is the hash of the commit. git diff README.md. Nothing's changed, because if I do git status, there's nothing to commit. What git diff README.md will show you is how your current version of README.md compares to the most recent version in the repository. Similarly, if we do git diff HEAD README.md, nothing changes. So HEAD is the top of the repository, and so the output there is the same as git diff README.md. If I do git diff HEAD^1 README.md, this then compares what I've currently got in my working directory to one commit prior. Another approach that we could take is that we could copy this commit hash, and again, it will compare it. So we've only got two commits in our repository history, so that is the same as HEAD^1.
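To make those variants concrete, here is a sketch that runs each form of git diff against a scratch repository with two commits. The repository contents and identity are placeholders. The sketch writes HEAD~1, which names the same commit as HEAD^1, and looks the hashes up with git rev-parse rather than copying them from git log by hand.

```shell
# Scratch repository with two commits (setup for this sketch only)
cd "$(mktemp -d)"
git init -q .
git config user.name "Example User"        # placeholder identity
git config user.email "user@example.com"   # placeholder identity
printf '## README\n\nFirst version.\n' > README.md
git add README.md
git commit -q -m "Add a README"
printf '## README\n\nSecond version.\n' > README.md
git add README.md
git commit -q -m "Revise the README"

# Working copy versus what is staged or last committed: no output, nothing changed
git diff README.md

# Working copy versus the most recent commit: also no output
git diff HEAD README.md

# Working copy versus one commit back (HEAD^1 would name the same commit)
git diff HEAD~1 README.md

# The same comparison using the full hash of that earlier commit
previous=$(git rev-parse HEAD~1)
git diff "$previous" README.md

# ...or using only the first few characters of the hash
short=$(git rev-parse --short HEAD~1)
git diff "$short" README.md
```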
But you could imagine if you've got 20 or 30 commits, you might not want to count all those. You might want to grab the commit hash. You might also not want to grab the full commit hash. So you could do git diff, and then if I come up here, I might just grab the first six or seven characters of that hash. So git diff, the hash, README.md, and I get the same thing. So as our project gets more mature and we have more commits, using HEAD~ and then a number, like HEAD~3, will get you a comparison to that many commits ago in the repository. The caret form only lines up with that for one step back, because HEAD^2 means the second parent of a merge commit rather than two commits ago. Alternatively, you can use the full hash string or the first characters of the hash string to compare what you currently have in your directory to what was in the repository at that commit.
So a brief word about commit messages. This links to a very long and detailed article about commit messages. Some of the points in this blog post are beyond where we're currently at with the tutorial, but several are helpful now. Separate the subject from the body with a blank line. I'll talk about what that means in a bit. Limit the subject line to 50 characters, so when we do that git commit -m, try to keep what you put in the quotes short. Shorter than a tweet. 50 characters, something straight and direct. Capitalize the subject line, so the first character of that commit message should be capitalized. Don't end the subject line with a period. It uses extra space. Use the imperative mood in the subject line. So be imperative. Edit text to make more clear what happened. Solve bug that caused error in colors of plot. So write it like a command, and be kind of emphatic about what was going on. If you're writing longer commit messages, wrap the body at 72 characters and use the body to explain what and why versus how. And keep it PG. People have done analyses of commit messages and found a large number of F-bombs in there. And also do your best to keep your commit messages as meaningful as possible.
So on the right here, you have a great XKCD comic where somebody starts off great: created main loop and timing control, enabled config file parsing, blah, blah, blah. Then miscellaneous bug fixes, code additions/edits, more code, here have code. And it just kind of devolves. I think we've all been through this where we think we have a bug solved, but we don't. And so we keep recommitting and recommitting and eventually we're just sick of writing commit messages.
So let me show you what a more sophisticated commit message might look like. If we reopen our README, I'll add some text: but I have their logo on a hexagon distillery copper mug. Again, it doesn't matter what you type. Just type something. Git status. And so we've got this modified README. I'm going to go ahead and do git add README.md. Now I want to do git commit. But I want something longer than that -m commit message, so I'm going to type git commit on its own. This will then open up an editor, Nano, for the commit message. So for the subject I will write: add information about band mug, or rather, band swag. And then we want to use the body to explain the what and the why versus the how. Okay, so this is a short body, but it could be quite a bit longer. A couple of things to note: we have a short, declarative, imperative subject line, and then we have a body. And it's the body because we put a blank line between the subject and the body. But one thing that we didn't quite do right is that we have a run-on line here. This is relevant because of how it will be rendered when we do a git log, or when we look at these commit messages on GitHub. So I'm going to add some line breaks, and that's great. Now I can do Ctrl-O, Ctrl-X, and that takes me out. If I do git status: on branch master, nothing to commit. If I do git log now, I'll see the subject line as well as the body message below that.
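If you'd rather not go through an editor, the same subject, blank line, body structure can be fed to git commit on standard input. This is a sketch in a throwaway repository; the message text and the "Demo User" identity are made up for illustration.

```shell
set -eu

demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"  # placeholder address

echo "Some placeholder text" > README.md
git add README.md

# -F - reads the message from standard input.
# Subject line, a blank line, then a body wrapped before 72 characters.
git commit --quiet -F - <<'EOF'
Add information about band swag

The README now mentions the mug so that readers know what merchandise
is available. This body explains what changed and why, not how.
EOF

# %s prints only the subject line; %b prints only the body.
git log -1 --format=%s
git log -1 --format=%b
```

The blank line is what lets Git, and GitHub, tell the subject apart from the body.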
And so, again, by having this more detailed commit message, I can get a better sense of the what and the why behind what happened in this commit. Sometimes you might be doing a lot of things in a commit, and so you might want to put in better annotation of what was going on between that and the previous commit.
All right, so now it's your turn to do some heavy lifting. I'd like you to get a hold of the MIT license, copy it into the silly repository, and use it to create a file called LICENSE.md. Then I want you to edit the license to add the year and your full name, so that it's you that the license is for. Add it to the repository using our commit workflow. Then I want you to lean on some of the stuff we did in the last tutorial to create a GitHub repository and push your silly repository to GitHub. Then, on the GitHub version, add your PI's name to the license. Back in the terminal, type git pull and see what happens. This might take you 15 or 20 minutes or so, so go ahead and pause the video. And when we come back, I'll quickly show you what I did.
So hopefully that exercise wasn't too hard. When you ran git pull, what you should notice is that you got your LICENSE.md down here at the bottom, where there was one addition and one subtraction. So we see one file changed, one insertion, one deletion. And that, no doubt, was the change in the license. If we do nano LICENSE.md, we now see that on our Amazon instance, at least in my case, Mary Smith, who is a fictitious PI, has been added to my license. So again, we can see that by pushing, as we did up here using the commands we copied from GitHub, we can send our repository up to GitHub, and then we can use git pull to pull changes back down. Somebody might be working on our content on GitHub, and we can then pull those changes back down to our local repository.
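You can rehearse that push and pull round trip without touching GitHub at all. In this sketch a local bare repository stands in for GitHub and a second clone stands in for the edit made in the browser; both of those stand-ins, the directory names, and the license text are assumptions for illustration. Mary Smith is the fictitious PI from above.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# A bare repository plays the role of GitHub (assumption: no network needed).
git init --quiet --bare remote.git

# "local" plays the role of the repository on the Amazon instance.
git clone --quiet remote.git local
cd local
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "Copyright (c) YEAR Your Name" > LICENSE.md
git add LICENSE.md
git commit --quiet -m "Add MIT license"
git push --quiet -u origin HEAD           # -u records where to pull from later

# "web_edit" plays the role of editing the file on the GitHub site.
cd "$work"
git clone --quiet remote.git web_edit
cd web_edit
git config user.name "Demo User"
git config user.email "demo@example.com"
echo "Copyright (c) YEAR Your Name, Mary Smith" > LICENSE.md
git commit --quiet -am "Add PI to license"
git push --quiet origin HEAD

# Back in the original clone, pull the change down and look at the file.
cd "$work/local"
git pull --quiet
cat LICENSE.md
```

After the pull, the local copy of LICENSE.md carries the PI's name even though that edit was never typed in the local directory.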
So 90 to 95% of the time, I'm not going to be making changes on the GitHub site, and so I'm not so worried about pulling content down. But if I'm working with a trainee in my lab, and they are making changes to their repository, then before I do anything, I'm going to do a git pull to pull down the freshest content from their repository. And again, when I'm done making commits, I can then push those changes back up to the repository.
One of the things I like to avoid using in my commit messages is the word "and". I want my commits to be about one thing. Now, don't get me wrong, I don't want a commit to be about changing line 3 of paragraph 2. I want it to be, perhaps, "I edited the abstract of the paper", or "Edit abstract of paper" to keep it declarative. I mean, this is not a big deal, but it's also a way to keep our changes what we might call atomic, where they really represent one thing. Then if we want to go back and ask, where did I introduce the changes to the abstract? Bam, they're right there in the commit whose message says "Add edits to abstract". And as we already talked about, we can write bigger commit messages, with a longer explanation in the body, by doing git commit without the -m.
I like to make a commit before I'm about to do anything risky, so that I have a fallback position. I liken this to when I run Word with EndNote. EndNote always seems to crash Word for me, so I always make sure I save in Word before I build a bibliography with EndNote. Similarly, if I'm about to make a big change to a plot I'm working on, I'll go ahead and commit it first, because as we'll see in the future, we can always fall back: we can get rid of all the changes we've made since the last commit, or undo the last commit.
So how often should you push? You can push your commits as often as you want. I try to do it at the end of each work session, so that I have a backup.
So remember to pull if you've changed something on GitHub, which we'll be calling the remote, before you push again. In future tutorials, we'll talk about other ways to deal with this that are a little bit safer, so that multiple people aren't pushing commits to the same repository at once.
So say we're trying to clean up our home directory, and we do this. Now what do we do? Well, this is not the end of the world. So let's go ahead and do it. I'm going to do cd ~, back to the home directory. If I do rm -rf on the silly repository directory and then ls, it's gone. How do I get it back? Well, this is also one of the things that's great about version control. If I hit refresh, it's still here. It's still on GitHub. I still have a backup. There's a green button here, Clone or download. If I click that, I can click on this copy to clipboard, and then down here, I can do git clone, and then that address to the repository. If I hit enter and type ls, wow, it's there, the silly repository. If I cd into the silly repository and do ls, everything's still there. And if I do nano LICENSE.md, I see my name and my PI's name are here at the top. So one of the benefits of GitHub is also its ability to provide a backup of our repository. Wonderful.
So we've really learned a lot about Git and GitHub so far, using this admittedly pretty simple, or silly, repository. So pat yourself on the back. That's great. And again, everything we've done is 90 to 95% of what I use Git for. Let's return to the Kozich reanalysis now. To do that, we can do cd ~ to get back home, do cd Kozich_ReAnalysis_AEM_2013, enter, and ls. Let's see what's in the directory, since I already forget what we did in the previous tutorial. If I do git status, this shows me that we have four files; at least the way I did it, I had four README files where I put information about getting the code, the raw data, and the reference data. And now I also have a directory, code/mothur.
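The delete-and-recover step we just did with the silly repository can also be rehearsed offline. As a sketch, a local bare repository stands in for GitHub here; that stand-in, the directory name silly_repository, and the file contents are assumptions for illustration.

```shell
set -eu

work="$(mktemp -d)"
cd "$work"

# A bare repository plays the role of the GitHub backup.
git init --quiet --bare remote.git
git clone --quiet remote.git silly_repository
cd silly_repository
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"
echo "Some text" > README.md
git add README.md
git commit --quiet -m "Add README"
git push --quiet -u origin HEAD

# Simulate the accident: remove the whole working copy.
cd "$work"
rm -rf silly_repository

# Recover everything that had been pushed.
git clone --quiet remote.git silly_repository
ls silly_repository
```

Only what was pushed comes back, which is one more reason to push at the end of each work session.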
So I don't want the files from mothur included in my repository because they tend to be large and they're not going to change. What I'd like to do is go ahead and stage these README changes and then commit them. And so we could do git add README.md code/README.md data/raw/README.md data/references/README.md. Now, I wrote these all out longhand and I would encourage you to do the same. People run into big problems when they get a little aggressive in how they add files to their repository. This is where people get into trouble with committing password files or patient health information or gigantic data files. So I like to be explicit as I'm entering the files that I want to commit to my repository. I'll hit enter. And if I do git status, I now see that those four README files are green. They're marked as modified, while my code/mothur directory is not being tracked. So I'll go ahead and do a git commit, and I'm going to leave off the -m. For the subject I'll put: Insert code into readme files. Then a blank line. Then the body: Obtained mothur executable, raw fastq.gz files, and SILVA and RDP reference files. These are the raw materials that I will be using for my reanalysis project. And I'll again save this. Ctrl-O, enter. Ctrl-X brings me back out. Again, git status. I now see that everything's up to date except for code/mothur. If I do git log, I see that I have my initial commit as well as this new commit. One thing you might notice is that the author of our initial commit was ubuntu, which is the login name for this instance. That's why it was useful for us to go ahead and do that git config --global to put in our name and email address.
So we could keep going with this project, and as we do, we could just ignore that code/mothur line at the bottom every time we go through. But that would get pretty tedious, and more than likely I'd accidentally add it to my repository at some point. What we could do instead is ask Git to ignore it for us.
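Here's a small sketch of why I like naming files explicitly when I add them. The directory layout mimics the project, but the file contents and the stand-in for the mothur executable are made up for illustration.

```shell
set -eu

demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# Recreate a miniature version of the project layout.
mkdir -p code/mothur data/raw data/references
for readme in README.md code/README.md data/raw/README.md data/references/README.md; do
  echo "Notes on how these files were obtained" > "$readme"
done
echo "stand-in for a large executable" > code/mothur/mothur

# Stage only the files named on the command line.
git add README.md code/README.md data/raw/README.md data/references/README.md

# The short status lists the four staged READMEs and leaves code/mothur/ untracked.
git status --short
```

Because nothing but the four named files was staged, the large stand-in file can't slip into the commit by accident.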
And perhaps you've noticed that we actually have a lot more files in here that, for some reason, Git is not seeing. If I do ls data/raw, whoa, there's a whole bunch of files there. Why isn't Git seeing those? Well, with the project template that I gave you, by default, we told Git to ignore certain files. One of the special files that's really useful here is called .gitignore. So we can do nano .gitignore. This is a file where every line names a different type of file or directory that Git is going to ignore. As you look at this, you might see, oh, this line here has data/*/*. So Git is going to ignore anything that is in the subdirectories of my data directory. But below that, there's a line that has data/process/*, with an exclamation point at the beginning of the line. The exclamation point tells Git not to ignore this directory. And as you look through, you'll find other things that Git is going to be ignoring. It will ignore your Word files. It will ignore temp files. It will ignore your mothur log files. And the other thing is that Git is being told, do not ignore README files. So our raw data files are in data/raw. Our reference files are in data/references. And from this data/*/* line, Git is being told to ignore the data/raw and the data/references files.
So what we'd like to do is tell Git to ignore code/mothur. How would we do that? Well, right here we could type code/mothur/*. Again, we can do Ctrl-O, Ctrl-X. Now I can do git status. And I see it's not seeing code/mothur, but it sees the change I made to .gitignore. So I'd like to go ahead and commit that. I could do git add .gitignore, git status. So it's been modified and it's ready to be committed. Git commit -m "Ignore mothur executable files". Now if I do git status, everything is good.
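If you're ever unsure whether a path will be ignored, git check-ignore will tell you. The .gitignore below is my own approximation of the patterns described above, not a copy of the template's file, and the file names are made up for illustration.

```shell
set -eu

demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet

# An approximation of the patterns discussed above (assumed, not the real template).
cat > .gitignore <<'EOF'
data/*/*
!data/process/*
!README.md
code/mothur/*
*.docx
*.logfile
EOF

# Create a few files to ask about.
mkdir -p data/raw data/process code/mothur submission figures
touch data/raw/sample.fastq.gz data/raw/README.md data/process/output.txt
touch code/mothur/mothur submission/paper.docx figures/figure1.png

# check-ignore -q exits with 0 when the path is ignored and 1 when it is not.
for path in data/raw/sample.fastq.gz data/raw/README.md data/process/output.txt \
            code/mothur/mothur submission/paper.docx figures/figure1.png; do
  if git check-ignore -q "$path"; then
    echo "ignored:     $path"
  else
    echo "not ignored: $path"
  fi
done
```

The order of the lines matters: a later line with an exclamation point overrides an earlier line that would have ignored the same file.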
So to build upon that .gitignore file, I'd like you to take a look at the file with nano .gitignore. Given what's in your .gitignore file, what I'd like you to do now is take some time, go through these seven bullet points, and ask yourself whether or not Git will ignore these directories and files. So with this first one, data/process, this will not be ignored because there's a line in the .gitignore file that has the exclamation point followed by data/process. Will it ignore data/raw/README? No, it will not, because the exclamation point README line tells Git to pay attention to any README files. Will it ignore figures/figure1.png? Nope, because there's nothing in there about figures as a directory or png files or anything else. Will it ignore a mothur log file? Yes, because it will ignore those log files, as it says in the .gitignore file. Will it ignore the mice filter file under data/mothur? Yes, because we're telling it to ignore anything under the data directory. Will it ignore submission/my_awesome_paper.docx? Yes, because it's got the docx extension and we're telling Git to ignore anything with docx. Will it ignore README.md? Nope, it will track it because this is a README file and our .gitignore file is pretty explicit about following README files.
So let's return to our repository. We'll quit out of Nano if you're in there. And again, let's do git status. We see that our branch is ahead of origin/master by two commits. Origin/master is what's up at GitHub. So, as it says, to publish our local commits we can do git push. And those are all up at GitHub now. If I come back to GitHub and look at my pschloss Kozich reanalysis repository, I see that my .gitignore file has the commit message "Ignore mothur executable files". I see that my README has the commit message "Insert code into readme files". And I can also click this link for three commits to see the history of my various commits.
And you'll see that we've got this short declarative sentence. If we hover, we see "Obtained mothur executable, raw fastq.gz files, and SILVA", blah, blah, blah. If I click on that, it'll expand it. So again, GitHub makes it nice and easy to see our code history.
As we talked about earlier, when I do a git add I want to be explicit about the names of the files I'm adding. People get into huge problems when they try to cut corners and use things like wildcards, or find tricks to commit all of the untracked files in their repository. These cause big problems. I would really discourage you from using wildcards or finding ways to add all of the changes at once. It's worth it for peace of mind to be careful and to be defensive about using Git. Again, what I'm showing you as we go through this, doing git status, git add, git status, git commit, git status, that is my workflow. That is how I do it. It seems like a lot, and perhaps you could cut out some of those git status steps, but it's really helpful and gives me great peace of mind about exactly what I'm adding and what I'm committing as I go through.
What if we accidentally add something to our repository? Let's say I'm going to create a file. To create it I'm going to do nano code/figure1.R. I'm going to add in some R code here. This is some R code that is pretty generic. It will generate 100 random numbers between 0 and 1 and then generate a histogram. I'm going to go ahead and save this, and I'm going to do git status, git add code/figure1.R. I'll do git status, then git commit -m "Add histogram". I'm on the master branch, it's ahead by one commit, and I can push to publish my local commits. One of the things about pushing is that you want to be really sure that you want to push that change.
If for some reason I went back and decided I don't want figure1.R to be part of my repository, then once it's been pushed, once it's been publicized, it gets that much harder to remove it from the commit history. Say I've had second thoughts and I'd like to remove that figure1.R file. How do I remove that file? Something you might be tempted to do is rm code/figure1.R. That will work, but that will disrupt the history. What's better is to do git rm code/figure1.R. If we run that and now do git status, we see that code/figure1.R has been deleted. It tells us that the deletion is ready to be committed. We can do git commit -m "Remove figure one code". If we do git status and then ls code, we see that we don't have figure1.R. We've got other code in there, but we don't have figure1.R.
Let's go ahead and create another R file. We'll do nano code/myscript.R and we'll again do x <- runif(100) and hist(x). Git status, git add code/myscript.R, git status, git commit -m "Add new histogram", git status. Excellent. Now my PI is looking at my code, looking at my repository, and the PI says, Pat, code/myscript.R is not a very descriptive name for your script file. Why don't you give it a better name? So I'd like to change the name of that file. Again, we could do mv code/myscript.R code/generate_figure4.R. But if we do that mv, then we're going to disrupt the history that's being tracked in Git. So instead we can do git mv code/myscript.R code/generate_figure4.R. If we run that and do git status, we now see that the file has been renamed and that the rename is ready to be committed. So we can do git commit -m "Give histogram figure code a better file name". The commit output shows the commit message as well as the fact that it renamed the code. And we can then do git status.
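Here's a sketch of git mv and git rm in a throwaway repository. The file names follow the ones we just used, written with underscores as my best guess at the spelling, and the "Demo User" identity is a placeholder.

```shell
set -eu

demo_dir="$(mktemp -d)"
cd "$demo_dir"
git init --quiet
git config user.name "Demo User"          # placeholder identity
git config user.email "demo@example.com"

mkdir code
echo 'x <- runif(100); hist(x)' > code/myscript.R
git add code/myscript.R
git commit --quiet -m "Add new histogram"

# git mv renames the file and stages the rename in one step.
git mv code/myscript.R code/generate_figure4.R
git status --short
git commit --quiet -m "Give histogram figure code a better file name"

# git rm deletes the file and stages the deletion in one step.
git rm --quiet code/generate_figure4.R
git status --short
git commit --quiet -m "Remove figure code"

# The history keeps all three steps.
git log --format=%s
```

In both cases the change is already staged after the git command, so the next step is simply git commit.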
So we've done a lot today with Git, practicing that workflow of using git status, git add, git commit, git push, git pull. We also talked about git init, git diff, and git log. So there's a lot in there, but like I said, the commands I use the most are git status, git add, git commit, git log, and git diff. Those will serve 90% of your needs. What I'd like you to do now is work through a series of problems. There are eight bullet points here that ask you to do different tasks using Git. I don't want you to type these out at your command line; instead, jot them down on a piece of paper or in a text file and think about how you would do each of these using Git. You can then press p to see the answers. It's important to remember that we need to go back to our console and quit out of EC2. So I'll do exit, exit, and I'll go to the EC2 management console, Instance State, Stop. Yes, stop.
Well, I really hope that you took the time to pause the video and play with the Hipster Ipsum text and our own Kozich reanalysis pipeline. It's great that we're starting this project with version control, but you know what? It's never too late to start using version control. If you have a project that you're midway through, or even a project that you're almost done with, go to the root of that project and type git init. If nothing else, it will give you a backup of your work to this point in the project. As we continue through our analysis, we'll be using version control to document all of the changes that we make. Hopefully you noticed that we aren't tracking the large files with our repository. There are options for storing large files with GitHub, but I find these to be a bit tricky, and even with the extended storage options we'd quickly outstrip their capacity. So say I were to accidentally delete my AWS instance or just the directory. What would it take to recover? I don't want to say that it would be easy, but it wouldn't be the end of the world.
I could clone my repository from github into a new directory a lot like we did with the silly repository example. That would get me all of the code and the smaller files but what about those big files? Hopefully my readme files are clear enough that I could get going again. In the next tutorial we'll see how we can automate much of that process. Actually that nightmare situation is really a good motivation for thinking about our own reproducibility. If I were to accidentally delete my project directory, how much work would it take to recover the project? I'll leave you to think about that until we meet again in the next tutorial.