I guess I'll go ahead and get started with some of the introductory content first. Thank you all for joining us today for this workshop on designing a reproducible workflow using R, Git, and GitHub. My name is Sophia Lafferty-Hess. I'm a research data management consultant within the Duke University Libraries. A lot of my role focuses on talking to researchers about their organization practices, documentation, metadata, and what tools to potentially use while designing their workflow, with the potential goal of sharing and archiving their data at the conclusion of the project to enable things like reproducibility and reuse and to foster further innovations. As part of this work, I'm also a member of the curation team that supports the Duke Research Data Repository, which I'll be talking a little bit about at the end of this workshop. And now I'll let my colleague John Little introduce himself and his role here at Duke. Hi, I'm John Little. I'm the data science librarian, also in the same Center for Data and Visualization Sciences with Sophia. I teach a lot of our workshops and generally consult with people one-on-one to help them navigate their syntax issues. We're excited to be here today; this should be a great workshop. Thanks, John. To go into a little bit more detail on the Center for Data and Visualization Sciences, which John mentioned we both work within, we call it CDVS for short. We have staff that can support multiple different aspects of data-driven research. As we saw, we have staff with expertise in data science techniques; that's John. We also have staff with expertise in GIS and mapping, data visualization, and data management. A big part of our services is providing consultations. We are available either over Zoom or in person. You can reach us at askdata@duke.edu to send us your data question and we'll see if we can help.
If we're not the right people, we can also try to help connect you with others on campus. We are, as I mentioned, back in person most afternoons for a couple of hours of walk-in hours within the Brandaleone Lab, which, if you haven't been to campus in a while, is on the first floor of Bostock Library within The Edge. That lab is also available if you're looking for certain types of statistical, visualization, or GIS mapping software, so it can help with your computational needs. Another big part of our service profile is providing workshops like today's. This is a selection of workshops that we've provided over the last year or so, so you can see the scope of some of the things that we discuss. I will say that this is our last workshop of the season, so that's exciting, but we plan on publishing our spring 2022 workshop list around mid-December, so you can visit our workshop page to learn more about what we're going to have coming up next semester. All right, to go ahead and jump into our topic today, let me frame this with a few learning objectives. At the beginning, we're going to help us all build a shared understanding of what reproducibility is and its impact on research and scholarly inquiry. We'll then really dig into identifying a few specific practices that can be implemented in your research projects to enhance reproducibility, and then in the bulk of the workshop, we'll be presenting a potential end-to-end reproducible workflow using some data management, coding, and dissemination tools. So, to start off with a few definitions and framing, so that we can build that shared understanding of the terms that we're using: many of you have heard this term reproducibility, as well as replicability. Some may use them synonymously, or in some disciplines they are used in slightly different ways.
So I just want to present you with a couple of definitions to make sure we're all on the same page as we move forward with these concepts. These definitions come from the National Academies of Sciences, Engineering, and Medicine, where they say that reproducibility is obtaining computational results using the same input data, computational steps, methods, code, and conditions of analysis, whereas replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. Essentially, when we think about reproducibility versus replicability, reproducibility is when you have those same procedures and those same methods, but you're using the same data and the same code that the original study collected. This is where John publishes an article where he has his findings, his figures, his tables, his graphs, and all of those good things, and then he makes his data and code available so that I could use that data and rerun his code to see if I could get those same results. Whereas replicability is when you're again using the same procedures, methods, and analysis techniques, but you're going out and collecting your own data to see if you can reach the same conclusions that another researcher has related to their hypothesis or research question. And you may be generating your own code, or you may be using shared code that's being provided to the community. I want to pause here to make sure that there are no questions or comments about this differentiation that I'm making. Today we're going to be focusing a lot on computational reproducibility, but the things that we're going to be talking about can also lead to making science more replicable. If you have questions (I should have said this at the beginning, and I apologize), please feel free to enter them into the chat at any time. During this first portion, it's going to be a little bit more lecturing.
So if you have questions or comments, please do contribute there. During the hands-on portion, you can also add your questions to the chat, but I would note that we're not really going to have the ability to stop and troubleshoot. So if specific issues come up, we would encourage you to just follow along, and then we can set up follow-up consultations to work through any configuration issues you might have. Again, I'm not seeing any questions in the chat, so I'll go ahead and move on. All right, so what? Why are we here today talking about reproducibility? Why should we even care about this concept? Well, one of the reasons is that a lot of people are talking about it and writing about it, and there have been studies that have found that science isn't necessarily as reproducible as would be ideal. Across multiple different disciplines, there have been attempts at reproducing and replicating others' work, and we have found that it can be a challenge at times, which has led to this idea, and term, that we're in a reproducibility crisis. This is from a survey of 1,500 scientists where Nature asked, do you think there's a reproducibility crisis, and you can see a significant portion saying either a significant or a slight crisis. They also asked, have you actually failed to reproduce an experiment, either your own or someone else's? And you can again see pretty high percentages there around failure to reproduce results. Another interesting article that came out last year was a challenge put forth by a neuroscientist asking other scientists to go back and try to rerun some of their code that was ten years old, to see if they could still produce those results. Suffice it to say there were challenges, particularly around things like missing documentation.
So the scientists really had to spend a lot of time trying to understand, well, what does this chunk of code even do? Or, what was my process here? Because they didn't have really comprehensive documentation. Likewise, they struggled with obsolete computing environments. As we know, ten years in the scheme of things, when we think about digital information, is actually a relatively long time. So in some cases they got a little creative about recreating that environment. If you're interested in this topic, I do think it's a good read, and it really highlights not only the challenges of reproducibility today, but, as we think about reproducibility over time, some of those other things that we'll have to consider if we really want our science to be reproducible not just for the next year or two, but 10, 15, 20 years from now. And as we know, some studies are building upon those ideas and really have that scope in time. But as we think about this as a crisis, I like this idea some folks have put forth that we should really reframe it: it's not necessarily a crisis, but we could see it as an opportunity, especially as we think about reproducibility and the trajectory of science over time. We're always progressing. This one quote points out that we didn't even know about antibiotics 80 years ago. So what can we do in the next 80 years to really develop procedures, develop norms, and a culture that supports making our research more reproducible and more replicable? And how can we do that as a community without feeling like it's a crisis that we need to address with panic, but as something we can do calmly and through thoughtful planning? But I will say it does take some extra effort.
But in the end, the biggest beneficiary is going to be you yourself, because you'll be able to go back to your own research and know what you did, know why you did it, and then be able to build upon those ideas. So while there are obviously other beneficiaries, when we think about the public good and how this serves the scientific endeavor more broadly speaking, really the biggest beneficiary is going to be you. We can also think about reproducibility on a spectrum. It's not necessarily an all-or-nothing thing; there are small things that you can build into your workflow to make your research more reproducible. As we look at the spectrum, in many ways the more traditional model of scholarly communications, where we present our results through a journal publication, is not reproducible at all. We can build on that to make our research more reproducible, first by making our code available, through tools like GitHub or other social coding tools. Next would be making our code and our data available. The next step is to make that data and code not only available, but available in an environment where it's linked and executable within the cloud. And then finally, that gold standard is full replication, where we're actually collecting new data to test that hypothesis. All right, so now we're going to jump into some of those practices that enable more reproducible research. This quote comes from a book on the practice of reproducible research; it is openly available, and it gives both an overview of what reproducibility looks like in practice as well as some nice case studies showing how it's actually been implemented by researchers in different disciplines.
So they said, at that beginning level, one of the most foundational things for building a more reproducible workflow is having a good organizational system, where you just have a clear directory structure so you know where your data are and where your code files are, and during this workshop we'll give you an overview of what that can look like and what's usually suggested for reproducible research. Next, we have documentation. You'll want to make sure you're creating metadata that actually describes your files, their purpose, and what's found within your folder structures, as well as documenting any manual steps in your process. And I'm going to say a little bit more about documentation; I think it's even broader than just this. Then the next one is automation: building into your workflow things that allow you to perform each of your steps automatically. In many ways, doing manual point-and-clicks within a user interface is never going to be very reproducible. Building out from that automation is thinking about your overall workflow design, how you conceptualize the connection between the tools that you're using, and that's, again, one of the things we're going to be presenting today: a potential workflow design. And then finally there is dissemination: making sure you're providing that information, those notebooks, that documentation, and your data and your code to the broader community, with enough detail about your process that they could actually go back and rerun your code on your data. We'll talk more about all of this in practice. So, to present a high-level, conceptual view of the workflow we're going to be walking through today: your organization and a lot of your automation would be implemented using two key pairs of tools, Git and GitHub, and R and RStudio.
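As a concrete illustration, a project layout along the lines usually suggested for reproducible research can be sketched from the shell like this; the folder and file names here are just one common convention, not something prescribed by the workshop:

```shell
cd "$(mktemp -d)"                      # work in a scratch directory for this sketch
mkdir -p my-project/data/raw           # raw data: treated as read-only
mkdir -p my-project/data/processed     # data produced by scripts, never edited by hand
mkdir -p my-project/R                  # analysis code lives here
mkdir -p my-project/output/figures     # generated tables and figures
touch my-project/README.md             # top-level description of the whole project
touch my-project/R/01-clean.R          # numbered scripts make the run order explicit
touch my-project/R/02-analyze.R
ls -R my-project                       # show the resulting structure
```

The key idea is simply that anyone opening the folder can tell at a glance where the data, code, and outputs live.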
This is where all your files would be stored, as well as where you'd use those coding and scripting files for automation. For documentation, I would say that in some ways your code is part of your documentation, right? But you'll also want to make sure you're producing some of those other kinds of contextual documents to record things about your process and your files. Oftentimes we'll see this as README files, a high-level document explaining the reproducible compendium or data package. But I'd also encourage you to think about your documentation at two key levels. One is documenting fully the content of the data themselves. Oftentimes data are not self-describing: if you have a tabular dataset and you don't define your values or your variables, it's relatively useless. So you'll want to make sure you have really clear definitions for all of your variables and values, how you handled missing data, what your units of measurement are, and whether there are any abbreviations. Be really explicit about that, because otherwise you can't understand the data themselves and you can't trust the results from your code. The second level is the context of the research. This is your study level: What's the source of your data? Do you have a clear data citation for any secondary data? What are your methods? What are your processing steps? What's the relationship between your files? Documenting your software versions and the packages used: all of that contextual detail about your actual procedures and methods for your research study. And then finally, as we look at dissemination, we're going to be using a couple of tools: Zenodo, which is a repository specifically built for archiving code as well as other types of materials, and Binder, which is a containerization tool. That will be a way that you can actually disseminate your materials, data, and code to the broader community.
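To make those two documentation levels concrete, here is a minimal README skeleton written out from the shell; the headings and example codes are illustrative assumptions, not a required template:

```shell
cd "$(mktemp -d)"                      # scratch directory for the sketch
cat > README.md <<'EOF'
# Project title

## Study context
- Source of the data (with a data citation for any secondary data)
- Methods and processing steps
- Relationship between the files
- Software versions and packages used

## Data dictionary
- variable_name: definition, units of measurement
- Missing data codes (e.g., NA, -99) and any abbreviations used
EOF
wc -l README.md                        # the skeleton is now in place
```

Even a short file like this, kept at the top of the repository, covers both the study-level context and the content of the data themselves.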
Again, we're going to show all of this in practice with examples, but this is a high-level view of what we're going to be looking at today. All right, I think that is my section, so I'm going to stop sharing. If you have any questions at this point, or any comments about the concepts around reproducibility, we welcome them now. So again, I'm going to stop sharing and let John take over the show. I'm going to talk to you about some practical aspects of how to implement the concepts that Sophia just talked about. You will get a link to that slide deck, and it has links to more specifics about particular techniques for implementing a lot of what we're going to talk about now. And so this is going to be the very hands-on thing. I'm going to start by just talking about version control, which fits into the organizational aspect on this slide. We're going to distinguish between Git and GitHub, but we're going to actually use both. In order to do that, I'm going to present Alice Bartlett's slide deck called Git for Humans. Alice did a great job of creating some really smart slides on what Git is, so I want to attribute her work; obviously, she shared it to be attributed, and we can be thankful that Alice is such a good designer. Let me set this up. I don't know Alice personally, I don't know any of the people mentioned in this slide deck, and I never saw this presentation; I just saw the slides. Apparently she developed this slide deck to present at a conference for web designers and web developers. She is a web developer at the Financial Times, or at least she was at the point these slides were produced. They're a little bit old, about five years, but the information is, by and large, still very good. So that's the setup of who she is.
And she presents this slide. I'm not sure why, but I want to emphasize it because it makes sense; she designed these slides. She says things are just better when designers are involved, and she certainly makes a better slide deck than I do. But then the setup for this whole slide deck is: what tools could you not live without that designers don't have? And you can substitute the word designer for data analyst or whatever. She's making the point that within your specific subject domain, you have a suite of tools that are unique to your discipline, and then there are other tools that add to that, that augment your work, that make your work more possible. That's her question, and her answer is Git. And Git is for version control. As you'll hear me say, Git and GitHub are different things, but they work symbiotically together, and as we work through this slide deck I hope to define both for you. She asks the question, what is Git? And she references this computer scientist, Tom Stuart, who says Git is an application that runs on your computer, like a web browser or a word processor. I want to alter that definition a little bit. First of all, let's note that word processors have become almost ubiquitously cloud-oriented, so I'm not sure a word processor is always a clear explanation, because it's not always an application that runs on your computer; sometimes it's a cloud application that runs in your web browser. But if we take the definition that a web browser is an application, yes, Git is technically absolutely like that. I don't know why she references that, or what Tom Stuart was saying, but I would even argue that Git is actually a half step below an application and a half step above an operating system. It's very much a utility tool that for the most part runs in the background, and your need to address this tool, once you get used to it, is pretty minimal. So what does it do?
Well, it helps you manage the work done on your projects, not just on individual files but on an entire project, and that's a really important concept that we're going to define. But there is a real problem with Git, and that is that it's unfriendly. So let's take a look: this is somebody's Mac terminal, I believe, and Git is primarily a command-line interface tool. Historically, in order to interact with Git, you had to type commands at the command line, and even going forward you still may occasionally have to type a command, although I rarely do. There are a lot of clients, which we'll talk about, that sit on top of Git and make it easier to use, and conveniently RStudio is one of those clients that can sit on top of Git and orchestrate these Git commands so that we don't have to know them all. But it's helpful to understand where we're coming from. You'll see right here it says git checkout; that's a situation where they're writing a command to check out a particular view, a branch, of a repository. We'll go down here and follow another one: they're doing a git pull. "Origin master" sounds very jargony; they're actually pulling some data from GitHub, most likely. Git branch is where they're getting a listing of the different views of their project. And it goes on and on, and it initially feels very strange. But like I said, there are clients that sit on top of that, so it's helpful to know where those commands come from, but you don't have to memorize them. Actually, I would say I use five Git commands 95% of the time, so once you know them, they're kind of hard to forget, and they're not all that hard to implement. But the client is easier, and it does feel very jargony to newcomers. All right, I'm going to flash back and forth to this slide. There's a question of why Git is so unfriendly, and Alice flames this guy right here. His name is Linus Torvalds.
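Those handful of everyday commands can be sketched end to end like this; a local bare repository stands in for GitHub here so the whole thing runs self-contained, and the file name and commit message are made up for illustration:

```shell
cd "$(mktemp -d)"                                  # scratch area for the demo
git init -q --bare remote.git                      # a bare repo standing in for GitHub
git clone -q remote.git work && cd work
git config user.email you@example.com              # an identity is required for commits
git config user.name "You"

echo 'x <- 1' > analysis.R
git status --short                                 # 1. what changed? (?? = untracked)
git add analysis.R                                 # 2. stage the file for the snapshot
git commit -q -m "Add first analysis script"       # 3. take the snapshot
git push -q origin HEAD                            # 4. send commits up to the remote
git pull -q origin "$(git branch --show-current)"  # 5. fetch/merge remote changes
```

Those five (status, add, commit, push, pull) really do cover the bulk of day-to-day use.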
I'm going to scroll forward (just remember that picture) because it just feels so aggressive. But Linus Torvalds, if I'm pronouncing his name correctly, was initially the primary moving force behind Linux, which is literally the largest collaborative, cooperative software application on the planet. There needed to be a way for large groups of people, across multiple time zones, multiple spoken, natural languages, and perhaps multiple project management approaches, to work together; there needed to be a tool that enabled all of this sharing. And so he developed this thing called Git. And Linus Torvalds developed a bulletproof tool; you cannot argue. Linux is extremely successful, Git enables that, and Git as a result is extremely successful. But he didn't put any time into the user experience of Git. So let's break down the simplicity of it. Let me also note, because you may have seen the email I sent out, that because Git was designed for collaborative software development, there is an incredible array of extensible features to Git, most of which, I would argue, you don't need to know if you're not developing software; they're not super relevant. But the pieces that you will need to know for developing your data analysis workflow are pretty simple and pretty straightforward once you learn about them. So Alice's slide deck defines them as five separate things. Thing one is: Git lets you tell the story of your project. By that she means you can take a snapshot of all of the files in a folder, and that folder is then called a repository, or a repo for short.
Now, conveniently, that folder can also be an RStudio project. If you've used RStudio projects, you know that a folder can be an RStudio project, and I'm here telling you it can simultaneously also be a Git version-controlled folder on your file system. The real clear advantage here is that by taking snapshots of the entire folder at a state in time, you can then roll the entire folder back to that state in time. I hope to explain that a little bit more in a minute, but it's an important concept. You take these snapshots by doing what's called a commit. So here's a view, and many people do this, of how people manage files as they change without Git: you might create a file called logo, and then you change it, so you call it logo two; then you get some feedback from Monica, so you call it logo three. Then you kind of finalize what you want to do and you have logo three dash final, but, as is Murphy's law of making changes like this, particularly the ones where you call them final, it's never done. So now you even have logo three final one, right? And over time, no matter how precise you are about these file names, when you look back at this, maybe separated by about six months, you'll have a hard time figuring out exactly what was happening and why. You'll have a pretty good sense that final dash one is probably the last file, but if you're anything like me and you've done this kind of thing, even that is not so clear when you look back at it. So what Git does is allow you to keep the same file name as the file changes. Logo on the left-hand side is the original file; on the right-hand side you commit that into the repository, and it's still called logo, right?
So you make a change, you make a commit, and it's still called logo; you make a change, you make a commit, and the file is still called logo. The first view is sort of the old way; this is the version control way: with every change, you keep the same file name. Okay, so when you commit files, you have the opportunity to add some information to the change log. Minimally, that's who and when, and those kinds of things happen automatically: who made the change and when they made it, Git will manage for you. But you can also add a commit message, a little bit of information about what change happened, which is often very useful. This is an example of a commit message, and in my opinion, or at least in my experience, this is a pretty verbose commit message. You're not obligated to make such a verbose message, but you can say a lot about why you made a change in a file, or why you made several changes to a project at that commit moment. So then commit messages look a little bit like this: rather than changing the file name, you have commit messages. It says Alice Bartlett at 10:30 made this change to logo, and then she added new colors for a US election campaign, and then she fixed the orange color, et cetera. So in this view of a linear history of how your project has changed, moving from the oldest on the left to the most recent change on the right, up to the current state of your directory or your repository, each one of these circles represents a commit. You can see that it has that date information and the who information. This is a stylized representation of what's changing, and you can see that there's a little bit of commit message information. So you know the state of the project as it changes; not just a file that you could roll back to, but the entire project, in context with all of the other files, gets committed.
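The "logo, logo two, logo three final" story above translates into Git roughly like this; the file name and commit messages are invented for illustration, but the pattern (one file name, many commits) is exactly what the slides describe:

```shell
cd "$(mktemp -d)" && git init -q                  # fresh repository
git config user.email alice@example.com
git config user.name "Alice"

echo "version 1" > logo.svg                       # one file name, never renamed
git add logo.svg
git commit -q -m "Add first draft of logo"

echo "version 2" > logo.svg
git commit -q -am "New colours after Monica's feedback"

echo "version 3" > logo.svg
git commit -q -am "Fix the orange colour"

git log --oneline -- logo.svg                     # three recorded versions: who, when, why
```

The change log carries the history that the file names used to carry, so the working folder stays clean.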
And there are some fancy algorithms in the background that make sure this doesn't fill up your file system; it's very, very space efficient. We don't need to know about any of that; we can just trust it. Okay, so we've learned two more things: we've learned the word repository and we've learned the word commit. And then we can take advantage of that and do what I'm going to call time travel. Unfortunately, you can only travel backwards in time. I mean, once you travel backwards, you can flash forward to the present, but you can't go into the future, as you would expect. This is what I'm talking about when I say that Git allows you to roll back to the context of a project at a previous moment. It's not just taking a snapshot of an individual file, like what happens with OneDrive, Box, and Dropbox: you can version an individual file in those tools, and you can roll back to a previous version of the file in those tools, but it's much, much more difficult to say, okay, if I look at an Excel spreadsheet that I modified six months ago in relation to the other files in that folder, I would then have to go back and roll back all of those other files too. Whereas Git has a more context-aware notion of the project and allows you to move back to that snapshot. It does that, and this is background information you don't really need to commit to memory, by assigning something it calls a hash to each one of the commits. It's a really long, unique number; I've never typed out that entire string of numbers, and usually you can get away with just typing the first five characters or so, but it's that long to ensure no collisions. So each one of these is a hash: a commit that you can reference by its hash, a sort of unique identifier.
And so if I wanted to roll back, say, one, two, three, four, five spots in the linear story of my project, I could just check out that particular commit, referencing this particular hash. In this case, I'm going to go back and look at a time when I deleted an icon. So let's set up a scenario: I delete an icon because I think it's useless. I don't like it anymore, I throw it away. Then I flash forward six months, and I decide actually the design on that was really good; I'd like to pull it back out. Well, I deleted it, so it's no longer in my project folder, but it is still in the Git history, and so I can roll back and get it, right? Doing that rollback is what's known as a checkout. I can check out this particular hash ID, and that's going to make the project folder look like it did back on the 20th of May in 2016. The other five commits going forward are effectively invisible; they're not gone, but they're just masked for my purposes, because I want to look at the state from May 20th. So now we've learned the term hash and the term checkout, and we have a good sense of how you can journal or chronicle or make a written history of your project as it evolves over time. And then a nice thing that Git allows you to do is experiment with your repository. Let's see, what does this say? This isn't really how projects work, right? What they're saying here is that this idea that everything is super linear, that you move from one stage to the next and never have to reference anything else and don't want to experiment at all, is not really how we work in real life. And so Git has this concept called branches, which are effectively exact copies of a state, or a commit, in your repository.
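That delete-and-recover scenario can be played out on the command line like this; the file name is invented, and the hash is looked up from the log rather than typed in full, which matches the point that a short hash is all you normally need:

```shell
cd "$(mktemp -d)" && git init -q
git config user.email you@example.com
git config user.name "You"

echo "icon artwork" > icon.svg
git add icon.svg && git commit -q -m "Add icon"
git rm -q icon.svg && git commit -q -m "Delete icon, it seemed useless"

# Six months later: find the short hash of the earliest commit that touched the file...
hash=$(git log --format=%h -- icon.svg | tail -n 1)
# ...and check out just that one file from that point in history:
git checkout -q "$hash" -- icon.svg
cat icon.svg                                      # the "deleted" file is back
```

Checking out a hash with a file path after `--` restores only that file, so the rest of the project stays at the present state.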
Then you can make those branch copies and be experimental with them. An example of that: I teach a lot of workshops on R, and I generally create a Git repository for every workshop, but I'll usually tweak it every semester. So I'll have a version of the workshop that I want everybody to look at, but then I'll make a branch copy to make my changes and experiment with things, and that's kind of pulled off to the side. Because I do most of my work publicly, the branch and the public version are available to anybody, but there are permission controls that I can apply if I need it to be very private. So let's take a look at an example of what that looks like. Here's an example of how these slides are a little bit old: you may know, if you've had an eye toward how our culture has had some very sharp conversations recently, that the term master is falling out of favor, but master in a legacy sense refers to the main branch of these repositories. Toward the end of this I'll show you how you can change that name to main, but in this slide deck, linguistically, it has a specific meaning: it means main or principal. So we have this main branch, and then we're going to make a copy of that branch and call it add-new-styles. Then I have this main branch that the whole world is looking at, and one, two, three, four, five changes have been added to the branch copy; it's based off of the main branch, but is of course in a completely different state. And I can switch back and forth using the word checkout: I can check out master and then check out add-new-styles and go back and forth, or if I share my repository with somebody else, they can go back and forth. I can make multiple branches. So let's say, for example, this is a website and I decide that I want to change some colors to make some things a darker, more Duke shade of blue.
And then maybe I also do a wild experiment where, rather than a Duke shade of blue, I make the links hot pink, just to see what it looks like, just for fun; maybe on a lark I'll show it to some of my colleagues. So now I have three branches, three instances of this same project, but the main branch, the master branch, hasn't changed at all. Another way you might think about this: you've got your project done, you're ready to turn it in to your principal investigator or your thesis committee or whatever, but you have some wild ideas and you're wondering, well, what else could I do with this project? Make a branch and try those ideas out. You're not in danger of losing the ability to roll back to the pristine version that you want to turn in. And if you decide that you do want to keep those changes, you can, with a term called merge. So the command here would be: I check out back to master, and then I merge add-new-styles. It's going to automatically bring each one of those changes forward, so that master and add-new-styles are now, at this point in the project, the same. Meanwhile, I have this sort of orphan branch just sitting out there, which I could either delete or leave there forever. Okay, so we've learned two more terms, branch and merge. Now let's talk about thing four: Git helps you back up your work. Many of you know that it's a really good idea to have multiple copies of your data, your analysis, your scripts. I used to work in a systems department, and we would call that the belt-and-suspenders approach: it's never harmful to have a lot of copies of your stuff, because you can't predict what kinds of weird things might happen.
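The branch-and-merge story above maps onto a handful of Git commands. Here's a minimal sketch (branch name, file, and colors are invented for illustration) of branching off, experimenting, and merging the experiment back:

```shell
# Sketch: branch, experiment, then merge back (throwaway demo repo).
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "a { color: navy; }" > styles.css
git add styles.css && git commit -q -m "Initial styles"
base=$(git symbolic-ref --short HEAD)      # 'master' or 'main', depending on Git version

git checkout -q -b add-new-styles          # an exact copy of the current commit
echo "a { color: hotpink; }" > styles.css  # wild experiment; base branch untouched
git commit -q -am "Try hot pink links"

git checkout -q "$base"                    # back on base: links are still navy here
git merge -q add-new-styles                # decided to keep it: bring changes forward
```

After the merge, the base branch and add-new-styles point at the same state, and the experiment branch can be deleted or left sitting there.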
And if you can make that process of having lots of copies simple, then you can sleep easily and feel like, well, multiple copies could go down and there's still a copy sitting around. So, for example, I typically use Git as my version control system and GitHub as the place to host my repository. I'll work on a work computer, and then sometimes at home I'll pull that work down off the GitHub repository, so I have another copy of it, and I can move that work back and forth; we'll talk about that in a minute. Right there I have three copies of the work, and if I share it with somebody else, for example Sophia, that's a fourth copy. Now, I'm never quite sure exactly what the purpose of this slide is, but I know that Alice is comparing how Box works in comparison to Git. I don't know exactly what she says, but it's clear that having multiple copies of your work is a safe idea, and I would certainly recommend that you do it. The process of putting your local Git repository up onto GitHub is, in Git terms, called making a remote, or pushing to a remote. Here it's worth noting that Git and GitHub are different things. There are all kinds of remotes, and oftentimes these remotes are also known as social coding sites. There are several you may have heard of: GitHub, GitLab, Bitbucket. You can push your repository to all three of those if you want to. A lot of times institutions and companies will have their own instance of one of those tools; for example, Duke runs a local version of GitLab, so you could push to a fourth remote there if you want. The Duke GitLab is only accessible by NetID, so by nature it's a bit more private. We haven't talked about permissions, but in any case, Git and GitHub are different.
But the kinds of things you can do with tools like GitHub go further: managing a project in terms of creating documentation, tracking issues, identifying bugs, creating a little discussion board with collaborators who may not all be in the same city, who may be spread out across the world. You can manage all of that kind of project collaboration at a remote social coding site such as GitHub. And the way that works is I put my remote up on GitHub, and then if you wanted to pull it down for the very first time, you would do something called a clone, or you might do something called a fork; for our purposes it doesn't really matter which. So let's take a look at an example of how that kind of collaboration works, and how multiple copies proliferate intentionally. We're going to identify two other people, and start off with this remote up at GitHub being a representation of a repository. We're going to add in this guy named Martin, and he's going to clone that repository, and now you can see that they have the same commit history of the project. Then we're going to add in this person named Lucy, and she's going to clone the same remote. So everybody's up to speed: they have the exact same history of the project as it has evolved, and they're all on the same page. Now what happens is Alice decides she's going to make a change. In this case she's fixing some icon color tinting; it looks like by using the FFF code she's making it brighter, so this is removing some kind of bug. She makes a change, and as you can see, she's one commit ahead of everybody else. Now, in order to let everybody else catch up with her, she has to do something called a git push. When she pushes, in the background it's going to push up to the main remote unless we do something different, so it's going to go into the main branch of that repository.
And then the onus is on Martin, because he's one commit behind, to do what's called a git pull. When he does a pull, he's up to speed, and everybody's on the same page. All right, so we've learned about collaboration there, but we've also learned about push, pull, remote, and clone, all of which are super useful. Now, let's talk about thing five: Git helps you collaborate by telling the story of your project. Well, I've kind of mentioned all of that already, I think. Git allows you to work on the same project together, and this is actually a super critical slide. I mentioned how it can be kind of a pain to start working with Git. As a matter of fact, a lot of people, me included, when I was using Dropbox and Box on a regular basis, looked at Git with a little bit of trepidation. It seemed weird, it seemed hard. There is indeed a learning curve, and as we noted, Git has a rough user experience. But remember that those other tools are not project synchronization tools; they're single-file version tools. And so once you make that leap, which feels like a chasm, it's really kind of a tiny little leap over a little creek: you have to pay a little attention to where your feet land, but not much. Once you make that change and start using Git on a regular basis, to be perfectly honest with you, that sense of trepidation flips: instead of looking at Git with trepidation from Box or Dropbox, you start to feel trepidation about all the people still using Box, Dropbox, and OneDrive, because those tools are not nearly as full-featured, not nearly as bulletproof; they really don't work as well, and it's much harder to resolve problems with them. So I certainly encourage you to consider using Git. We're going to do some hands-on here in a minute. But this is why, once you start using it, all of these features are why people sort of suffer through the user experience.
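The clone/push/pull choreography between Alice and Martin can be simulated entirely on one machine, using a bare repository as a stand-in for the GitHub remote. All names and files here are illustrative:

```shell
# Sketch: a local bare repo plays the role of the GitHub remote.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q --bare remote.git              # stand-in for the remote on GitHub

git clone -q remote.git alice && cd alice  # Alice clones the remote
git config user.name "Alice" && git config user.email "alice@example.com"
echo "tint: #FFF" > icons.css
git add icons.css && git commit -q -m "Fix icon tinting"
git push -q origin HEAD                    # Alice was one ahead; now the remote has it

cd "$tmp"
git clone -q remote.git martin             # Martin clones and gets the same history
cd martin && git pull -q                   # later pulls keep him on the same page
```

The only differences with a real GitHub remote are the URL and the authentication; the push/pull mechanics are identical.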
And of course, if you're using a client like RStudio, you don't even have to remember most of the terms I just mentioned; you can just click on things to make them happen, because RStudio will orchestrate the Git-to-GitHub version control process for you. So just to reiterate: Git tells the story of a project. It allows you to travel back in time. It allows you to experiment with changes using branches. It allows you to back up your work into multiple distinct geographic locations. And it allows you to collaborate with others, even if the only person you intend to collaborate with is yourself, which is a big thing. In my opinion, it's totally worth the effort. All right, so those are Alice's slides, and I want to once again thank Alice Bartlett for developing them; they're super helpful. I'm going to check the chat here, because some people put in comments. We are going to do a hands-on where we create a kind of hello-world GitHub repository, using only RStudio to manage the whole process. If you got my email, I mentioned how you could set up Git in advance, and now we're just going to make sure that that happened. Going back to the notion that Git can initially be kind of a real pain: we're not really going to have time to troubleshoot problems that you may experience, so I only ask for your patience if you get stuck. Let me share a different screen here. In anticipation of all this, the first thing I need you to do is log in to GitHub. If you go to github.com, you won't see this screen, but you'll see a sign-in link, so go ahead and sign in.
The other thing I want you to do is load your RStudio client, which ideally you previously had the opportunity to set up so that it'll work with Git, by installing Git. We'll run through some of that again, but if you haven't done any of that, feel free to just watch; it's helpful just to see how these things go. The very first thing I want to do, from the standard RStudio view, is click on this word Terminal. And by the way, I'll reiterate: there has been so much change in how RStudio orchestrates Git that it is totally worth it to update everything. Have the latest version of the usethis package, the latest version of RStudio, the latest version of R. If you run into problems, that might be something to try even before reaching out for a consultation. If you don't have this Terminal tab, I think you can go to View and choose Show Terminal. I'm going to change the font size here in the hope that it makes it a little easier for you all to see. Oh, I have to restart in order for that to take effect, so I'll restart, and there's the terminal, a little bit bigger. So this is the Console; again, I'm going to go to the Terminal. Feel free to follow along with me. Now, here in the terminal, I'm going to type git --version just to make sure that my RStudio knows where Git is on my system. If I don't get a response something like this, that means I haven't installed Git yet, and you can check back on the email I sent about how to go about doing that. Basically, I just Google the words "git" and "download", and it's pretty straightforward; just accept all the defaults. So now I know that I have Git installed, and I'm logged in to GitHub.
Next I'm going to type in rfun.library.duke.edu; you can get there from this main site. If you just click into this version control / GitHub module, you'll find a link to a guide, and that guide URL, which I'm going to post into the chat in just a minute, just has the word "git" after it. Let me find my chat button so that you have this link. What this guide has is a little cheat sheet of things you need to do, such as the one-time-only setup of making RStudio aware of how it's going to interact with GitHub, and then some other steps. What I want to do now is type this command into the Console, not the Terminal, so we're no longer using the terminal; I'm switching to the Console. The easiest way to do this is to first type library(usethis) to load the usethis package that I've installed, and then the command create_github_token(). When I hit Enter, this is going to launch me into GitHub and give me an opportunity to create what's called a personal access token. It's just for you; you shouldn't share it. So I hit Enter, and it launched me into GitHub, and now I have to enter my GitHub password, which is what I was doing in the background. All right, this is the screen it puts me on. Here I'm looking at a new personal access token, and it gives me a chance to put in a note, because I can have multiple tokens, so you want to put in something descriptive that means something to you. I'm going to put in the name of this machine, which is 23302, and call it november_workshop; I don't have to use the underscores, it's just a habit. And I'll mention now that I'm not actually going to use this token.
Then I'm going to click Generate token, and I get this screen: here's the token, and you're only going to see it once. I recommend that you click on this icon right here, which copies the token to your clipboard. Now, I'm not going to use this token, because I already have one working. Tim asks: is this something we would do for each project, or is it one-time? It's a one-time procedure, Tim, for each computer that is going to access GitHub, but not for each project. And by the way, going back to this reference guide, there's a little three-minute video that takes you through these same steps. If you're anything like me, these steps are done so infrequently that I can hardly ever remember them; I myself look at that video probably two or three times a year. So copy that token by clicking on the icon, and don't click delete, although I'm going to click delete, because I don't want my demo token to exist. Now that you have the token copied to your clipboard, go back to RStudio, where you're going to type the next command: gitcreds::gitcreds_set(). I'll put that command in the chat so you can see it, but be careful, because you just copied a long token into your clipboard, so don't overwrite it. When I type this command, it gives me the opportunity to paste those credentials into my system. You won't necessarily see the message I get; it's basically telling me I already have credentials, and asking whether I want to replace them or keep them.
I'm going to say keep, but if you clicked on replace, it gives you an option to paste in the token that you just copied. Let me do keep, because I want to keep mine. Once you do that, it'll basically tell you to paste your token; go ahead and paste it and hit Enter. That seals the deal: it allows RStudio to negotiate a super secret handshake between Git and GitHub, relevant only to the computer you're working on. So it's something that you need to keep private. If you mess it up, don't worry about it: you can delete the token, you can create as many tokens as you want, you can try this again, and you can refer to that little three-minute video I just referenced. But now my machine has the token set into it. I'm going to go back to my little cheat sheet, and now what I want to do is check that GitHub knows who I am; I'm going to put this into the chat. Sorry, I didn't say that quite right: I'm going to use RStudio to check that GitHub knows who I am, and that command is gh::gh_whoami(). You should see different information than me, but this one comes back and says it knows I'm John Little, that I've logged in as libjohn, which is my GitHub ID, and it knows my GitHub profile page and some other things. So that's all good. The other thing I'm going to do is use the command git_sitrep(); it comes out of the usethis library, and sitrep is short for "situation report". Let's see what happens when I type it. It checks a bit more information, and specifically what I'm looking for is: does it know this kind of information?
Because it's really good for Git to know who you are: not just who GitHub thinks you are, but who your local Git thinks you are as well. If you don't have information like that in the email section, then you're going to use this next command; let me copy it into the chat, and let me tell you how to alter it. I'm going to change "Jane Doe" to my name; you would use your name. You could literally put any name you want there. If I want to call myself Sally Kornbluth, I can do that; it's not a good idea, but I can give myself a nickname or whatever. For the other part, however, I would recommend that you use the NetID version of your email address, jrl at duke.edu, particularly if you happen to be using GitLab. What I'm doing here is matching this local setting to the GitHub one; I just want to make sure they're the same. Although you'll notice that apparently I changed mine, so I guess even that doesn't make a whole lot of difference, but I feel it should be something recognizable, and if you can keep the same email address, that may help you in the future, because obviously I'm confused here by the fact that I've got several different iterations of email addresses. In any case, I'm not going to run it, but you can go ahead and hit Enter, and that will set your username information. Then I'm going to go back here one more time, and here's a typo: this is incorrect. I don't want to type gh::git_sitrep(); I want to type usethis::git_sitrep(). Let me push the corrected version back out to you; sorry about that, I will fix it. This step is just to make sure everything's working, and it gives us back the same information, so we feel good about it. All right, total pain in the butt. Thank you for sticking with me, but those are the steps that you've got to do.
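Under the hood, the identity command being described just sets two Git config values (in usethis this is presumably use_git_config(); the name and email below are placeholders). A shell equivalent, using repository-local config so the example doesn't touch your real global settings, looks like this:

```shell
# Sketch: what setting your Git identity boils down to (placeholder values).
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q

git config user.name  "Jane Doe"          # any display name you like
git config user.email "jdoe@example.edu"  # ideally matches your GitHub/GitLab email

git config user.name                      # reads the value back
```

In real use you would set these with --global once per machine, which is why the email you choose here should be one you can keep for the long haul.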
Now, if all went well, you have the ability to create a GitHub repository using RStudio. Because of all the changes that have happened in RStudio and in the usethis package, this has gotten easier than it's ever been before. So we're going to go up here and choose the option for a new project. When you get to this wizard, and I'm going to talk a little bit longer to give everybody a chance to stay in the same spot, you're going to choose the New Directory option. If you've got GitHub already working and connected, you could instead use the Version Control option to very easily clone a Git repository, but I'm going to click back, and we're going to start with New Directory, and then click New Project. This is going to let me make a folder on my file system that is simultaneously an RStudio project folder and a Git version control repository; both things happen at the same time, mostly in the background. I'm going to give it a name, something like hello_world_nov9, with the date in there just in case I happen to do this again a whole year from now without realizing it. When you give it a name, make sure that you check this box right here that says "Create a git repository". Once that's checked, you can click Create Project, and RStudio is going to churn a little bit, but you'll notice up here that I'm now in an RStudio project; we'll talk about projects in a minute. I want to be back at my console so that I can create a GitHub repository out of this project. This is the project name right here. It automatically created what's called a .gitignore file. If I look at that, it lists a few files that it will not upload into my Git repository, and I can add to it.
So, for example, if I had a file called supersecretpasswords.txt, then by putting that name in the .gitignore file, I'm automatically protecting myself from accidentally uploading it. I can use wildcards if I want: *.rdb, say, if I don't want to upload relational database files. There are all kinds of things you can put in there. But that's what we've got so far. Let me double-check my process. We've got a question coming in. Tim says: "Somewhere I read that I need to make the repository in GitHub first and use the Version Control option under New Project to paste in the address of the new repository." Okay, I understand what you're saying: it sounds to me like what you're describing is cloning a GitHub repository, and we're not doing that. The process we're doing right now is creating a new GitHub repository from our local Git repository. If you wanted to clone, you would have used a different option in the wizard: in RStudio, when we chose to make a new project, we would have gone to Version Control to clone. Let's hold the details of that for later. Now, you'll notice in this cheat sheet that I have a couple of commands that help me do things like create a README file or add a license file. I'm going to go ahead and run those two. I already have a .gitignore file, and I don't need to type use_git(), because some of those things have already happened for me. So let me go back to this directory, make sure I have the usethis library loaded, and create a license. What is the command? use_mit_license(). We'll talk about licenses in a minute.
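The .gitignore behavior just described is easy to verify at the command line; here is a minimal sketch with invented file names:

```shell
# Sketch: .gitignore keeps named files and wildcard matches out of the repo.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q

printf '%s\n' 'supersecretpasswords.txt' '*.rdb' > .gitignore
touch supersecretpasswords.txt data.rdb notes.txt

# check-ignore exits 0 when a path is ignored by the rules
git check-ignore -q supersecretpasswords.txt
git check-ignore -q data.rdb

git status --porcelain > status.txt   # only non-ignored files appear as untracked
```

In the status output, notes.txt and .gitignore itself show up as untracked, while the password file and the *.rdb match never appear, so they can't be accidentally staged or pushed.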
So I type use_mit_license(), and you'll see a file show up over here: it just generated a new LICENSE file for me, in this case the actual MIT license. Notice that if you start typing "use_" and "license", there's a whole bunch of different licenses you could choose from, and you're not even bound to use only those. It also created an .Rbuildignore file, which is not important for our purposes; it won't hurt anything, it won't help anything, it's just going to sit there. Then I'm going to make a README file, and the easiest one to do right now is the Markdown version, use_readme_md(). If you're used to editing R Markdown files, you could do that instead, but let's just do the easy one. I hit Enter, and it creates a README file for me and throws me into the editor, so it gives me a stub to start with. The goal of this project is to, I'm going to say, "create a test hello world repository". So I've got a little README there. I could say this was done on November 9, 2021, and I could name contributors if I wanted to, so let's just go ahead and say I'm contributing, and let's pick on Sophia and make her a contributor as well. Uh oh, now the truth comes out: Sophia knows I'm a bad speller, so I have to look up how to spell her name so I don't get it wrong. All right, so I've got my README file and I'm going to hit Save. Then let's look over here at this Git tab. What we'll see is that I have several files that have changed since I created the Git repository, which, I'll remind you, happened in the background. I want to commit each one of these files into this version of my repository, and I'm going to do that by staging them. I'll stage the .Rbuildignore, and you'll notice the little icon changes to an A, for added.
I'm going to stage the .gitignore, stage the LICENSE, and add the README; sorry, I should say add, since I'm adding, I haven't committed yet. And I'll add the project file. Then what I want to do is click Commit, and this gives me an opportunity to put in that commit message we were talking about back in Alice's slides. So I can say "this is my first commit to my project", and if I want to, I can put in lots more words. It just has to be a message that means something to you. Once I write my commit message, I click Commit, and I get some information back from Git: it shows the git commit that ran, it identified where the commit happened and the commit message, and it told me that it changed five files, with 58 separate line insertions. Most of that I can kind of ignore, but there's the hash that I could reference for that one commit. I can close that. I'm going to close this too, and show you one other thing; you don't have to do this, but if I click on the history of commits, by clicking on this little clock, I get a visualization that shows me some of that background: there's the hash, here's the author of the commit, here's the date, here's the commit message. If I had lots of commits, they would all show up there. So now that I've created that commit in my Git repository, which is also an RStudio project, I want to push it up to GitHub, because I haven't done that yet. I need to do one more thing, and that is to use the function use_github(). That's going to create the connection between the local repository and the remote repository, and it's conveniently also going to push it up to GitHub.
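What the Git tab is doing behind those checkboxes corresponds to a couple of command-line steps; here's a minimal sketch with made-up files standing in for the README and LICENSE:

```shell
# Sketch: staging and committing, the CLI equivalent of RStudio's Git tab.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"

echo "# hello world" > README.md
echo "MIT" > LICENSE

git add README.md LICENSE       # "staging" = ticking the boxes in the Git tab
git commit -q -m "This is my first commit to my project"

git log --oneline > log.txt     # short hash + message, like the history clock view
```

The one-line log output is the same short-hash-plus-message listing that RStudio shows when you click the little history clock.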
So let me go back to RStudio, make sure that I'm in the Console, make the Console big, make sure the usethis library is loaded, and type use_github(). When I hit Enter, not only will it make sure that everything is in order with the local repository, it's going to push it, and then it's going to switch over to the browser and launch me into the GitHub view, assuming I did everything correctly. Let's see, there's a message from Fahim saying his computer is going to die and he has to turn it off, and can we have the video? Yes, absolutely, there'll be a video. All right, so this gives me some information back, which I have to admit I've gotten used to just completely ignoring, but it says that it created a GitHub repository, it set the remote, it pushed the master branch up with origin/master as the upstream, and then it opens this new GitHub repository. It gives me the URL, but it also opened it in the background. So if I go to my web browser, you can see that I now have a single commit for this repository, and it's up there for the whole world to see. There's one branch, and it's called master. Let's do a little bit more; you can just watch this part, you don't necessarily have to do it. Let's say I want to make a change. I'm going to go back to my README file and add "status update: pushed to GitHub on November 9, 2021 at about 11:14 EST". So I make a change to my README file. When I save it, over here in the Git tab it notes that a file got changed. I'll click on the staging checkbox to add it, and now I get an M instead of what I had before; that just means modified. And I'm going to commit that with a commit message: "added status info". Spelled that wrong; still spelled it wrong; there we go, a little closer.
I'm going to commit that, and it tells me that I made one change to one file, with five insertions. I can close that. My next step would be to push, but let me just note that I've got a little status message here that says my local branch is ahead of origin/master by one commit. If I click on my local history, I now see that I have two commit messages, so two changes. What I really want to do is push. If I briefly look up at GitHub and refresh, we're out of sync: GitHub has one commit, and you just saw that my local copy has two. So I'm going to push this up, and it's going to work its magic in the background using the personal access token. Now if I hit refresh, you'll see this number change to two, you can see that my status changed, and I still have one branch. So let me do that one last thing, which is that I said I would teach you how to change the name of the branch from master to main. usethis just made a super cool change that makes that easy. Let me find that so you can do it too, and I'm going to put this message into the chat; it's long, and you don't necessarily have to read it all right now. I need to make sure I'm sending this to everyone. Oh, sorry Fahim, I didn't notice that your message was directed just to me; I apologize for reading it to the everyone list. The commands I'm going to run are git_default_branch(), git_default_branch_rename(), and git_default_branch_rediscover(). Let's have a quick look at what that looks like. Back in RStudio, in the Console, using usethis again, I type git_default_branch(), and it does some work in the background that tells me the default branch is called master. Then I type git_default_branch_rename(), and by default it renames master to main, which it just did; it says the branch is now called main.
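At the Git level, the rename that usethis performs here is just a branch rename; a minimal sketch on a throwaway repository:

```shell
# Sketch: renaming the default branch from master to main at the CLI.
set -e
tmp=$(mktemp -d) && cd "$tmp"
git init -q
git config user.name "Demo" && git config user.email "demo@example.com"
echo "hello" > file.txt && git add file.txt && git commit -q -m "init"

git branch -M main               # rename the current branch, whatever it was called
git symbolic-ref --short HEAD    # reports the current branch name

# To make every *future* `git init` start on main, Git has a config setting:
# git config --global init.defaultBranch main
```

On a repository that already has a GitHub remote, the usethis helper also updates the branch name on the remote side, which the bare git branch -M does not do by itself.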
It also says to be sure to update any files that refer to the default branch. There's one last command, git_default_branch_rediscover(), which I'm not going to run right now, but which you, or your collaborators, can run later. Those three steps allow me to modernize my terminology. You can actually set this in your Git config so that you don't have to run those commands and it always works going forward, but I didn't want to automatically assume that's what you want. Right, so that's the hands-on portion. While I transition back to the slides, let me let you ask some questions. All right, hands-on portion done, and I hope it was a good experience for you. I'm sure it was a bit tedious; like I said, Git is kind of always a little bit tedious, especially around the initiation and the initial learning curve, but it totally pays off. It gets to the point where it runs so much in the background that you really don't even think about it, and it has all of these nice features. All right, there ends the commercial. Let's run through the rest of the slides. Checking the time, it looks like it's 11:20, so we're doing okay. Some tips; I'm not going to talk about all of these, but you'll get the slides. One of the tips about automating for a reproducible approach is to do as much as you can via scripts, and by scripts I mean that .R file or that .Rmd file. Sophia was talking about how there was a crisis of reproducibility, and how a neuroscientist presented a challenge to people and said: see if you can run code from 10 years ago.
Well, the truth of the matter is, 10 years ago — definitely in the '90s, definitely in the aughts, into the teens — the state of the art of computing, what was really fancy and seemed to be what everybody was doing unless you were a coder, was using mouse point-and-click tools to navigate Excel, to navigate WordPerfect or Microsoft Word or whatever. And the fact of the matter is that what's called a GUI environment, or a WYSIWYG environment, is actually a barrier to reproducibility. For reproducibility, you need to have a linear instruction set that says do this, do this, do this, do this, then do this, and you can add in logic: if this, then do this, that kind of thing. But that kind of logic, that kind of linear instruction set, is code. So the recommendation for reproducibility, if you really want to feel like you're very strong in it, is to do it all with code. It just so happens that with R, and RStudio in particular, you can orchestrate these reproducibility steps, because RStudio has a really mature sense of the data lifecycle, from data ingest all the way to report outputs, including orchestrating Git version control and pushing things up to GitHub. So do what you can with scripts. A workflow that is not as reproducible would be to ingest data into something like OpenRefine — a really nice, reproducible tool — and clean your data, but then you have to export it, and then you have to move the exported file with your mouse into an analysis tool like Excel or maybe Tableau or maybe MATLAB, whatever. Then you do your analysis and you generate some visualizations and you export the visualizations as files. Then you grab your mouse and you copy and paste and move files into your authoring tool, which would be perhaps Microsoft Word. And then you would compose your report.
And the point I want to make is that every time you're grabbing that mouse, every time you're copying and pasting or dragging files, you're creating barriers to reproducibility. The state of the art of reproducibility is this notion that you have some instruction set that can tell the computer what to do, where you don't have to intervene and orchestrate the computer with your mouse. Now, that may change. There are tools such as Tableau that are quite aware of reproducibility concepts, but there are not a lot of them. And it just so happens that tools like Python and tools like R are still actually much better at this — at allowing you to create a reproducible workflow that just becomes part of your process. So do as much as you can with scripts. Let's see — I just said most of that. Well, I'll just note that RStudio allows you, as we've said before, to make a folder or directory on your computer also be an RStudio Project. That allows you to leverage relative file paths; we'll talk about that in a minute. Use literate coding techniques. So the difference between an R Markdown file and a .R file is that an R Markdown file is, by default, a literate coding tool. Jupyter notebooks are literate coding tools. Both of them can be multilingual if you have the right coding kernel on your computer. Literate coding allows you to intersperse natural language with code chunks. The advantage of using natural language is that you can use it in a full-featured way to explain what's happening with the data, to explain what's happening with the analysis. And we would contrast this with, you know, the old-school way, where you precede a line with a hashtag and then write some comment about what's happening.
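A minimal sketch of what literate coding looks like in an R Markdown file — prose interspersed with a code chunk. The title, dataset, and column names here are hypothetical:

````markdown
---
title: "Penguin analysis"
output: html_document
---

We first summarize bill length by species. The mean is computed
after dropping missing measurements, as explained in the README.

```{r summary-by-species}
library(dplyr)
penguins %>%
  group_by(species) %>%
  summarise(mean_bill = mean(bill_length_mm, na.rm = TRUE))
```
````

Knitting this file runs the chunk and weaves its output into the rendered report, so the explanation and the computation can never drift apart.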
Those comments can be full-featured explanations, but typically, because it's so difficult to compose natural language that way, they're very terse and not often as helpful. If you can use literate coding, not only can you be more explicit about explaining what's happening, but you can then leverage that on the back end by generating different kinds of reports, which we'll talk about in a minute — actually, we're referencing it right here in this slide. Those different kinds of reports can be slides, dashboards, websites, all kinds of things. I have a workshop on R Markdown if you want to check the Rfun site. But just to bring home this concept of reproducibility and relative paths: when you're using RStudio Projects, you can reference other files and file paths relative to the root of the project. So if the root of the project is something like /Users/john/projects/analysis, blah, blah, blah — that's all idiosyncratic to my file system. There's no way you're going to have that on your file system. If I reference this full path and share my code with you, you're going to have to go in and edit my code just to run it. If I set it up as an RStudio Project and then just reference the subfolders and subfiles from the project root — by saying things like figures, or, like today, I have a script where I cleaned some data and I want to save it with write_csv() into the data/clean folder — the path is relative to the project root. It's just data/clean. When you clone that project from GitHub, it's just going to work, particularly if you're using it with RStudio. Contrast that with old-school R: you no longer need to, and shouldn't, use setwd(), because if you use setwd(), anyone on a different computer, or anyone who isn't the author, has to edit it. And you shouldn't use rm(list = ls()).
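That relative-path idea, sketched as a small cleaning script. The folder and file names are hypothetical, and it assumes the script runs from the root of an RStudio Project:

```r
# clean-data.R -- paths are relative to the project root, so this runs
# unchanged on any machine that clones the project. No setwd() anywhere.
library(readr)
library(dplyr)

raw   <- read_csv("data/raw/survey.csv")   # data/raw/ inside the project
clean <- raw %>% filter(!is.na(response))  # a stand-in cleaning step

dir.create("data/clean", recursive = TRUE, showWarnings = FALSE)
write_csv(clean, "data/clean/survey.csv")  # lands in data/clean/ for later scripts
```

Because every path starts from the project root, a collaborator who clones the repository can run the script as-is, with no editing.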
Instead, you should use the RStudio feature called Restart R and Run All Chunks. We can get into the details of that, but I'm just kind of bringing you up to current best practices. I would definitely recommend that you go into the options of your RStudio, turn off "Restore .RData into workspace at startup", and change "Save workspace to .RData on exit" to Never. There are a couple of other settings, but those two, along with the Restart R and Run All Chunks feature, are things that will set you up for a workflow that is more naturally reproducible, right? Things will run from the beginning of the script to the end. And if I share it, I'll feel confident that you also will be able to recreate my environment on your computer. Okay — record, guide, and test: a couple of little things. If, for whatever reason, you're using some kind of random number generator, or you're setting seed parameters, make sure you record those. Ideally, record that — of course it's in your script, but record it in your literate code, put it in your README file, whatever. Use a main script to execute all of your subscripts. So if you're not using a single script — which I tend to do most often — but you have a really, really elaborate project, you might have a script just for cleaning data, a script just for the analysis, a script just to generate visualizations, a script just to generate reports or multiple kinds of reports. Then you're going to want one more main script that does what's sometimes called externalization: it orchestrates the child scripts, running them in order as necessary, right? So you can have child scripts that are related to a parent script. Don — I'm looking at Don, and I can't tell, Don, whether what I'm saying makes sense or not. Hi, Don. It's not immediately obvious why that is useful, but I just want to make that recommendation.
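The "main script" pattern John describes, as a sketch. The sub-script names are hypothetical:

```r
# main.R -- one parent script that orchestrates the child scripts in order.
# Running this from top to bottom reproduces the whole project.

set.seed(2021)               # record the seed here (and in the README)
                             # so any random steps are repeatable

source("R/01-clean-data.R")  # ingest and clean the raw data
source("R/02-analyze.R")     # fit models / run the analysis
source("R/03-figures.R")     # generate the visualizations
source("R/04-report.R")      # render the report(s)
```

A call like `rmarkdown::render("report.Rmd")` would typically live in that last sub-script, so the whole pipeline, data to report, runs from one command.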
And if that's where you are, if you're ready to tackle that, let me know and I'll be happy to help you. One other comment. This is specifically about generating software, but there's no reason why it wouldn't also apply to generating analysis: you want to set up tests so that you get predictable results. You're doing all of this while recognizing that your steps are not always discrete; sometimes you have to take two steps back to take three steps forward. You have to add data, change the analysis, generate a new output. You want to set up tests so that you can get logical outputs and verify that things are still working. And Jenny Bryan, who's one of the luminaries at RStudio, points out that if you don't set up the tests yourself — and there are tools to set up those tests — then you become the test. So that's just a little motivator: it may be a relatively easy script and you can check the output all the time, but as it becomes more elaborate, you definitely want to set up failsafes to verify your output. Oh, let's see — tips on dissemination. Share your research data and code, ideally through a repository. We just talked about how to set up a GitHub repository; it's a great way not only to share your work but to profile your work. But also make sure you have the rights to share. So if you're putting data in there, is it your data to share? If it's not your data to share, or if it has personally identifiable information, you need to take extra steps: either figuring out an alternative to dissemination or a way to limit who can see it. Maybe it needs to be on Duke's instance of GitLab rather than GitHub, which is more public; maybe it needs to be in the PRDN.
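The testing idea from a moment ago — set up tests so you don't become the test — can be sketched with the testthat package. The cleaning function here is a hypothetical stand-in for whatever your project depends on:

```r
# If you don't write the test, you are the test (paraphrasing Jenny Bryan).
library(testthat)

# A hypothetical cleaning helper the project depends on:
clean_scores <- function(x) x[!is.na(x) & x >= 0]

test_that("clean_scores drops missing and negative values", {
  expect_equal(clean_scores(c(4, -1, NA, 7)), c(4, 7))
  expect_length(clean_scores(numeric(0)), 0)
})
```

Rerunning a test file like this after every change to the data or the code tells you immediately whether the pipeline still produces the output you expect.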
You know, those things are judgment calls that Sophia and I can help you ferret out, but just be aware that you can't disseminate data without a clear understanding of your privilege to share that data. All right, so this brings us to the idea of what happens once you get past all of the creation. Let's say you created a report — a journal article — which gets submitted to your professional discipline. It's worth noting that GitHub, which is a social coding site but also a code repository, is not an archival repository. Their current business model seems to keep everything around, but they're not actually making a promise to you that they're going to keep your code around for 10 years. I'm betting that they will, but betting is going on a hunch. So what you want to do is link your code repository with an archival repository. There's an easy way to do that. There's a link right here in this slide — and I think Sophia has something to share with you that includes that link — that allows you to make a connection between GitHub and Zenodo. Zenodo is an archival repository. It'll basically make a snapshot of a milestone of your GitHub project, perhaps at the point where you submitted something to a publisher, and perhaps again at the point where the publisher accepts after you've made changes. You can make these major milestones in addition to the more atomic commit version changes, and you can mint a DOI, or digital object identifier, which is a unique identifier for the code at that point — Zenodo will allow you to mint that. I'll show you an example of that in a minute. Git certainly simplifies all of those processes, and of course you would want to link that with your ORCID, which is a unique ID for you as an author. And so here's an example of that. If you make that connection — here's another link to that same tool — you can... here's an example record.
So this is a repository that I created for teaching R, and I'm just going to click on this DOI that I minted after linking GitHub and Zenodo. That takes me to a view of my milestone version that I versioned in GitHub, and then Zenodo took over and made a zip copy of all of those files, so if GitHub disappears, I can still see the version from that time frame. Ideally, I have a README file that talks about the versions of the packages that I used at that time, so somebody could reproduce that. If I look down here, I can see that I have a version 1.1 from June 7th, and then, with some moderate changes, a version 1.1.1 from June 17th. So those versions get taken into account, and Zenodo mints a DOI for each one of these versions, and also gives you one more DOI that always references the most recent version. I can traverse from there back to GitHub, which I see right here. So this is my GitHub version, release 3, which as you can see has 102 commits, and there's another link back to itself that has the DOI and the ORCID iD and the license all listed there. You can see that this is a version here, but I can also go to rfun-flipped — just go back to the main page — and notice (actually, it's probably not that clear) that version 1.1.1 has 102 commits, but I've kept developing this code, so it's a living thing. This is not the archival copy; this is the code repository copy. It's up to 110 commits, so 10 things have changed since I versioned it, and it's kind of cool that I can keep track of all that. Going back to my slide: we talked about licenses, and there are a couple of things to point out. First, there are lots of different kinds of licenses. As a very, very general overview, a lot of people use the MIT license for software, which is basically a no-warranty, no-guarantee license. It says: yeah, I developed this software; yeah, you're welcome to use it; but if it crashes your computer, that's not my problem.
There's a CC0 license that licenses data for reuse. Again, if you have permission — if the data is your data — you can certainly license it that way. You're still going to want to put it in a data repository, so you're still going to want to check with Sophia if you're not sure how to do that. There are other kinds of Creative Commons licenses, with lots of variations; those would be more for documents and articles. You can still use copyright — lots of people do. The common de facto process is to give your copyright privileges over to your publisher. From a library standpoint, we would certainly like you to retain your rights if possible. We also have an Office of Copyright and Scholarly Communication in the library that is staffed with classically trained JDs — lawyers. You can check with them. They're not going to give you legal advice, but they will help you understand the legal landscape and how some of these things relate to your particular project. There's also a link — I think this link right here — to some articles on how to choose licenses. And this is just a table that gives you some idea of how to connect some of the concepts we talked about with some of the processes that put those concepts into practice. The one thing we haven't quite talked about yet is this idea of a compendium or a container. This goes a little bit more towards replicability, but just to give you a glimpse of where things are headed with reproducibility, there's this concept of a container, which is a live, exact copy of your project at whatever point you decide. So it can have hard-coded versions of packages from an earlier moment.
So for example, if we take that scenario where I submit a paper to a publisher and they send me back changes, and I make those changes and resubmit and it's accepted, I might version that in GitHub, mint my new DOI, and create a zero-install container that I can share with anybody. I'm going to go ahead and launch that right now if I can figure out how to do it. Shoot, I thought this had the... It might be, John, that you're not in presenter view, so it's not reading the link the right way. There we go. Thank you. So it's going to launch this container, which can take a minute. It's an RStudio container, not a Jupyter container — it's a little bit confusing, but we'll come back to that and see what it looks like. It is an exact copy of that version of my project at that point in time. And the advantage there — I'm going to set this up as sort of a what-you-wouldn't-do, right? If I had gotten feedback from my publisher and they said, your article is great, we accept it, I'm not going to then hand my laptop, with all my work on it, over to somebody else. I'm not going to hand it to Sophia and say, hey, Sophia, isn't this great? I got my article published. Why don't you kick around all of my processes and data and see if you can improve upon them, or verify something, or alter it? I mean, that would just be crazy. But in the case of a container, you can create a cloud instance that is an exact copy of what you submitted, and then somebody else can leverage their ability to add new data or alter your code, but they're not going to change the container. And you can see that this finally launched — it took about a minute. It can take longer: if the containers lay fallow for too long, they have to be rebuilt.
But regardless, it's probably a lot quicker than if I shared my repository with Don and said, go ahead and rebuild a compute environment, set the packages to the exact versions I used, and make sure you have the exact version of R, et cetera. This is just a much more efficient way to share a completely full-featured container clone of my work. So if I, for example, open up this .Rmd file, I can not only execute everything from that repository — which you'll see happen right here; it's running — but I can also make changes. So I can — I wonder why it said it abnormally ended, but it doesn't matter — mutate... Oh, this thing just crashed on me. I've never had that happen before — Murphy's law. I don't think this is going to run: foo equals cost times two. Let's Restart R and Run All Chunks. Yeah, there we go. And so if I slide over, you can see that I'm creating new data. I just created this variable called foo based on the data that I was already sharing. Anyway, containers are super cool. They're a little bit of a hassle to set up, but not horrible, and certainly highly functional. So if you want to pursue that, let me know and I can send you more information. And I think this last slide is yours, isn't that right, Sophia? Yeah, I've got a couple more things to show folks to finish off the workshop. So maybe I'll start sharing my screen — do you want to stop sharing, John? Yeah. Okay, hold on. All right. So as we think about these case studies — and John just presented one potential way to do that end-to-end reproducibility — I want to show you one more tool that uses a container on the back end to enable that cloud computational environment, and that's Code Ocean. Code Ocean supports a large number of coding tools, like R or Python, and quite a number of statistical packages.
So if you're using something proprietary like SPSS or Stata, those — and MATLAB, I believe — are also supported by Code Ocean. And Code Ocean is also, by design, an archival repository. So they've tried to bring together a lot of those different features, and it's also integrated with GitHub. So it has some of those other features for enabling reproducibility in this cloud environment, while also having some of the best practices in place for dissemination of code specifically. So I'm going to — sorry — I'm going to go over to Code Ocean. This is an example from a Duke scholar. Let's go back to this story we've been telling: you publish a journal article, and your publisher wants you to share your data and your code. There are a number of journal publishers that have integrated with Code Ocean. So in this use case, let's say you're reading John's article and you think, that's an interesting figure; I'd like to verify how they came to those results. Some journals have integrated with Code Ocean so that you would just click on a button, and it would take you over to what's called a compute capsule. And here you would be able to see and access both the code and the data, as well as the environment that the original researcher worked within. So again, just like that Zenodo archival repository we were looking at, it has metadata, so it describes what we're looking at. It gives the licenses that we were talking about, including the MIT license. It's fairly minimal metadata, but it's really the essential pieces that you need for publishing your code and data. I've already done this, but let me back up. You can also see here that I can click through and look at the code files that generate the figures. And they do have this master script, which John mentioned.
You know, if you have multiple script files, you just run the master script and it's going to process and run the sub code files in order. So I've already done a reproducible run, but essentially, with Code Ocean, it's really easy: you just hit Reproducible Run. I will note this is a tool that has a freemium model. Anybody can use it for free, but you only get a certain amount of what they call compute time. So if I click on this, it says that if I want to rerun this capsule, it'll take two minutes and 23 seconds. I think they give you 10 hours of compute time for free. So if you have really big data or really computationally intensive analyses, this may have some financial implications. But for a lot of smaller projects, you would stay within those 10 free hours. And then I think you get 20 gigs of storage for free. So I already ran this — we didn't have to sit here for two minutes and watch it — but essentially, this is what the output would be. We have all the code files, and here are the PDFs of the images that are included within that journal publication. So again, this is just another example of a cloud environment doing, as John said, the more state-of-the-art computational reproducibility, based on containerization of both your code and your data, in that archival repository space — which allows you to publish, get your DOI, and make that linkage between the article and the data themselves. So it's just another useful tool to know about. I did want to mention, going back to this idea of our reproducibility spectrum, that what John and I have been presenting today is one option, right?
And as John said, this is kind of more of the state-of-the-art workflow, end to end: making sure that you're not doing anything point-and-click, really building in version control, building in your scripting, using an open-source platform like R, using Binder. These are things at the far end of that spectrum, where your linked and executable code and data are in the cloud and somebody can more easily approach them. I just want to note that setting that up is going to have some time and overhead, and these are not perfect tools, right? I think these are still very much beta workflows that are being developed and are cutting edge, but there are still lots of ways that you can make your work reproducible without having to go to that extent. If it seems like you don't have time to build that in, take a step back and say: okay, maybe I'm already using Stata or some other program, and I'm generating my CSV and putting it into my program, doing some coding there, then taking it out, and I'm still using Word for generating my report. There's nothing necessarily wrong with that workflow, and there are still things you can do to make it more reproducible. And one of the big things is disseminating your data and your code and your supporting documentation. So this is an example from the Duke Research Data Repository where one of our key depositors has really built into his workflow the deposit of that compendium — that data, that documentation, that code — into, again, an archival repository. We don't have Docker running on the back end; we don't have that fancy compute in the cloud. But people can still get to this data. They can still see the code files.
They can still, if there's sufficient information, potentially set up that compute environment themselves, especially in communities that have well-established practices, where everyone is using the same software and similar packages. It may not be that much of an overhead to actually rerun the code itself. And if I scroll down, you can see we've got the metadata in place, and then we have the files. So you can still download the data. And this analysis was done in MATLAB; they have their MATLAB scripts for each of their figures. So I would say that this is kind of the bare minimum you should be doing if you want to reach a goal of having more reproducible research: at least, when possible, disseminate your data and your code in an archival repository. And then there's that next step of getting to that executable environment and thinking about how you might approach that with tools like Binder. Binder is nice, too, because it's open source. And I really do like the workflow that John presented, going from RStudio, which is open source, to GitHub, to Zenodo, which is supported by CERN — a scientific community and project originally out of Europe. So that's a really good workflow. But again, if you're working in other tools, if you're working in another environment and you're thinking, well, that's not going to work for me — what else can I do? — this is what I would encourage you to do: consider just sharing your data and your code and having that really robust documentation in place. So that was my little soapbox on that. I'm going to go back over to the slides, and we'll also have some time for a couple of questions. But again, if you're interested in just the data-sharing piece, the Duke Research Data Repository is available to you, and I am available to help you think about that.
And John and I are both available for questions on any of these topics, and to think through how this might be implemented in your context, for your type of research. We do have a few resources on these slides. We'll be sending the slides to everyone who attended today — I'll actually put a link to the slides in the chat right now; we have a copy in Box that you can access. Another resource I bolded here is the TIER Protocol. So if you're at the beginning stage of "this is a lot — how do I approach this for my project when I've barely done any coding?", the TIER Protocol walks you through how to develop a reproducible project. And I would again say that it's not state of the art; it's based more on foundational data management practices of building out your code files, building out your data, building out your documentation. So if you've never done reproducibility, I think it's a really good place to start. I'll put the link to the TIER Protocol into the chat as well, because it really guides you step by step through how to make that kind of compendium John mentioned. And finally — I know we're at lunchtime and I'm hungry — I'm going to put up a link to a follow-up survey; if you have any feedback for us, we'd love to hear from you. And again, we'll send all of this via email so that you have it for the future. So I don't know if there are any other questions, but we have a few minutes if people have questions. If not, thank you for attending today, and we hope to get to work with you all in the future.