We'll go ahead and get started with some of the introductory information. First, thank you for joining us today for this workshop on reproducible research practices. A couple of quick logistical notes: please feel free to add any questions you have to the chat, or use it for discussion during the lecture portion of the workshop. For any of you who are here for graduate student RCR credit, we're going to be distributing an attendance survey at the end of the workshop, so make sure you stick around for that.

To briefly introduce myself, I'm Sophia Lafferty-Hess. I'm a research data management consultant within Duke University Libraries. My role focuses on helping researchers plan for data management and care for their data during the active research phase, with an eye on potentially sharing and disseminating the data for the purposes of reproducibility or broader reuse. As part of my role related to data sharing and dissemination, I'm also part of the team that supports the Duke Research Data Repository, a platform the Libraries support that provides open access to data and code, either underlying a publication or from other types of research projects taking place here at Duke. So that's me. I'm going to let my colleague John Little, my partner in reproducibility, introduce himself.

Thanks, Sophia. Sophia and I work together in the Center for Data and Visualization Sciences. I'm a data science librarian, which means primarily that I teach a lot of workshops. My preferred tool for data analysis is R, but I try to teach it with a perspective of reproducibility, and that will come through today. The other thing I do is a lot of one-on-one consultations with people who are having challenges with R; it's mostly around constructing the foundation to do the analysis. I don't do a lot of consulting on research design. And with a team of other people in the department, I manage an open lab that anyone in the university can use, in person or remotely. It consists of computers that are generally more powerful than most people's laptops, which can help with any kind of data processing that has a memory bottleneck. A lot of times, if you're waiting on your computer to process, you could use our lab instead. Anyway, that's me, and I'll turn it back over to Sophia.

Thanks, John. All right. We'd like to start off with a brief land acknowledgement. This workshop isn't going to have a lot that's focused on social justice, but we think it's really important to take a moment to honor the land in Durham, North Carolina, so I'm just going to read this statement. Duke University sits on the ancestral lands of the Shakori, Eno, and Catawba peoples. This institution of higher education is built on land stolen from those peoples. These tribes were here before the colonizers arrived. Additionally, this land has borne witness to over 400 years of the enslavement, torture, and systematic mistreatment of African people and their descendants. Recognizing this history is an honest attempt to break beyond persistent patterns of colonization and to right the erasure of Indigenous and Black peoples. There is value in acknowledging the history of our occupied spaces and places, and we hope we can glimpse an understanding of these histories by recognizing the origins of our collective journeys.
So thank you all for allowing us that time and space to make that acknowledgement. To give a little more detail, John already introduced our department, the Center for Data and Visualization Sciences. We have a team that supports different aspects of data-driven research: staff with expertise in data science, GIS and mapping, data visualization, and data management. We're all available for consultations; you can reach us at askdata@duke.edu with any of your data-related questions, and if we're not the right group to help you, we're happy to connect you with others here at Duke. As John said, a lot of what we do is also workshops. We'll be giving you these slides with all the links after the workshop, but on our website you can find our upcoming workshops; we've only got a couple left as we finish out the semester. We also have an online learning page with links to many of our previously recorded workshops, slides, and other guides, so you can engage in asynchronous learning at your leisure.

All right, to move on to our learning objectives for today. We're first going to engage generally with the topic of reproducibility: what is reproducibility, and what is its impact on research and scholarly inquiry? We'll then delve more deeply into some specific practices you can implement in your research projects to enhance reproducibility. And then we'll identify some tools and resources that can support you in these practices.

We're going to start off with a brief icebreaker. So John, do you want to launch our first poll, about your coding experience? Okay, I'm going to go ahead and end the poll. If you haven't participated, now is the time, but almost everybody has, and I'll share the results. All right, thank you all for participating. You can see we've got a pretty good distribution of different amounts of coding experience in the room. I want to recognize that we did not expect anybody to come in with a certain amount of coding experience, so we've designed this workshop to focus more on the broad concepts and ideas, and again on resources and tools that can enable more reproducibility. If you're interested in learning more about a specific coding language or a specific implementation of these concepts, I'll do a quick plug: in a few weeks John and I are also teaching a workshop on designing a reusable workflow using R and Git specifically, and that builds on some of the things we're going to discuss today. But regardless of your experience, we hope we can give you some things to apply to your research practices.

Okay, we always like to start off with definitions so we have a shared understanding to build our conversation on. The terms reproducibility and replicability can be used in slightly different ways by different disciplinary groups or domains, so let's start with two definitions. These come from the National Academies of Sciences, Engineering, and Medicine, which put out a nice report on reproducibility and replicability. They say that reproducibility is obtaining consistent computational results using the same input data, computational steps, methods, code, and conditions of analysis.
Whereas replicability is obtaining consistent results across studies aimed at answering the same scientific question, each of which has obtained its own data. The way I like to think about this: for reproducibility, John goes and publishes a paper based on his hypotheses, with his findings. If I wanted to reproduce his results, I'd need all the details on his procedures, methods, and steps of analysis, and I'd also need access to his data and the code he used to perform his analysis. Then I can run his code with his data, following those procedures, and see whether I get the same outputs. Whereas for replicability, John publishes a paper, and I'd need enough detail on his methods, conditions of analysis, and procedures, but then I'd go out and try to test that hypothesis by gathering my own data, in many cases probably writing my own code unless there's a shared code base out there, and then see whether I can replicate those results. I just want to pause and make sure that differentiation is clear. We're going to focus more in this workshop on computational reproducibility, but many of the things we'll talk about are building blocks that can also contribute to better replicability in research and scholarship. I'm not seeing anything in the chat, so I think we're all on the same page.

So why should we care about reproducibility? Many of us have heard the term "reproducibility crisis." Nature put out a survey in 2016 where they asked scientists whether they think there is a reproducibility crisis, and you can see close to 90% said yes, either a significant or a slight crisis. Many studies over the last 10 years have also found that our research isn't necessarily as reproducible or replicable as we would like it to be. I think this other table is noteworthy as well, where they asked, have you actually failed to reproduce an experiment, either your own or someone else's? It's noteworthy how high some of those percentages are, not only for reproducing somebody else's findings but your own.

Building on that idea of how difficult it can be to go back and build upon, reuse, or rerun even your own code: a neuroscientist put out a challenge a couple of years ago to other scientists that said, can you go back to your ten-year-old code and see if it still runs? Suffice it to say, there were challenges, many of them around things like missing documentation, where they had to spend a lot of time doing detective work to figure out: what is the purpose of these files? How do they interrelate? What was this piece of code actually trying to accomplish? They also struggled with obsolete environments; in the scheme of digital technology and tools, 10 years is a really long time. Some of them got creative in trying to meet the challenge, so if you're interested in this topic, it's a fun read and I'd encourage you to check it out. Again, we'll be giving you these slides, so you'll have all those links. I also like the idea of reframing this topic not as a crisis but as an opportunity, especially as we think about the pace of science and how far we've come in even the last 80 years, as this quote points out; 80 years ago we hardly knew anything about cancer.
So I like to see this more as a call to arms for all of us, as scientists and as those who support researchers and the research community broadly, to think about how we develop the processes, procedures, workflows, and shared norms that allow for better reproducibility, and to see it not in a way that should cause panic but as something empowering to the scientific community. I think that positive frame is a much better way to look at this. And really, in the end, while there are broader altruistic reasons, and the rigor of our research is good for the institution and good for our disciplines, the biggest beneficiary of engaging in these practices is going to be your future self, because you'll be able to go back, understand what you did, and build on your own research, and you'll feel you had a rigorous framework for how you performed your analyses.

This is also not an all-or-nothing thing. To be upfront, it does take more effort and more time to get yourself organized and put these steps into place for your research to be truly reproducible. But again, it's not all or nothing, and that's why I like this idea of the reproducibility spectrum. In our current scholarly culture, even though I think we're moving along the spectrum, the main way we disseminate the findings from our research is through a research article, and that's not really reproducible in any way. As we move along the spectrum, it's about sharing additional things alongside our publications: maybe you put your code up on GitHub, or you share all of your code and data, or, farther along, you actually have that code and data in a linked, executable environment that facilitates reproducing the results. And at the far end is the gold standard of full replication.

All right, I've been talking for a little while now, and we want to give you all an opportunity to engage with this idea of reproducibility and replicability through a couple of case studies, so we're going to do a breakout activity. I'm sorry if people are burnt out on breakouts, but we do think it's nice for you to be able to talk in a small group, not just listen to John and me talk. Right now I'm putting a link to a Google Drive folder in the chat. Click on that link, and when you open it up you'll see Google Docs named according to breakout room: breakout room one, breakout room two, breakout room three. When you get put into your room, open the document for your number; if I'm in breakout room three, I open the document for breakout room three. In there you'll find a case study and a few guiding questions. We'll have around 10 minutes for this exercise: a couple of minutes to read the case study, briefly introduce yourselves, and have a short discussion on the topics there. If you have time, I'd encourage you to elect a reporter. We don't have a huge group today, so I'm hoping we'll have enough time for a good little conversation afterwards about what you discuss. So we'll go ahead and move on; let me go back to sharing my screen and move on to a few more of the practices for today. Sorry, got to get myself all situated. Okay.
We're going to use this reproducible project pyramid as a mechanism to talk through different practices for reproducibility and how these aspects of your research process build on each other. At the foundation is your data. Then you have your version control and project management processes; then scripts and literate coding for analysis and visualization; and then, ultimately, reporting and dissemination, with archiving as a process that runs throughout.

I'm going to start with what I think are pretty foundational practices. They may seem simple, but they really are the foundation you build these other practices on, and that's your general project and data management when it comes to organization and documentation. You want some kind of intentional plan for how you're going to organize your materials, ideally before you even start collecting any data or generating any code, so you don't end up in a situation like this. A few quick, high-level tips. It's usually recommended to keep all of the research files for one project in one file directory. For reproducibility, it's also usually recommended to organize your materials so that your data are separated from your code and from your results, so three different folders. Always keep a copy of your raw, unaltered data and don't touch it; one tip there is to put it into a folder, zip it up, and make it read-only. Keep track of the exact version of the data you used in your analysis, because there may be lots of different versions of your data as you go through that processing and cleaning phase. And develop a file naming convention and apply it consistently. One note: sometimes you'll generate lots of files that have a sequence or relationships to each other, so think about scalability and build in things like leading zeros.

For documentation, a couple of quick tips. First, create a readme file. If you're not familiar with readme files, they come from the software development world, but they've been adopted more broadly in the data sharing space. A readme is essentially a high-level document that orients secondary users to your data: what data and documentation are in your data package, how the files relate to each other, what processes you need to go through for reproducibility, all of those dependencies and environments, things of that nature. You may also document, in your readme file or in a separate data dictionary or codebook, both the context of your research and the content of the data themselves. I've seen it on multiple occasions where people post data, but it's just a spreadsheet with a header, columns, and a bunch of values. Without documentation that tells you what those columns or variables are measuring and what the value codes mean, you really can't interpret that data in any way or use it for reproducibility. And my last one here is comments in your code.
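To make those organization tips concrete, here is a minimal sketch in R of setting up that kind of structure. The folder and file names are just illustrative, not a prescribed standard:

```r
# Create a hierarchy of folders that separates data, code, and results
dirs <- c("original-data", "analysis-data", "command-files", "documents", "results")
for (d in dirs) dir.create(d, showWarnings = FALSE)

# Stub readme that orients secondary users to the package
writeLines(c(
  "Project: <title>",
  "Contents: original-data/, analysis-data/, command-files/, documents/, results/",
  "To reproduce: run the scripts in command-files/ in numbered order."
), "README.md")

# File names with leading zeros sort correctly as the project scales
sprintf("interview-%03d.txt", c(1, 2, 10))
#> "interview-001.txt" "interview-002.txt" "interview-010.txt"
```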
We're going to talk more about literate coding as a method for documentation, but the old-school and still very effective way is simply to make sure you have comments in your code so that people know what each chunk does. This goes back to that Nature article on the 10-year reproducibility challenge, where some of the researchers noted they hadn't even commented their code, so when they went back it really was gibberish.

All right, we're going to do another quick poll. John, if you could launch it: there are lots of tools you might use for organization and documentation in your workflow, and we're interested in what people are currently using. I think we're just about done, so if you haven't taken the poll, go ahead and do so. I'll give you one or two more seconds, and that should be it. I'm going to click end poll and share results. Okay, it looks like we've got a pretty good distribution across the different storage platforms, and a good number of people are also using Git and GitHub, which is great to see. As I said before, we're not going to focus on particular ways to use specific tools, though we'll give you a few examples along the way. The only thing I want to say about tools is that tools can be great; you just want to assess whether the tool is addressing a need for your project and helping to facilitate your process, because as we all know, learning new tools always has some overhead. So map your needs to the functionalities, or affordances, of the tool. Thank you for participating in that.

I just told you I'm not going to show you specific tools, but I actually am going to demonstrate a couple: one conceptual tool and one project management, storage, and dissemination tool. Project TIER was originally designed to teach students, particularly graduate students, integrity in empirical research, and it has a very specific protocol and process you walk through to build a more reproducible research project. I'm going to show you that in connection with OSF. The TIER protocol in particular is made for people who are new to reproducibility and want to practice these skills. I'll note they also have a DRESS protocol, so for any of you in the room who are more seasoned researchers and feel like you've already got a lot of good processes, there are some other things there you might consider adding into your workflow.

I'm going to work quickly through the TIER protocol. It has three key phases: first your pre-data phase, then your data work, then your wrap-up. The pre-data phase is very explicit about taking the time and intentionality to get yourself organized by constructing a hierarchy of empty folders. This is again that common recommended structure: a folder for your original data that's separate from your analysis data files, a folder for your documents or documentation, and a folder for your command files, meaning your script or code files. Then they tell you to actually spend the time to create three blank documents. Those are your documentation files.
That is, a metadata guide for any original data; your data appendix, which is essentially the data dictionary or codebook I mentioned before; and your readme file. Then you get into your data work, and while you're doing it you'll want to build in time both to document your original data and construct that data appendix, and to write the command files that process your data and the command files that generate your results. John will talk a lot more about coding and its important role in reproducibility, so I won't go into too much detail there. And then finally the wrap-up. This is the important step where you make sure everything is packaged well together and your documentation is complete, and then, if this is for reproducibility underlying a published article, you put it into a repository or archive where other people can access it. We'll talk about examples of dissemination and what that looks like later on.

We do have a public example of what an implementation of a reproducible package can look like, implemented in the OSF. I'll put a link to it in the chat so people can take a look; it's public. OSF is a platform essentially made for managing the different pieces of your workflow. It integrates with lots of other storage and organization tools, things like Box or Dropbox, so it can act as a project umbrella space for keeping all of your materials together, with the ultimate goal, potentially, of sharing them more broadly and disseminating them publicly. Essentially it's a project page where you have a built-in wiki, your file panes, and a log of all the recent activity that's happened in the space.

We've talked about the documentation piece, and this is an example of a readme file and what it can look like. We have all the information on the source of our data. If you use existing datasets, I'd always encourage you to include this kind of full citation for the data: for one thing, it gives proper attribution to the data producers, and it's also what's needed to make sure people can get back to that source data. Then we've outlined the contents of what's called the compendium. A compendium is a term now being used in the reproducibility space to refer to that package of your data, your code, your documents, and all the things that need to work together to reproduce the findings in an article. You can see these are essentially our folders: the documents, the original data, the command files, the analysis data, and then the outputs, the graphs. The second part is essentially instructions on how to use this compendium to reproduce the results. This is an exercise the Project TIER folks developed to help individuals who want to practice and work through this, and it's openly available if you want to try it out yourself. For our implementation, we used Stata version 14, so we give information on how to walk through that process and what each of the different command files does. So Sophia, there's a question about OSF in the chat in case you want to take it.
Yeah, the question is how much data OSF can hold. For private projects, you can have up to five gigs in a project, but once you make it public, the limit goes up to 50 gigs; they're really trying to encourage people to share. They changed these numbers relatively recently, so it is more limited for private projects, but yes, it's 50 gigs for public projects. Thank you for that question. For anybody interested in OSF, it's used a lot here at Duke, particularly by some of the social science and psychology communities, and we have our own institutional membership, so you can sign in using your Duke NetID. I'm one of our ambassadors for the OSF, so I'm always here to help people who are using it. One big way people use OSF, as an aside, is for pre-registration. If you're familiar with the idea of pre-registering a study, you register your methodology ahead of time, saying: these are exactly the methods I'm going to use to test X hypothesis. That's a big part of transparency and accountability and goes along with reproducibility, and OSF is one service that's really big in supporting those newer practices around pre-registration.

All right, I just wanted to show you a couple of other things here in OSF and this implementation of the TIER protocol. One is the data appendix, again just to make this real. We have essentially every single variable you'd find in the datasets, a full definition of what it's measuring, and full information on the coding. The TIER protocol also encourages you to include a frequency table and a bar chart for each variable. This is a really easy way both to potentially identify errors in your data and to give secondary users a picture of your data at a glance.

One other quick thing: as I said, OSF integrates with different services, and that's one way to get around the limits if you want to use it for bigger datasets, by integrating it with another storage platform. For instance, we have 50 gigs at Box, and OSF integrates with Box, so you can link a Box account where you're storing some of the other files and still have access to those files via the OSF project. That's what I like about OSF, especially if you're working collaboratively: you have that project umbrella that everybody can get to regardless of what platforms they're using. I see somebody saying they work with 100 to 500 gigabytes of data. Yes, for bigger data, some of these platforms are just not going to be a great solution. And I know, John can talk about it, that working with bigger datasets can be a challenge with Git and GitHub as well. So thank you for that, Larissa, and I apologize that this might not be the best solution for you, but we can always talk about other options. The final thing I want to show is my coding files: we have our commented code, where it outlines exactly what each chunk of code should be doing. All right. Again, I just wanted to quickly walk through what all the pieces I've been talking about might look like in a finished product. This is just one implementation of what a compendium can look like.
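If you wanted to generate those data appendix pieces programmatically rather than by hand, here's a minimal sketch in R; the variable name and values are hypothetical:

```r
# A frequency table and bar chart for one categorical variable,
# the at-a-glance summaries a data appendix includes for each variable
df <- data.frame(
  employment = sample(c("employed", "unemployed", "retired"), 100, replace = TRUE)
)

freq <- table(df$employment)        # frequency table for the appendix
print(freq)
barplot(freq, main = "employment")  # quick bar chart of the same counts
```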
We're actually going to show you a couple of other examples during this workshop, particularly focusing on that organization and documentation. You're seeing the same slide deck; about halfway through, it starts "a tale of two repositories." That's just my way of giving an overview of what I'm going to talk about, and really it's this idea that archiving should pervade your whole reproducibility workflow. But archiving is not as simple as "I'm just going to put everything in GitHub." I like to make that comment; we're going to talk about Git and GitHub if you're not familiar with them. It's also not as simple as putting it in Box or Dropbox. There are other kinds of repositories too, for example the Duke Research Data Repository that Sophia helps support, and they have different purposes. You can synchronize them, and I want to bring that out up front and talk about it as I go along.

Going back to this pyramid graphic, one thing I want to comment on is that archiving pervades all of these layers. I also want to note that the graphic can make it look like the foundation is a first stage, then you move to the second level, then the third, all very linear. I'd encourage you to think of this graphic as possibly having an elevator in the pyramid, because we're going to go between floors as I talk. It's not meant to suggest that in your research process you won't ever readdress earlier layers; it's just a way to think about different aspects of reproducibility.

So I'm going to start with my main suggestion about reproducibility: when you approach your workflow, an ideal best practice is to automate everything with a script. Do as much as you can with scripts and code. That way, the script becomes the orchestrator that defines the steps as a linear process. Usually those steps aren't entirely clear as a linear process until you get to the end, to reporting and archiving, but you want to have the ability to press play by the time you get there so that you can actually reproduce the process, and that is best done with something like a script. One thing we know as we move forward in the larger computing revolution is that GUI tools, WYSIWYG tools, things that require a lot of mouse control, are very hard to document, whereas command line tools, scripts, and coding are much better at orchestrating the workflow processes of computational thinking.

Some of the things you'd want your script to control include the data cleaning process: you're going to ingest the raw data, maybe add in additional ancillary data, and then reshape that data. In fact, according to a New York Times article from about eight years ago, data cleaning can be, and often is, as much as 80% of your project, so that's something to pay attention to. Version control can always be managed alongside the script, and there are tools that make that process easier; I'll talk more about version control. About the analysis and visualization I'm going to make very few comments; mostly the analysis is up to you.
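As a minimal sketch of what a script-driven cleaning step plus a press-play orchestrator can look like in R: the file names and column names here are hypothetical, and this assumes the readr, dplyr, and tidyr packages are installed.

```r
# command-files/01-clean-data.R
library(readr)
library(dplyr)
library(tidyr)

raw <- read_csv("original-data/survey-raw.csv")    # ingest raw data; never edit the original

clean <- raw |>
  select(-net_id) |>                               # scrub a direct identifier
  left_join(read_csv("original-data/county-codes.csv"),
            by = "county") |>                      # merge in ancillary data
  pivot_longer(starts_with("wave_"),               # reshape wide waves into tidy long form
               names_to = "wave", values_to = "score")

write_csv(clean, "analysis-data/survey-clean.csv")
```

```r
# command-files/00-run-all.R: the orchestrator that makes "press play" possible
source("command-files/01-clean-data.R")
source("command-files/02-analysis.R")
source("command-files/03-figures.R")
```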
But of course, if you can manage the analysis with a script, then once the project is done and you move forward in time, say six months from now, and look back at it, it's a lot easier to tell the analysis flow. As opposed to, for example, and I don't want to hate too much on Microsoft Excel, but anybody who's done a semi-elaborate project in Excel and set it aside, maybe only having to address it once a year, knows the experience: you go back and look at that Excel sheet, and it's a tool that was designed long before this notion of reproducibility came to the fore, and it really doesn't have any reproducibility features. So you look at that old Excel project and have a very hard time figuring out what steps you took to achieve the results in the spreadsheet. Ideally, you can use scripts to also create report derivatives, anything from an article to a web page, and we'll talk more about that. And then of course archiving, which may not be managed entirely by scripting, but you'd want the whole press-play process, from step zero to the end, archived in some suitable container.

I'm going to talk through a specific example now, and that's because I'm an R programmer, but I quickly want to back out of that, because the first message I want to send is that this is not about R, and it's not about brand-name GitHub or brand-name GitLab. There are generalities in that pyramid that R happens to enable quite well, so that's a comparative advantage of R, but you can use other tools. I'm not saying don't use Python; Python does many of these things as well, and R and Python are two good examples of scripting tools that help you control the workflow. So a workflow could be R plus Git, with Git for version control, plus, as those icons represent, GitHub or GitLab or Bitbucket. You could use any one of those, or in fact all three. The important part is really the version control piece, so I'm going to recommend Git. Git seems to be very popular right now, but there are other version control tools. And in many cases you'll ideally also use the features of a social coding hub like GitHub, but you don't even have to. That's the central message: Git and GitHub are not the same thing, but they work really well together, and together they help you manage the versioning of your workflow.

Talking about GitHub: I saw in the poll that about half of you are using Git, and that is very encouraging to me. There were other tools in there like Box and Dropbox, and again, it's not my intention to say that not using Git is wrong; any kind of file synchronization tool is a good practice because you can roll back in time. But I want to highlight the features of Git and GitHub because I think they're more robust and extensible and lend themselves to a wider array of reproducible workflows. To do that, I'm going to use Alice Bartlett's slide deck from a presentation she gave at a workshop or seminar back in 2016. The next 15 or so slides are lifted from Alice's deck, which is about 100 slides long; you can click on that link, Git for Humans, or Google it and look at it yourself. So: Git and GitHub are not the same thing. What is Git?
Well, technically it's an application, like a web browser or a word processor, but I want to take a moment on that concept. Web browsers and word processors were designed to be easy to use: they have nice interfaces, and they often leverage the mouse. Git is not like that. Git is, at its core, an unfriendly command line tool, and that can present an initial stumbling block. But what it does is help you manage the workflow of your project at the folder level, and we'll talk more about what that means; the folder level is a core concept. That unfriendly command line interface is unfriendly because it was designed to support what is essentially the world's largest, most successful open source software project, Linux. That project spans time zones, spoken languages, and probably multiple computer languages; I don't know the details of how Linux was created, but Git had to be a bulletproof way of supporting that worldwide collaboration, and it had to make it possible to manage conflicts within it. It does that, and even though it supports a very large project like Linux, it can scale all the way down to a very small single-person project. It's that extensibility that can make the first-time use of Git a little frustrating, because again, it's unfriendly and command-line driven. The good news is there are lots of clients you can use to make Git easier, and in fact, for a lot of people there are really only about four commands you need at all. After a small amount of repetition, those commands become second nature and you don't really have to think about them. We're going to talk about some of that jargon and explore in a little more detail five things that Git enables you to do.

Thing one: it tells the story of your project. It allows you to take a snapshot of a particular folder, and that folder, once it's under version control, becomes what's known as a repository, or repo for short. Those snapshots are called commits. To see how that works, look at this picture: in the middle is an example of manual version control driven by changing the file name. You have logo-1, a vector graphic file; then a variation called logo-2; then maybe you send it to your colleague Monica for feedback, and now you have logo-3. Somewhere in there you conclude that this is the final logo, and then, by Murphy's law of final logos, if you named it final, it's not going to be final; you're going to have final-1. Pretty soon you end up with a large collection of files where it's difficult to determine which one you actually want to use. Version control with Git lets you keep the same file name and track all of those changes, and each one of those changes is a commit. A commit saves basic information about when the change was made and who made it, and it gives you the opportunity to write a commit message that shares the why: why did I make this change? And that commit represents the state of the entire folder, not just the one file.
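To put that handful of everyday commands on the screen, here's roughly what day-to-day Git looks like at the command line. The file names, messages, and the short hash are just illustrative:

```
git init                        # put the current folder under version control (a repo)
git add logo.svg                # stage a changed file
git commit -m "Darken logo per Monica's feedback"   # snapshot, with the why
git log --oneline               # the story of the project: hashes and messages
git checkout 1a2b3c4            # time travel to an earlier commit by its hash
git branch experiment           # branch off to experiment safely
git merge experiment            # adopt the experiment back into the main branch
git push                        # send your commits to a remote such as GitHub
git pull                        # fetch a collaborator's commits from the remote
```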
So here's a commit message which is, in my opinion, rather wordy, but very helpful; there's nothing wrong with a wordy message, and it can help you when you look back. It says research showed that many people did not spot the links; this is for a web page, so they updated the links and determined that the page performed better. All right, so two jargon terms: repository, or repo, and commit.

Knowing what a commit can do, that snapshot in time, lets you understand thing two, which is time travel. This graphic across the middle represents linear time, and each of these dots represents a commit. Each commit is associated with a point in time, in this case May 20th, June 20th, and September 20th, along with a brief commit message. Each of those commits has a unique identifier associated with it, called a hash. Depending on the tool you use, you may never have to deal with the hash at all, but if you did, it's a really long alphanumeric string, and in most cases you can get by with just the first few characters. You can use that hash to go back in time. In this example, let's imagine each of these commits is a month apart. Say that six years ago, back in 2016, you deleted an icon that you now want back: at the time it seemed superfluous, but now you remember it more fondly and wish you hadn't deleted it. In a version-controlled repository, even deletions never really disappear, because you're telling the whole story of the project. The icon disappears from the visible file system, but if you go back and mine the commit messages, you can see: oh, back here on the 20th of May 2016, I deleted an icon. And then you can go back to that point in time. Everything forward of that point becomes invisible, or more to the point, you can look at that moment as if it were the present, and manipulate the state of your project as it was then.

Thing three, which is really great, is that Git enables experimentation. In this graphic we have a main repo, and through managing my permissions, that's the thing everybody's going to see. It could be a number of things: a web page, a thesis you're going to turn in to a committee and an advisor, an article. Let's say you've finished it, and this is what you want everybody to look at, here in the main branch. But now that you're done, you have a little time to imagine variations on your project, and you don't want to mess up the thing you're about to turn in. That's where a branch comes in: with a branch you can make a copy and do some experiments. You can see in this example they've moved along five commits and made five changes. Effectively there are two versions of the project right now that I, as the editor or author, can see: this is the one I'm going to turn in tomorrow, and this is the one where I'm just imagining, what if I did these other things to my project? And the good thing about branching is there's no limit.
So in fact I can do this wild experiment over here where I take, say, all of the bold words in my thesis and make them bright pink. I put that wild experiment out there just to see: well, what if I did this? These branches don't have to go anywhere, right? I can go back to the last entry on my main branch and keep going, making no reference to these branches, and in that case those branches become orphaned. But if I like the work I did, for example in this blue "add new styles" branch, I can merge it back into my main branch, and each of those commits just gets adopted into the time flow, and now this is my new version. Maybe I've decided this is in fact better, and this is what I'm going to turn in tomorrow to my committee. Time is linear and keeps going forward, but if I had to go back and refer to an earlier point, maybe because I made a mistake in there that I didn't realize, it would be easy to go back and make that the point I branch off of and move forward from. So branching is a really excellent feature, and I use it a lot in the workshops I teach, because I repeat several workshops each semester. Often at the end of a semester I'll make a branch for the next semester and see what kinds of changes I want to make, and if I like the changes, I'll adopt them into the main branch.

All right, thing four about Git: it helps you back up your work. Everybody knows you should have backups, ideally in two distinct locations, but that's sometimes not so easy to do in a personal computer workflow. If you're using Git, it's a natural part of the process, because you can create what's called a remote, a cloned copy of your work that's under Git, and push it to that remote, which most popularly is something like GitHub or GitLab or Bitbucket, entities in the cloud for which you can maintain the permissions. Then you have two distinct copies. Other things can happen on that remote too, like project management, file or project distribution, and group collaboration; there's a whole bunch of other tooling in GitHub and GitLab and Bitbucket. Most people who use Git will use both, but you don't have to. If you're under privacy constraints, you can skip the GitHub part entirely, and depending on your own network, you can run local copies of these tools. In fact, Duke runs a local instance of GitLab. So you may be able to manage some of your privacy constraints with a different install; there's no reason these things can't be installed on more controlled networks.

All right, let's talk a little more about backing up your work and the other affordances you get when you push to a remote like GitHub. I'm going to introduce three characters: this character is Terry, this character is Finn, and this is my GitLab remote. Terry has a project and pushes it up to the remote. Then we get Finn involved, and for Finn to be synchronized with Terry's two copies, we do what's called a clone; specifically, Finn is going to clone the project. Now, as you can see, they all have the same synchronized versions of the project. And then Finn is going to make a change.
You can see that Finn's branch is one commit ahead, though it could be several commits ahead. Either way, eventually Finn is going to push those commits back to the remote, and then the remote and Finn are synchronized but Terry is at least one commit behind. So all Terry needs to do to synchronize is something called a pull. Okay. Not only is that an example of having multiple copies of your project, in this case three copies, which works well in case of data-loss disasters, but once you put it up into that GitHub remote, you get all of those collaboration features beyond file synchronization: wikis to help you with documentation, Kanban planning tools like Trello to help you plan, issue trackers to help you figure out what needs to be done next. All of those things are built into these kinds of tools. And I do want to contrast Git, in one way, with tools like Dropbox and Box, which are very nice tools, but they are file synchronization tools. While you could take a single document and roll it back six months quite easily, what's harder to do is go back six months and see the state of all of the files in that project at that time. That's something the extensibility of Git makes very, very easy, and with GitHub it becomes even easier.

All right. As Sophia said, in about two weeks she and I are going to do a workshop specifically on the hands-on, very practical question of how to use Git with RStudio and GitHub: what you need to configure, and so on. Like I said, Git has an initially unfriendly learning curve, and we offer that workshop as a way to help you get beyond it. I also want to note, for whoever asked earlier about data sizes, I think it was Larissa: there are extensions to this whole process, like Git LFS, and there's an rOpenSci project called piggyback, that let you store your data in other places and still take advantage of Git's version control features for your reproducibility.

Now let's talk about data, that first layer of the pyramid. John, we have one other question about Git before you move on, about whether there are local Duke versions of Git on the PRDN or PLAN; I'm not sure I'm familiar with PLAN, but in the secure computing environments supported by Duke. Do you know if there are instances? I don't know if there is an instance of GitLab on the PRDN. I'll say two things. One is that you don't absolutely need those social coding hubs like GitLab to use Git, so you can certainly install Git on the PRDN. And whoever the network administrators of the PRDN are, I think that's a good question to put to them. I can work on answering it outside of this workshop, but I don't know the answer right now. I can't think of a reason why a tool like that couldn't be installed within the PRDN, so it should be possible, but somebody in OIT would have to administer that feature, and if it's not there, it's simply a matter of researchers asking for it. That's what I think; if somebody knows otherwise, please correct me. Like I said, I don't have the answer right now.
After this workshop is over, I'll try to verify that, because I feel like it absolutely should be possible, but "should be" doesn't always mean it is.

Okay. On this slide about data, I want to talk about some best practices and reproducibility ideas to consider when you're managing your data, that first layer. One is that you want to generate and manage the data with code; you're going to hear me say that a lot, manage it with code. Use your code not only to ingest the raw data but to do the data wrangling, the data normalization, and the merging with ancillary data that maybe doesn't come from your experiment but helps you. Next, you're going to want to protect any personally identifiable information, so you're either going to be on the PRDN or you're going to do some scrubbing of your own. Certainly when you're using coding tools, scrubbing things like NetIDs or Social Security numbers becomes quite easy; it's just a small bit of code.

A comment on large quantities of data: sometimes there are good reasons to put voluminous data in databases, and I don't want to suggest that anyone should not use a database; databases are incredibly helpful tools. Coming out of an era when disk space was much more expensive, relational databases in particular let you structure the data to avoid redundancy, so you used smaller amounts of space. Flash forward to 2020, and those issues of rare and expensive disk space are largely gone; what's grown is the amount of data we use. So you may in fact want a relational database, or some NoSQL or other kind of database. But note that the more you rely on a database, the more complexity you have, and you need to attend to the administrative responsibilities of the database application as it holds your data. There are cloud tools and whatnot, but the more complexity surrounding your data, the more exposure you have to data loss. You definitely don't want to just willy-nilly put something into a relational database thinking the relational structure makes your data better. In fact, as is often the case with reproducibility, the simpler the better: a container like a comma-separated values file is going to be better than an Excel file and may in fact do everything you need, without a database at all. All of this depends on the realities and specifics of your project.

I'm going to talk a little about tidy data in the next slides; it's something I recommend coming out of the R and tidyverse world, though there are other ways to manage your data. Basically you want your data wrangling, and your data, to conform to certain reproducibility best practices. Good tools for data wrangling include OpenRefine, a free, open source tool with reproducibility features built in, and of course any kind of programming in R or Python; less so tools like Excel.

All right, I want to briefly talk about tidy data as one opinionated container that works really well for lots of kinds of data, though not for everything. If you want to read the academic articles on tidy data, you can follow these two URLs.
Essentially, the concept is that every variable is a column and every row is an observation, and the intersection of a row and a column gives you a data value. The idea behind tidy data is basically that you're going to have long data, so there's going to be a lot of redundancy, and that's going to enable iteration in a more efficient manner. A perfectly legitimate alternative to long data is wide data, and almost all the data manipulation tools I'm aware of can pivot from long to wide and back, so sometimes you're going to change the shape of your data. If there are Python users in the audience, many of you will be familiar with pandas, which basically comes out of the idea of panel data, which is essentially wide data. I'm not going to claim I know enough about your project to tell you that you should use tidy long data; maybe panel data is the better tool. I just want to bring out some of the features, and note that your data should have some discernible semantic meaning in the data frame, so that if it gets separated from the codebook somehow, it might still be possible to reconstruct what the data are for.

I also want to contrast tidy data with untidy data by showing a common kind of Excel spreadsheet. What you'll see in this spreadsheet is really a mixture of data-as-container and data-as-report: we have a header up here, a table here, a summary table here, another summary table here, and some combination of wide and long observations going on. I haven't broken down exactly what's happening in it, but your computer doesn't want all of that extraneous summary and color; computers are far better at picking out data patterns than we are. So the idea is to not constrain your data with all of those features, but to put the reporting features into your report; I'll talk more about that later. One more time: I recognize this is an opinion. People who have spent time manipulating data in the tidyverse will recognize that this is not a one-size-fits-all recommendation. It's just a recommendation that builds on reproducibility best practices, and we think it works well for many, many projects.

All right, going back to that pyramid, there was the analysis layer. I'm leaving the analysis entirely up to you; I'm not going to comment on how your analysis should be reproducible, just on the other aspects of workflow that surround your project. As you'll recall, I mentioned the best practice of doing your work with code, so now I want to talk about the concept of literate coding. In literate coding, you combine your prose, your natural language, with your code and your visualizations; I'll show you an example in a minute. The idea is that you can have multiple code chunks, and in between those chunks you use plain language to describe what's happening in the analysis. You can include things like executive summaries, and then use the structure of something like Markdown (although there are visual tools as well) to clarify the structure of the document and to explain not only the analysis but perhaps the data too, so you could incorporate concepts like a data appendix or a data codebook.
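To make that concrete, here is a minimal sketch of what such a literate document might look like in R Markdown; the title, file name, and chunk contents are just illustrative:

````
---
title: "Survey analysis"
output: html_document
---

## Executive summary

Plain-language prose sits between the code chunks and explains the analysis.

```{r load-data}
clean <- readr::read_csv("analysis-data/survey-clean.csv")
```

The summary below is recomputed every time the document is rendered, so the
report can't drift out of sync with the data.

```{r summarize}
mean(clean$score)
```
````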
One of the advantages of literate coding is that since you're writing natural language, you're more likely to fully explain what's happening in the workflow. I'll contrast that with the entirely legitimate but old-school way of commenting your code, where the comment always begins with a hash mark. What we know about that process is that it tends to promote cryptic explanations, because it's not an easy way to write or compose. Still, it's better to use hash-mark comments in a standard script than to not comment at all. I want to introduce literate coding to you because, among other things, not only can you keep all of the explanation and analysis together, which lets you run the whole thing from point zero to the end as a single project, it also lets you take advantage of change. If the data change or you get new data, you can run that script again, and then you can render different outputs depending on your needs: slides, web pages, PDF documents. That rendering of different kinds of outputs from literate coding is, in my opinion, somewhat easier, maybe a lot easier, in the R context than in the Python context, but either way it can be done.

Here's an example of a literate coding notebook; this one is a Python notebook. You'll see some structure, a first-level header here; you'll see prose here and here; you can integrate LaTeX, formulas that explain your analysis. This first gray code chunk and this second gray code chunk are examples of that interspersing of explanation with analysis, and the inline outputs and visualizations are all part of the notebook. This can be done in either Jupyter notebooks or R notebooks. And like I said, once you have it all together, you can render different kinds of outputs depending on your needs: render the journal article first, then render slides from the exact same analysis. The less you have to readdress the analysis, the less likely you are to make typo mistakes in your reporting.

All right, transitioning to the visualization aspect, and again a recommendation to use code tools to generate your visualizations. This comes out of a very influential book called The Grammar of Graphics, written by Leland Wilkinson back in 1999; he actually just passed away in December. The book describes in detail how you can use a grammar to describe your visualizations. Three good examples of visualization tools built on the grammar of graphics concept are ggplot2, mostly used in the R world; Altair, mostly used in the Python world (either can actually be used from the other language); and Tableau, a standalone visualization tool that has a lot of reproducibility features built in and also has a notion of the grammar of graphics. The concept is that you have a grammar that lets you describe, for example in this scatter plot, things like the data elements, how to derive colors from categories, the scale of the y-axis and the y-label, the scale of the x-axis and the x-label, and the legend.
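Here's a hedged sketch of that grammar in ggplot2, using R's built-in mtcars data; each line maps onto one of the pieces just listed:

```r
library(ggplot2)

# A scatter plot spelled out as grammar: data, mappings, geometry, scale, labels
ggplot(mtcars, aes(x = wt, y = mpg, color = factor(cyl))) +  # data and aesthetic mappings
  geom_point() +                                   # the geometric mark
  scale_y_continuous(limits = c(10, 35)) +         # scale of the y-axis
  labs(x = "Weight (1000 lbs)",                    # x-label
       y = "Miles per gallon",                     # y-label
       color = "Cylinders")                        # legend title
```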
All right, I've actually already touched on this, but one other aspect of literate coding is that ability to generate multiple and various kinds of reports from a single analysis script. For anybody interested in that, if you look in the online learning section of the CDVS webpage, you can find a video of an R Markdown workshop that explains all of this in detail.

All right, nearing the end here; let me just check the time. I think I have about five or six slides left. Let's talk about archiving. So I spent some time talking about Git and GitHub as a way to use a code repository, and I want to note, if I didn't say it before, that a code repository using Git is not just for code. Git lets you manage version control for all of the files on your file system. It doesn't work as well with really large files, but generally speaking, for most projects, it will happily manage a PDF file, a Word file, a CSV data file, and a script all sitting right next to each other in the same project. But that code repository, along with the social coding hub around it, is mostly for file distribution and collaboration. What it's not aimed at supporting is the cycle of scholarly communication: the formal process of generating articles, for example, where you have milestones in your research project at which you formally express what happened and what your findings are. You can certainly capture that in your Git/GitHub repository, but the thing about GitHub is that they're really not all that interested in the cycle of scholarly communication, and they're not promising any longevity. What that means is, while unlikely, the next Wall Street acquisition could have, say, Netflix buying GitHub, and then who knows what would happen. It's a really unlikely scenario, but it's worth noting that you don't want to pretend GitHub is an archival repository. You actually need to use an archival repository for those purposes. For example, there are lots of disciplinary archival repositories, often associated with academic societies, and there are institutional ones like the Duke Research Data Repository. Sophia can help you figure out which repository might be an appropriate target for you. Those repositories are aimed at preserving the milestones of your formal academic progress. So think about it this way: there may be a point where what you've got in GitHub is a milestone you cut off and say, okay, this is version one of my research, and this resulted in an article.
You can synchronize that very simply with archival repositories. You'd package that milestone as a compendium, your code plus your visualizations plus your data, and that compendium can be ingested into an archival repository. You can associate it with a digital object identifier (DOI), a unique identifier for the code, just as you can have a DOI for the article and for the data, and you can associate all of it with an ORCID, which is an author ID. All of that makes it easier to pull the pieces back together, or to attribute them, going forward. So archival repositories are super important.

The last thing I want to mention is that once all of that is in either kind of repository, someone could then do reproducibility experiments, or even attempts at replication, by gathering your work from that milestone. But there's a challenge this presents: you still have to reproduce the compute environment from when that milestone was created. The more documentation you can provide in those archives, the better. Here's an example of what I'm talking about. Suppose I wrote and published an article in 2016 and deposited the materials into an archival repository. If I go back to replicate that work, I need to make sure I have the R version I used in 2016 and the package versions I used in 2016 in order to reproduce the analysis. That's totally possible to do, and not terribly strenuous, but it is a task you have to attend to: if you're investigating the reproducibility of a previous project, you have to recreate that compute environment.

A response to that, one that is not science fiction but possible right now, is to use containers. Many people have probably heard of tools like Docker or Singularity. Containers are used for more things than what I'm about to describe, but for our purposes a container is a zero-install, cloud-hosted compute environment. Imagine I did all my computing on a laptop back in 2016, and when I hit that milestone, I stuck the laptop into a computer freezer. Now it's frozen: it hasn't changed, it still works, and it has the operating system, the programming language, the packages, and all of the code inside. You could check out that frozen computer, do some work on it, and when you're done, push it back into the freezer, and it returns to the state it was in when I left it. That's essentially what containers do. You can assign permissions, and they let you check out a cloud version of a compute environment that has the code, the data, and the scripts. You can introduce new data or alter code and see how things change; what you can't do is change that frozen point in time when the compendium was ingested into the container. Containers are really neat; I've set up several. They're also a little fragile because they're still so new. I set up several containers last summer and, at the moment, I don't know what happened, but none of them are working. The good news is that the archival repository and the code repository are still working just fine, and I'm sure that if I go in to troubleshoot, it will take me 30 minutes or so to figure out and it'll be fine.
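On that point about documenting the compute environment: short of a full container, there are lightweight things you can do in R to record the environment at a milestone. A minimal sketch follows; note that renv is just one common option, not something this workshop prescribes:

# Record R version, OS, and loaded package versions alongside your compendium
writeLines(capture.output(sessionInfo()), "session-info.txt")

# Or, with the renv package, snapshot exact package versions to a lockfile
# that a future you (or a reviewer) can restore from:
# install.packages("renv")
renv::init()      # set up a project-local package library
renv::snapshot()  # write renv.lock recording exact package versions
# later, on another machine:
renv::restore()   # reinstall those exact versions from renv.lock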
What I anticipate with containers is that, as we move forward, they will become more stable. So if that's not of interest to you right now, at least file away the concept so you can take advantage of it in the future.

All right, some related topics. We talked about DOIs: you definitely want digital object identifiers, to the extent possible, so that people can reference the different components of your project. If you don't already have an ORCID ID, let us know and we can help you get one. The basic idea is that you have a unique author ID you can associate with your work, which makes it a little easier for people to disambiguate your work from others'. For example, if you have a name like Eric Smith, or even John Little (it surprises me how many John Littles out there are publishing), an ORCID ID makes it easy to distinguish me from the others. And again, you can use container tools like Binder or Singularity to publish a cloud-hosted container of your work.

Now I want to briefly talk about licensing. Through the history of copyright and scholarly publishing, it's been very common to yield your copyright privileges over to the publisher. But there has been a move afoot over at least the last 20 years, and it's gaining steam, toward using different kinds of licenses. There are links to three articles right here that will help you figure out how to choose a license. I also want to note that the library maintains a Department of Scholarly Communications, which in the past has had as many as two JD-credentialed lawyers who are experts in copyright. They won't give you legal advice, they don't serve in that capacity, but they can help you understand the copyright and licensing factors that will influence how you want to go forward. What I'm going to do here is just mention three ways of licensing your work that support openness. These articles will give you more examples, but if you're sharing code, it's often recommended to use the MIT license for software. It's a very portable, widely understood license, and it basically says: no warranty, no guarantee. You can use my code, but if your computer breaks, that's not something you should be talking to me about. That's for software. For documentation, everything from the codebook to a formal article, you can license things with a Creative Commons license. There are several variants: CC BY simply says anyone can share this work, but please attribute me as the author. And CC0 is a way of dedicating your dataset to the public domain, so other people can use it without restriction; it doesn't legally require attribution, though citing the data you reuse remains a scholarly norm. So those things are very important. I know we covered a lot. These concepts on the left all have tools and implementations on the right, and you can review those as you see fit. And with that, I'm going to turn it back over to Sophia.

Yeah, so, John, if you want to stop sharing, I can share. I just put a link in the chat to the attendance form, so please do click on that and fill out your attendance. When you click the link, you'll be asked to log in through Shibboleth.
Then it'll ask for a couple pieces of information, and you should get a confirmation email. Sorry, I was trying to do too many things at once and my brain wasn't processing.

Okay, so now that we've done the attendance, I have just a couple of final closing pieces of information and case studies to present. First, to return to this idea of the reproducibility spectrum: as John said, we've thrown a lot at you in this workshop, and part of this is starting to approach your work with that reproducible mindset. If you're at the very beginning of learning coding languages, or just grappling with what it means to make your work reproducible, I would say start by building in those essential organization and documentation practices. Then layer in more: learning a coding language, figuring out how you'll share your code and data, and keeping an eye on the more advanced practices like literate coding techniques or tools such as Jupyter Notebooks. For those of you already using Git and GitHub, it may be about thinking more consistently about your archival strategy, or starting to explore containers. Wherever you are on that spectrum, I want to encourage you to take small steps and engage with these practices incrementally, because it can seem a little overwhelming at first, and I know we all have lots of things we're trying to accomplish in our scholarly careers.

A couple of other quick things to show you what implementation can look like across this spectrum. First, we've mentioned the Duke Research Data Repository multiple times. This is a dataset within the Duke Research Data Repository that I would call one of those compendia: it underlies an article publication. You can see we have all the key information John just talked about. We have the metadata that's used for discovery of these materials, and he mentioned a DOI; all datasets here get a DOI. It's that snapshot in time of what the data and code looked like underlying this article, which is actually still in review, so this is fresh off the presses. You can see they have a README file with that high-level documentation, a folder that has all of their scripts, and a folder for all of their data (a generic layout along those lines is sketched below). This is relatively simple but, in my opinion, very powerful: computational reproducibility at its most fundamental level is putting your data and your code out there, right? And like I said, we have a tool here at Duke to help you, and support staff to help you. If you decide to submit your data and make it publicly available, we have a curation service: we take a look at your data and your code, check whether everything looks good and whether we can understand the README file, and we may have suggestions after putting that second set of eyes on the material. So you're not going it alone if you go through the Duke service.
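For what it's worth, a generic compendium layout along those lines, with a master run-everything script like the one in the next example, might look something like this. All names here are hypothetical:

my-project/
  README.md        # high-level documentation: what this is, who made it, how to run it
  LICENSE          # e.g., MIT for code, as discussed above
  data/            # raw and processed data files
  scripts/         # the analysis code
  output/          # figures, tables, rendered reports

# run.R: a master script that reruns the whole analysis in order
source("scripts/01-clean-data.R")
source("scripts/02-analyze.R")
source("scripts/03-make-figures.R")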
Another platform I want to mention, again on that continuum, is CodeOcean. CodeOcean has coupled some of the archival features of a repository with a computational environment by building on containers; it's built on Docker. So again, this is a dataset with code, and it has that metadata I mentioned describing what you're looking at: you have your DOI, your citation, your licenses. All of that essential metadata is what's required to follow best practices for data sharing and publishing, particularly when the materials underlie an article, so you have that stable version. But what CodeOcean adds is a computing environment running on the back end. This article used R, so they have all of their R code, including the code for their figures, and one master script that you run. I've done this run previously: you click "Reproducible Run," and it runs these code files alongside the data in a computing environment configured by the authors. Then you get your outputs, which are all of these PDFs here on the side, including the graphs and materials that appear in the final published article. Another nice thing about CodeOcean is that they've been working to integrate with other computing environments, so things like running in RStudio or in Jupyter Notebooks. Again, this platform is still being developed, and there are some caps: they have a premium model, and Duke is not a member institution right now. But as a researcher with a home at an institution, you get a certain amount of free storage space, I'm blanking on the exact amount right now, but I can always talk with you about it later, as well as a certain amount of compute time. So this is just another implementation of what a compendium can look like, with a computational environment on the back end.

Okay, so I think that's all we have for you today. We just wanted to show you what this can look like in practice. I also want to acknowledge the people whose work we've built on, including Alice Bartlett, whom John mentioned; we've relied on others' great work a lot while developing this workshop. And then we have some resources for you. So that's all we have today, and I want to thank you for attending. If you have any questions on these topics, you can reach John or myself at askdata at duke.edu, or feel free to reach out to us directly. I'll be sending a follow-up to everyone with a copy of the slides and a copy of either our previous workshop, where we covered pretty much the same material, or this workshop, whichever John and I decide on, so you'll have access to these materials afterwards. And again, if anybody had any issues with the attendance survey and you're here for RCR credit, please do contact us soon, because that's how we report out to the graduate school. So thank you all again, and I hope everyone has a lovely day.