Hi, CNI 2022. Thanks so much for watching. My name is Vicki Rampin. And I'm Sarah Nguyen. And we're here to talk about Who Versions Scholarly Code. The work that we're about to present was done as part of a project called Investigating and Archiving the Scholarly Git Experience, or IASGE, which had two main streams of work. The first was a behavioral study of how academics, and folks in academia more broadly, are using Git and Git hosting platforms, and how these tools can be better aligned with their needs, which differ from traditional software development. The second stream of work evaluated the extent to which scholarship on these Git hosting platforms is being preserved by professionals. So we're going to present the results of the behavioral study in this talk.

But just to make sure we're all on the same page, we're going to go over a few definitions. The first being, what is Git? Git is a revision control system that lets you compare, restore, and merge changes to plain text files over time. It's an open source command line tool created in 2005 and is definitely the most widely used version control system both in and outside academia. It's a really important tool for collaboration and transparency in programming. This is a really amazing XKCD comic that kind of captures the spirit of Git, where this guy's saying, "This is Git. It tracks collaborative work on projects through a beautiful distributed graph theory tree model." "Cool. How do we use it?" "No idea. Just memorize these shell commands and type them to sync up. If you get errors, save your work elsewhere, delete the project, and download a fresh copy." This kind of behavior is why we were interested in studying Git, because it seemed to be so common.

The other thing I mentioned were these things called Git hosting platforms. These are basically places on the web that host Git repositories and may add some features on top, such as social coding features.
They're not the same as Git. They're typically commercial platforms, but like I mentioned, they are places where people commonly upload Git repositories to work on with others, to make them publicly available, and again, to take advantage of some of those additional features, like allowing strangers to contribute code, and issues, which are kind of a mix of to-do lists and discussions. The most popular Git hosting platforms are GitHub, GitLab, Bitbucket, and SourceForge.

So the other thing is, what is scholarly Git usage? What do we mean when we say scholarship in Git and on Git hosting platforms looks a little different from the traditional software development that we see? Well, the first example of scholarly Git usage I can show is probably the most obvious one: publishing code and data as supplementary materials. This is where someone might create a Git repository that has some scripts, maybe some data depending on the size, and they publish it alongside something else to encourage replication.

Another great example of scholarly Git usage is quality assurance workflows for data analysis. This figure comes from my favorite preprint, which is all about building workflows to do QA and QC, that's quality assurance and quality control, on data using GitHub and Travis CI, which is an application that works on top of Git to add some computation to a workflow. So they do things like run automated checks on their data as part of these half-automated, half-person-reviewed workflows.

Another example I can show is journal infrastructure. People are actually doing entire peer review cycles on GitHub. They are publishing journals that run entirely on GitHub. The most popular one, I think, is the Journal of Open Source Software, where they have a repository specifically for reviews of the articles. These articles get DOIs, they get archived in Zenodo, and the actual peer review is on the software itself.
So it's kind of scholarly Git on a few different axes.

In addition to the code, we're also interested in some of the other pieces that make up the scholarly Git experience. These are what we've called scholarly ephemera. In this case, scholarly ephemera are rich material that provides context to the source code development process, including content that provides insight into the genesis of a project, communications between its contributors and collaborators, and the procedures and interactions that brought the repository to its most current state. These artifacts are important for understanding the history of a repository, the relationships between repositories, the branching and network formation of repositories, and the actual contributors themselves: why they did a thing, why they chose to make certain changes, and the ways in which this information could be used to track derivatives of current work. So these include things, as you can see here, like branches and commit logs for the actual version control; issue trackers, which again are those discussion boards mixed with to-do lists; and pull requests, which are when collaborators from outside your project want to contribute code. These platforms have wikis and lots of documentation features as well, and all of these contribute to a rich boon of scholarship that takes place both in the Git data format and on Git hosting platforms.

So why do we care about scholarship on Git hosting platforms using Git? Why are we talking about this? Well, there's a lot of it out there. These added features are really necessary. And GHPs have no preservation plan and do not guarantee to keep materials around on the platforms. As evidence for my first point, there have been some wonderful papers estimating the scope of scholarship that is held on Git hosting platforms. [inaudible] et al. estimate that over 5,000 GitHub software repositories have been identified as research software.
That means that they have been referenced by a publication, or the software repository itself references a research publication that it's associated with. And the majority of those are actually coming from the life sciences. So it's kind of an interesting look at that scope.

In terms of the risk, because these are commercial platforms, there's lots of risk associated with loss. One recent example of folks in academia losing access to their Git repositories was when US trade control restrictions came into place, and folks from six countries woke up without access to their Git repositories on GitHub. GitHub had this yellow banner that said, please read about GitHub and trade controls for more information. You can file an appeal, but your account has been restricted, mainly on the basis of being associated with these six countries. We saw PhD students on Twitter, like this one, begging for access to the source code for their PhD projects. GitHub was actually taking money from people at the same time as flagging their accounts. So we've seen that it's very easy to lose access to very important scholarly materials that are hosted on these platforms.

So basically the impetus for our project was the realization that researchers are using all of these incredible tools, like Git and Git hosting platforms, along with all of the ephemera on top of these Git hosting platforms that contextualizes the code and makes it richer. And there's no current project underway that archives both the source code and the scholarly ephemera. So we wanted to look at the behaviors of scholars on these platforms to be able to effect greater archiving of their work.

So first we had to ask the question, who versions scholarly code? We did that with a triangulated approach. We did three focus group sessions with 12 minimal users, so those are folks who rated themselves as not very expert with Git or GitHub. We followed that up with a broad survey that had 54 questions.
We ended up getting 358 usable responses after accounting for our four inclusion criteria, which said you currently work in academia, you write code or use code, you apply version control, and of course, you consent to the study. And lastly, we did one-hour task- and scenario-based interviews with 42 scholars. These were really deep dives into different scenarios where they would have to interact with Git and Git hosting platforms and tell us their approaches, motivations, frustrations, and everything in between. So now we're going to present some of the results, which will be a mix from these surveys, focus groups, and interviews.

First, to give you some information about our participants. We asked folks in our survey about the role that they're in at their current institution, the length of time in that role, their discipline or area of study, the type of institution that they work at, whether or not they get funding, things like that. For our interview and focus group participants, we won't be showing that information here because it is reidentifiable at that level. That said, we do want to show some quotes from our interview participants because they really speak to the heart of what we're talking about.

So with that in mind, it was really interesting to see that a lot of our participants were actually staff, more so than any other category. And I think that really speaks to who does what. Of those who answered, there were 140 staff members overall, 45 of whom are postdocs. This is obviously split up by discipline here, but overall those are the numbers. There were 81 students overall and 76 faculty members. You can see that a lot of them were in the social sciences and STEM. 269 of our respondents said that they had been in their current position for one to seven years. The majority were in the first three years of their role.
So we kind of skewed toward people who were earlier in their roles and people who are in staff positions, from both the social sciences and STEM fields. Overwhelmingly as well, our participants came from public universities. It was really interesting to see that overall 195 of those participants were at public universities, again with the top categories being staff and doctoral students, then faculty. So that was a kind of interesting look at who is versioning scholarly code off the bat.

Thanks, Vicki. That was great. So now that Vicki has told you the demographics of our data, that is, who the people were who answered these questions and let us understand how they use Git, I'm going to go into a little bit more detail about Git in the workflows themselves. We're going to start off with version control use. Keep in mind that this is just a snippet of what we found out from our survey data, which is then backed by our in-depth interviews, but it'll give you an idea of how, who, where, and why people use version control: in their everyday workflows, within toolkits, what their motivations were to use Git, and their general proficiency throughout their career.

So first we'll start off with this: the first and current version control systems used. Git is the most popular VCS across all groups. As you can see here, more than 300 of our participants said that they use Git currently, and it was also the most common first-used version control system. Following that, we see SVN as the first one used, from when they started out with version control systems, and then Mercurial. GitHub was by far the most popular GHP, with GitLab and Bitbucket following as their Git platforms. All right.
So here we see how our participants self-reported their proficiency with Git, by their status in their institution. Like Vicki said, a lot of our participants were staff, and they were mostly pretty proficient, at an intermediate level. That is in line with doctoral students and postdocs. Proficiency then decreases among tenured faculty, and continuing contract faculty actually reported the least, alongside master's students and undergrads. Obviously, there are other reasons why proficiency differs among these groups, whether it's the stage at which they're learning it or who ended up teaching it to them throughout their workflows. But we were interested to see that there were actually very few who felt like they were experts, and more who swayed toward the more cautious side of their proficiency.

So in general, people are using Git either locally on their own systems, which can be within their terminal or within a graphical user interface, or on a Git hosting platform. We can see that a lot of our participants reported the local terminal. But there are actually a lot of people who are interested in a local GUI because it offers so many other types of tools, which we'll talk about later on. And then we have a participant here who says, "I don't really trust GUIs in that I don't really understand what's going on. Unfortunately with Git, I don't understand what's going on with that either." So I think the overarching theme we see here is that, regardless of the tool, there is a continuing lack of a mental model, which we do talk about a little more extensively.

In this next table, we are showing the frequency with which the scholars who participated in the survey use Git locally as well as on the web. So here you can see this is for local daily use. A majority use both the hosting platform and local Git at this frequency.
So we kind of see that as them switching back and forth between both systems because of some features, which we'll talk about later. And that also coincides with the very gradual decrease you see into yearly, and then quarterly and monthly.

On this next table, we show the different types of features on Git hosting platforms used by scholars, by their status. We can see that forks and PRs are actually the most highly used features on Git hosting platforms almost across the board. And that's for reasons of collaboration, being able to reproduce and copy someone else's work. This really just demonstrates the type of scholarly use and scholarly ephemera that are created when you're using Git and Git hosting platforms.

In this chart here, you'll see the actual motivations for using Git locally. Why did they adopt Git into their workflow when maybe it wasn't even included in, say, coursework or anything like that? A lot of the staff and doctoral students, who are not yet in a tenured or stable position, thought that helping them get a job was actually very important. Overwhelmingly, of course, the motivation with Git is needing the actual version control system itself. And then we see that their collaborators using it was also an impetus for them to adopt it. And then we had a slew of others as well, which varied across the board.

In this quote we see here that Git is seen as more difficult to swallow in situations such as teaching programming. In this case the researcher said, "I think that for most people, if I'm being honest, Git is the part that people have the hardest time seeing the benefit to when I'm teaching programming. They want to learn Python, they might want to learn the shell, and then Git has this feeling of eating your vegetables. Even if it's hacky, I find it lowers the cognitive load." So this particular scholar, and a lot of others who teach it, do see the benefits.
It's just hard to get the learners fully embedded and convinced of the benefits from the beginning, in comparison to other things such as Python.

So moving back to this idea of motivations and why people use Git, this particular case is for Git hosting platforms. And again, we're looking at it by their status in their institution. The types of motivations are keeping track of changes, collaboration, method tracking in their scientific process, and the openness of data sharing. We have a small "other" category, which covers a slew of variables. And then we have general publishing of content, because it is an opportunity to have one's work viewable to outside viewers. And then the general need for version control systems. Another participant here talked about their motivations for wanting to learn Git: "I want to learn Git because many cool projects shared on the internet was using GitHub and I want to know how they manage their code, how they share with each other. It's like a community." So the idea of collaboration is not just within your immediate lab, let's say. It's actually much broader than that.

Let's go into teaching and learning. As a researcher or a scholar or a teacher, you have so many things that you have to interact with to create this whole environment for doing a method or understanding a workflow. So in this work we wanted to understand how we prioritize Git amongst all of the other types of tools that we need to learn. So as people were learning Git, here's their ease of learning by their status. We can see, again, staff is a little bit more represented here, but they were more neutral on it being very hard or difficult to learn. Doctoral students swayed a little bit more to the difficult side, as you can see. And we can then see that continuing contract faculty had a little bit less of a say on the more difficult side as well.
And then we can see that master's students, as well as postdocs, were more evenly distributed across the board in their ease of learning. Here's a participant who says, "The hardest time I had was working on a project recently where I was using Git with branches and pushing things to GitHub to make a bookdown. It was a mental model and knowing what to do and how to decide. So yeah, this lack of a mental model pervades all my difficulties with Git." And like we were saying earlier, there's this question of how to interpret the analogies that Git uses in its terminology, branches and trees. So there's a difference between just knowing the actual commands and actually knowing what the process is supposed to look like in Git.

In this graph here, besides learning Git itself locally, we also see the ease of learning Git hosting platforms. If we had this graph side by side with the ease of learning Git itself, you could see that this graph actually sways slightly more to the left, with a right tail. That means that more of our participants said that it's easier to learn Git hosting platforms, most likely due to the interface being much more friendly, maybe having some sort of support from the Git hosting platform itself through its documentation, as well as the whole idea of buttons and features and their built-in best practices.

In this graph here, we see the frequency with which participants reteach themselves Git. Like any language, if you don't use it often enough, it's easy to forget. So we see that people might be reteaching themselves on a quarterly basis, and very few on a daily basis. As seen in many Git workshop materials, it is not a tool that someone can learn in one go.
So, teaching Git to others. We're really trying to understand not only how one uses Git and learns it for oneself, but also how it works within a team, because collaboration is so crucial to this type of tool and workflow. So what about teaching others and interacting with others around Git? As you can see, more than half have said that they have taught Git. Do they regularly teach Git? That's a no. And then what materials do they use? Sometimes they reuse materials from others, sometimes they make them themselves, or they use a mixture of both.

Our last major theme in our data here is about research and sustainability. What are their management practices? How do they deposit code? And how are they preserving their research for the long run? Do you collaborate using Git? Most people said yes, probably for the obvious reason that that's why they joined the survey. How frequently? The majority reported daily and weekly, which was nice to see. Annually is a really interesting type of practice here. Maybe just once a year you go into your scholarly website to update it. So there are many different speculations there.

And then there are the collaboration practices, and just being able to manage your code with new researchers joining a lab. Do you onboard new people to the version control systems? The majority say no. That's very troubling because the onboarding process is very important, especially with the high turnover rate within academia. We can see that there's a high need not just to teach people how to use Git and manage their code, but also to figure out a way to actually onboard these co-researchers onto the project and onto all of the scholarly ephemera that we're creating on these platforms.

We also asked, where do participants deposit code? Sometimes this can be a trick question because sometimes they don't deposit code.
But luckily we saw that a lot of people said they do deposit, in Zenodo, which offers a DOI. We see that 47% of participants deposit their code into a repository for archiving or long-term access and reuse, led, like I said, by Zenodo. This was followed by the Open Science Framework and/or an institutional repository. We also just wanted to note that Zenodo's official integration with GitHub is probably one of the major reasons why Zenodo is a top-used platform for depositing code for long-term use. Outside of that, we do see Figshare and Software Heritage, which provide a valiant effort in this type of work.

So do scholars want a GitHub for academics? This is a question that we were interested in, just to see whether it would be helpful to actually create this custom repository or platform. The majority said maybe. And that's for many different reasons, such as only using Git and GitHub a few times a year. But we do like to see that there are some people who say no, because they want to see themselves integrated into software developers' workflows, and some who say yes, in that they are actually excited for something specially made for them.

In culmination of these three sections that I just talked to you about and showed within our data: Number one, research software is in fact foundational to scholarship. Number two, understanding authors, maintainers, and contributors is critical for the preservation and reuse of research. Number three, this research and our resulting data explore the behaviors, motivations, histories, and demographics of scholars who use version control systems. And our last major finding is that the interviews and focus groups provide more context for the why of what the survey reveals.

So we just want to end by encouraging everyone to use our research materials. They're open and available via the Qualitative Data Repository. All you need is a QDR account. You can download and reuse any parts of this.
We'd love to see how it works in your communities. There are really endless possibilities for investigating this data as well. We gave you a very high-level, drive-by view of everything, but if you really want to get in depth, we definitely invite you to do so.

I also just wanted to mention the project that is being built on top of the foundation that IASGE laid, called Collaborating on Software Archiving for Institutions, funded by the Alfred P. Sloan Foundation. It's currently ongoing. It's building on the research to develop software curation programs by building software to help with archiving materials on the web, optimizing archival workflows, and building community and fostering knowledge about the importance of software management and curation for long-term reproducibility of research. So keep an eye out for stuff about CoSAI.

Thank you so much. Feel free to follow up with us. You can always reference the IASGE website for all of our presentations and papers. Feel free to get in touch with Sarah or me. And I just want to say thank you to the Alfred P. Sloan Foundation for funding IASGE and CoSAI. Thank you. Thank you all. Bye.