Okay, hello everybody, and welcome to today's reproducibility webinar, I suppose, though there is a little bit of a workshop at the end as well. Today we're focusing on collaborative working, and in particular we'll be talking a little bit about Git and GitHub. My name is Joseph Allen, or Joe Allen, and I'm a research associate here with the UK Data Service. So thank you very much everyone for coming.
In this webinar, we'll cover the problems of research in general. We'll write an English essay together and point out the flaws in our usual approach to writing. We'll discuss reproducibility, collaborative working and version control. We'll introduce Git and GitHub, as well as some key terminology that goes with that technology. We'll explore a crime scene metaphor, and finally we'll end on a demo using GitHub Desktop. By the end of this webinar, everyone in attendance should be able to explain what Git is, explain what GitHub is, clone a repository, add files to a repository and edit files in a repository.
Before we get started, if you are hoping to follow along with the demo at the end, you will need a GitHub account, so you might wanna make that now while I'm talking. You'll also need GitHub Desktop downloaded, or some other way of running Git locally, though I'm gonna be using GitHub Desktop for Windows, and obviously you'll need to be logged into your GitHub account there. This is all being recorded anyway, and it's live on YouTube at the moment. So if you can't get that done, or I speak too quickly at the end, you'll have plenty of time to come back and watch it. It will be immortalized on YouTube.
So to get started, what are some of the problems of working with data in research?
Some of the things which can make research hard are the following: useful data can be inaccessible, researchers are encouraged to protect their methods, media encourages positive results, results often cannot be repeated, academic work is mentally very difficult, and attempts at solving all of these problems with more accountability processes are often very boring, low priority and skipped by academics. So let's explore all of these in a bit more detail.
One thing which makes research hard is that data is often protected, and with very good reason. Data may carry considerable risks to individuals. For example, we could identify protected characteristics from health or crime data. If we didn't protect sensitive data, we likely wouldn't get any access to sensitive data. Sensitive data is rightfully only given to researchers with valid reasons to access it. A downside of this is that even research with very transparent methods may not be reproducible by other academics. An ideal future for me is that data providers will give access to journals and other academics once we have a standard for reproducible research.
Academics are often very protective of the methods they use in research until that research is published. Researchers may keep their methods secret out of fear that somebody else could steal their work. Academics are encouraged not to broadcast methods until at least when their work is published. Further to this, methods may intentionally be simplified so researchers can apply those methods to their own upcoming work. We also lose all unpublished methods. In a less malicious example, academics may simply not remember their methods at the time of writing up due to poor note-taking.
Academia encourages positive results despite the huge value of negative ones. A positive result is obviously more likely to be published, and this is well documented. Further to this, a positive or a controversial result is more likely to be picked up by mainstream media outlets.
Without pressure to make methods reproducible, academics can tweak their analysis to force the discovery of a positive or controversial result in a particular sample.
Most results have very little work done to make them reproducible. Academics protect their methods, making it difficult to work on top of somebody else's findings. Data is protected, as we've discussed, with very good reason, and often the details of accessing this data are not shared as part of that research. Recently, the field of psychology has undergone a replication crisis, with 70% of researchers failing to reproduce another scientist's work and, further to this, 50% of that sample failing to reproduce their own results.
Beyond all of this, academic work is mentally difficult. These aren't the people that we should be giving extra work to. Academics are often some of the most knowledgeable people on their topic in the entire world. Academics are dedicated to trying to do something that nobody else has ever done, building on years of established work. Between deadlines, financial and social pressures, there's very little time to take on extra work unless it's truly required.
I'm just looking at your question, Lawrence: has there been any formal research to assess what proportion of research students, especially in the social sciences field or proportion? I think you might need to give me a bit more context for that question. Sorry, Lawrence, I'll have a look at it. I'll answer it in time as well. Did I answer this?
Accountability work is generally boring. Time sheets may be seen as skippable pieces of compliance without consequences in some institutions. Staff may be too busy to check accountability work, so failure to comply can be seen as a successful strategy and will spread throughout schools. Between the other pressures listed, why should we add this extra work?
Well, the reason we should be adding this extra work is that it saves everybody loads of time in the long run and makes all of that work way more verifiable. We don't have to solve all of these problems all the time. The goal for now is to make our work a little bit more reproducible. If you write just a little bit of documentation on how you access the data or take backups of your work somewhere off your main laptop, you're probably in the top tier of academia. If you only make your work reproducible to one person and that person is you, to me that is still very, very valid reproducibility work and on the right track. Cool. Now let's move on to a quick example. It might be more familiar. Let's outline how we might write, say, an English essay or a book report or a film studies piece of coursework or however it feels relevant to you where you are in the world. So my example, one I did back in 2008 was how is freedom represented in the film The Shawshank Redemption? So I'd ask you, how would you write this essay if it was, say, a 5,000-word essay? How would you begin to break down that process? Would you write down some notes? Would you sort of build a structure before you even started writing? Would you keep track of all of your drafts? And if so, how would you do that? Would you ask any of your friends to review your work or would your teacher or a professor review your work? And then how would you manage their feedback? How would you action the sort of 10 little quirks they might say? To start with, I might break down the entire essay into four sections and I might call this a layout file or something like that. So I might just have four bullet points, maybe introduction, point one, the use of bird imagery, maybe a second point, the progression of time and finally a conclusion. And there's value in this work, right? This is a plan I've come up with and I should probably save that plan somewhere. 
So I might save this file as notes.txt, or notes.doc if you use Microsoft Word or something like that. Next I might flesh these ideas out with a couple more quick bullet points. I might remind myself that somebody in the film had a pet crow, that lots of bird's-eye shots are used, that things are slightly foreshadowed in this film, and then I would save this. So this is arguably a better plan, right? Arguably better notes. And I might save this as notes.txt. It's not a great name. Is this a separate file? Should I have notes1.txt and notes2.txt, or should I overwrite notes.txt with these now better notes? These are very intentional things to do with files that I don't really think we question. And another point there: what if I decided that these bullet points weren't great and I wanna get back to my notes1.txt? I would have to remove these extra bullet points by hand, whereas if I had that original file I wouldn't need to do that work. Or I could keep clicking undo until I think I'm back at the notes1.txt stage, or it might just be lost entirely. It might be too complicated to get back to that point.
So now I'd write out my first two points in greater detail. This is my first attempt at really writing an essay instead of just having a notes-style plan. As such, I'll save this one as essay.txt. This might seem like a good name, but this isn't really an essay yet. It's still more of a plan than anything else. So even though I've named this, and sort of tried to manifest that at some point there will be an essay in this file, at this point that essay doesn't exist. It's just pieces of an essay coming together. This is the place that our essay will eventually be. Next I write a conclusion, and I'll flesh out that introduction as well. At this point, all I need to do is finish up this first attempt at an introduction. I think at this point I would have a full draft. What the heck? Where did my screen go?
Okay, so at that point it's possible that my computer stopped working. That's really annoying. Was it a power cut? Maybe my laptop broke. When did I last save? I don't think I've saved since before I wrote that conclusion. That's really annoying, but I'm incredibly grateful that it was just a power cut and not that my entire computer is broken. I could have lost years of research on The Shawshank Redemption if it was all just saved on that computer's hard drive. So I rewrite that introduction again, which is not too stressful because I only just wrote it. I've learnt my lesson, so I'm gonna save it after I write the introduction. I rewrite the conclusion again and I save it. And I save this entire file now as essaydone.txt.
Now it's time to proofread my work. I might realise I've spelled Andy Dufresne's name wrong the entire way through my essay. I'll fix any other spelling or grammatical issues and I'll save this again, this time as essaydonefinal.txt. This might sound a bit more familiar. I know I've done this many times. There's no indication here of what was fixed. I could have called it essaywithspellingcheck or essaywithspellcheckrun or something like that. But instead I've just called it essaydonefinal, as if this is truly the done final version and not just the done version.
Next, my teacher has offered to review all of my class's essays once. This results in a marked copy that I'm calling teacher copy. So she receives my teacher copy with no changes, which is a direct copy of essaydonefinal. The teacher then works on a new file, teacher marked essay Joe, because they need to know that it's mine, right? If my teacher has 10 copies all called teacher copy, then they don't know which one is Joe's. So they need that context somewhere in the file or in the file name. So they work on that new file, teacher marked essay Joe, which they return to me. And then I would work to action all of the little typos and strange things they found there.
And maybe I'd save that as essay actual final. So at this point I've got essaydone, essaydonefinal, essay actual final. And then to make it even more complicated I might have a twin brother who also has to submit the same essay because that twin brother is in the same class as me and we work on the same computer. So at the moment our desktop is just littered with all these copies of our projects. And whether maliciously or not my brother might think essay actual final refers to his essay actual final and might submit that. So because my brother submitted first and then I go in and submit my essay actual final it does look like I copied and plagiarized from my brother so I could end up with an instant zero in this coursework and I'm only relying on the goodwill of my brother to clear that up. The teacher could potentially notice that but again it depends if they still have access to this teacher marked essay Joe and Sam. So throughout this process a couple of things really went wrong and in the small scales of maybe a GCSE coursework it's not too problematic. But if that was a year's research and some of those problems were happening it would be quite nightmarish. Some of those problems would mainly be fixed by what's called a version control system which I'll get into slightly later. But some of those issues were we were overwriting a file which means that the only way to make save points that we could revert to was to create multiple copies of those files. We were hiding the context of each of those files in the file names as well. So we were prematurely calling something an essay and then that meant that we had to say final essay in the file name or essay with feedback from teacher. We're really looking for a way to sort of store that context outside of a file name but that's the best solution we have without a version control system. It's obviously possible we could accidentally delete all of our work. Our work is as fallible as our computer was. 
If our computer broke or was stolen, or my house burnt down, all that research on The Shawshank Redemption would be gone. And when we do collaborate with the teacher, some mess occurs, right? We get some extra versions. We don't really have actionable fixes. We just end up with one commented-on document, and we have to action those fixes however we choose to do so. And then finally, we don't actually have any proof that we wrote an essay, right? We could have just copied our brother at the end and we would have been in almost the same situation, except our brother would probably be more annoyed. So somebody else could submit our work, and we don't really have proof that we built up that work slowly over years and years. What would have helped us in this use case, and more generally in most of research, is the ability to trace our route to a solution. While we're talking generally here about reproducibility, one particular kind of tool, known as a version control system, helps out big time.
So to me, reproducibility is about creating your work in a way that others could easily understand and build on top of. We can make our work reproducible by making it very clear how we arrived at our results. Good reproducibility should be like following the recipe to make a cake, rather than showing a photo of the cake to somebody and expecting that they can then make the cake. This recipe should contain things like the data: how this data was acquired, where somebody should look to find this data, and tips on how to access that data. The tools: the materials used, recommended papers, software recommendations and anything like that. Results, as objectively as you can possibly list them. Access, if possible, though we covered that a bit under data. And as a bonus, it's very clear that you did this work. It would be a lot of effort to go through to pretend that you've slowly built up research over years and years. You may as well just do the actual research.
When we make our work reproducible, we benefit as the authors. It's much harder to forget why we did something or where we put it. Our analysis is much better documented and hence easier to tweak. It's easier to go back and adjust if needed. It's easier to write up our work as a paper, since we have outlined that entire process from start to finish, along with the work and the updates to any notebooks or programming files we might be using. Our work can more easily be verified by journals, academics or data professionals from the general public. And we have evidence, as I say, that we actually did that project over a long period of time.
The journals publishing that reproducible work will benefit. There's less risk of a large-scale reproducibility crisis like we're seeing in psychology at the moment. Journals are increasingly encouraging the publishing of only reproducible work. So that standard is rising, and if your work isn't reproducible you might find in the future that that is actually your barrier to getting into journals, especially in more computational fields. Journals could verify those results before publishing if they are truly reproducible. And negative results become more valuable, because journals might be incentivized to publish useful methods rather than useful results. Not that results aren't useful if they're not positive.
Other academics also benefit. Data access becomes better documented alongside interesting pieces of work. If an academic wants to build on top of somebody else's work, they can easily identify how to do so, provided they can find that work. Replicating the methods of another academic becomes trivial, though understanding them will still require hard and very cognitive work. More time is dedicated to reproducibility and furthering the field rather than repeating studies that have been done but not published. And negative results are more likely to be published, again further reducing this repeated work.
The general public now finds academic work more accessible. Interested data professionals may choose to verify or enhance work with public interest, and government policy becomes more accountable, as any individual could attempt to verify those results.
Now let's drill down specifically on what collaborative working is and how reproducibility works with collaborative working. We could define collaborative working as something like this: working with somebody to produce something. Very generic. But I argue we can go a bit further than this, because this makes reproducibility look like a selfless act, and I argue that it is a selfish act and that is a good thing. The person most likely to benefit from our own reproducible methods is almost always ourselves, right? When we save a file, we're saving it for our future selves. We're saving it for when we come back tomorrow to continue working on it. We're not thinking two years in the future when we might need that file. So when we write something in a notebook, or a little bit of analysis, or some Python code, that's for us, for now anyway. So to extend this, collaboration can be working with our past and future selves, and this should be the biggest motivator here. Making our results, methods and packages open for anybody is still collaboration, whether we see benefits from that or not. And sometimes it's much more efficient to find somebody else who has a skill we're lacking than it is to learn that skill ourselves. So if we can make things open and accessible, it's much easier to share them with our friends to verify, or with coworkers who might try to verify or build on them. So I would tweak that definition that I've just given you: collaboration is the intention to work with others, and those others can be us, and that's okay. So now, by working in a way that helps our future selves and maybe other people, we are doing reproducible work, but it's not necessary that other people use it for it to be good work.
And those problems we talked about earlier generally get easier. Useful data is still protected, but it's a bit easier to get approved access because we're sharing the way we access that data. Methods are protected with accountability, but are also more likely to be shared and verified. Positive results are still favored, but negative ones may find value in future research. Results become reproducible in future work, but our foundations are still at risk, right? We haven't always had reproducibility in the past. Academic work will remain mentally difficult, and is slightly clouded now with this accountability push to be reproducible. And accountability is arguably still quite boring, but modern tools are making this easier and faster to do.
A lot of these benefits come from a very specific kind of system that I keep mentioning, called a version control system. Lumped in with this is the term Git, which you might have heard a lot as well. This is the most popular version control system. So I would expect that in the future you probably won't hear the phrase version control system more than you're going to hear it in the next 20 minutes, but you probably will hear Git quite a lot, and probably have done so already. A version control system, or VCS, is a method of recording and preserving the history of changes made to directories and files. So this could be the ability to roll back to previous versions of a document. For example, with that notes1 and notes2, this is how we could jump back to notes1 and forward to notes2 without needing to click undo a bunch of times and without needing to have a separate file saved. This system facilitates that. It also facilitates storing those multiple copies of a document if you want to. And in a more formal system, this is a software solution that's shared between multiple users to encourage that collaboration. Some of the benefits of a version control system are as follows. Files are part of a single project.
The current version usually exists on our machine, or the machines of the people working on that project, but all previous versions of those files and all the variations are saved in what's called a repository. And again, that repository could be on your local machine or it could be somewhere based in the cloud. A base version of this project exists, and the version control system will save all subsequent changes from that base project. Commit messages can be used to describe the changes made to a project. So this means we don't call a file final version 2 thesis. Instead we will just save that file as thesis, and we will describe the changes we're making at any point. And if we mess up, as I say, we can retrieve earlier versions of the project. A good system will allow other users to copy or fork our work and submit their own changes without permission. And we can also determine who wrote an individual line of code or text and chase up with whoever was responsible.
So Git is one implementation of a version control system. There are multiple different implementations you might have heard of, such as Amazon's Beanstalk or Mercurial, and there's a really common one that I can't remember anymore. But Git is now the most popular one, at least in software development. And I would say in academia as well, it's the one I've mostly seen used. But it's only in the last 10 years that Git has become the most popular. Git was created by Linus Torvalds. He wanted to make it easier for people to contribute to the Linux operating system. It's usually used as a command line tool, but there are graphical interfaces to make this really easy, such as GitHub Desktop, which we're about to do a demo of. But there are other choices as well. There's one called GitKraken. There's a bunch of different GUIs you can use to make it not as scary as it is. So we briefly mentioned a repository, so let's go into what that is. A repository is simply a storage space for a collection of files.
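If it helps to see that idea of save points with commit messages as code, here's a minimal sketch in Python. It is a toy model and not how Git actually stores anything: the `History` class and its `commit`, `log` and `checkout` methods are names made up for this illustration. The point is only that the file keeps one name, and the context lives in a message next to each version rather than in the file name.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Commit:
    """One save point: the full text at that moment plus a message."""
    message: str
    content: str


@dataclass
class History:
    """Toy in-memory history of a single file (hypothetical, not Git's API)."""
    commits: List[Commit] = field(default_factory=list)

    def commit(self, content: str, message: str) -> int:
        """Record a new version and return its position in the history."""
        self.commits.append(Commit(message, content))
        return len(self.commits) - 1

    def log(self) -> List[str]:
        """List the messages, oldest first, so we can see how we got here."""
        return [f"{i}: {c.message}" for i, c in enumerate(self.commits)]

    def checkout(self, index: int) -> str:
        """Return the text exactly as it was at an earlier save point."""
        return self.commits[index].content


# The essay notes from earlier, kept under one name with two save points.
notes = History()
notes.commit("- intro\n- birds\n- time\n- conclusion\n", "Outline four sections")
notes.commit("- intro\n- birds: pet crow\n- time\n- conclusion\n", "Flesh out bird imagery")

print(notes.log())
print(notes.checkout(0))  # the original outline is still recoverable
```

So instead of notes1.txt and notes2.txt, there is one set of notes and a log that says what changed at each step. A real version control system does a lot more than this, but this is the shape of it.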
So you could call a folder on your computer a repository if you want to. But in a version control system this word has a slightly different definition, in that it's not just a collection of files, it's that base project that we're building up on over time. Repositories are colloquially called repos, and a repo can be locally hosted on your computer, as I said, but usually the repos we're using are on an online hosting service. So GitHub, GitLab, SharePoint, these are all repositories.
And that leads us nicely onto GitHub. So GitHub is a, I'm not sure I agree with what I've written here, but I say GitHub is a repository. I suppose it is. But really GitHub is more of a social media platform for developers and for scientists, and it facilitates collaboration between users. You can make an account on GitHub, and there you can create multiple projects or multiple repositories. Those are the repositories that people will copy or fork or change. So it's a place for those projects, or those repositories, to live that isn't on your computer. GitHub is owned by Microsoft, and they release free updates to improve the way that programmers collaborate in general. One recent feature is called GitHub Copilot, which allows developers to use artificial intelligence to write code for them. GitHub repositories obviously rely on Git to work, which is why it's called GitHub.
Some of the useful features of GitHub are as follows. Every repository has space on its homepage for a friendly README to be written. GitHub will render a few useful file formats, so it will render PDFs for you, code with syntax highlighting, and even notebook files such as Jupyter notebooks. Services know that they should support GitHub because it's so popular. So you can open notebooks in services like Binder directly through a GitHub link, and you can also use Zenodo, I think that's how it's pronounced, to generate citable DOIs for your research.
And GitHub facilitates collaborative working by hosting free project boards, issue tracking, discussions and wikis, and by enabling users to publish their own code to fix issues in popular projects.
There are a few terms you might have also heard tossed around in association with Git. I'm gonna quickly define these terms before going into a metaphor. So this is a quick cheat sheet. If you're very new to Git, I'm not expecting these words to mean much to you yet. This is probably the most important slide that you'll need. I suggest screenshotting this one, but again the slides and the video will be available later.
A repository, as we've said, is a place to store a project. Usually an online service like GitHub will do this for us. A repository could be on your computer, but generally I think of this as a cloud-based place for a single project. Cloning is when we make a local copy of a repository. So usually we're cloning from something like GitHub, taking a single project or a single repository on GitHub and cloning that to our local computer. Pulling is when we update that local copy. We might have cloned it a year ago and we haven't checked to see if anyone's done any work, so we would pull to bring in all those new updates to that repository. Pushing is when we've made updates. Our local copy of the project is further ahead than the copy that's available on the cloud, so we push our updates and that will update what's on the cloud. A conflict is when two users push different things at the same time. So if I fix a typo, maybe I wrote my name Joe and I wrote it all lowercase, I might wanna change that to J-O-E with an uppercase J, but somebody else might have also done that, or they might have replaced the name Joe with their own name.
This would cause a conflict. One of us would have pushed to the cloud first, and when the second one pushes, Git will notice there's a conflict there and we will have to work together to resolve that conflict. A commit is when you capture your changes since you last pulled the repository. You can optionally add a friendly message. So if I add a new file to a repository, I can just push that up with no message, but that's not generally what we do. Normally we try to write a little friendly sentence to our future self to say: added this file to do this, or fixed these typos in this, or fixed this error here. So committing solves the problem of keeping that context inside the file names. That context of where we're at should not be associated with the file names at all. Adding is when you add something to a commit. Just creating a new file wouldn't add it to a commit. You'd have to manually say, I want to commit these particular files but not these files, because those are just my working files. Removing then is the opposite of adding. So if we want to stop tracking a file, we could do that with removing. And then status will return the current state of your commit that hasn't been sent yet. So since you pulled, you've made some changes, and your status will be those changes, but you'll have to add those changes before you push them.
So, loads of very strange, very different terms there that I'm not expecting anyone to really keep up with. That should all sound quite strange and new, because it's a very new way of thinking, but to get over this I've got a strange metaphor that we can go through instead that might help. There was a lot in that slide, as I say, it's a cheat sheet, but let's see what's happening instead in the world of Detective Data Service.
It was the snowy kind of day you always saw on TV, but snowy days turn into slushy nights, and on this slushy night, one dataset was pushed too far.
The only thing longitudinal about this dataset was that it was laying down in the pink snow, a bleeding shot right where its pivot column should be. It was a case Detective Data Service didn't want, but with the current funding cycle, it was a case Detective Data Service would have to solve. The police station, known by its nickname, the repository, was a great place to store our investigation. In previous cases, Detective Data Service had stored the investigation in their office, but that hadn't always gone down smoothly. We clone the repository to our notepad to make a copy of all notes and evidence that we can take home with us. So even though those files are still safe at the repository, we do have a version that we can carry with us at all times.
At the scene of the crime, we discover two things of note: a single orange hair and a data deposit slip signed IOU. We add these to our local files. So if we check the status of our changes, we would see the addition of that one IOU and one orange hair. We write a commit message in our notepad to later provide some context on what we've added. So I might add a friendly message, something like: found evidence at crime scene. This creates a save point in our investigation that we could revert to. You know, if we were a year from now and we wanna jump back to the evidence we found at the crime scene, this commit message would help us identify that. And there's little value in keeping this just on our notepad, right? Nobody can pull from our notepad. We need to push these back to the repository, and quick. Once we're there, we push our changes. The repository now has access to our new evidence and the commit message, and makes a copy that they store at the repository. So they have a message now that says found evidence at crime scene, and that will say we added these two new pieces of evidence.
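For anyone who finds code easier to follow than detectives, here's a rough sketch of the steps so far: clone, add, status, commit, push. This is a toy model in Python and not real Git. The `ToyRepo` class, its methods and the file names are all made up for illustration, and it merges creating a file and adding it to a commit into one step to keep things short.

```python
class ToyRepo:
    """Toy repository: a list of (message, snapshot) commits plus working files."""

    def __init__(self, commits=None):
        self.commits = list(commits or [])
        # Working files start out as the latest committed snapshot, if any.
        self.files = dict(self.commits[-1][1]) if self.commits else {}
        self.staged = set()  # names added since the last commit

    def clone(self):
        """Make an independent local copy of everything recorded so far."""
        return ToyRepo(self.commits)

    def add(self, name, content):
        """Put a file in the working copy and mark it for the next commit."""
        self.files[name] = content
        self.staged.add(name)

    def status(self):
        """Report which files are waiting to be committed."""
        return sorted(self.staged)

    def commit(self, message):
        """Create a save point: a snapshot of the files plus a friendly message."""
        self.commits.append((message, dict(self.files)))
        self.staged.clear()

    def push(self, remote):
        """Send our commits to the remote, unless it has moved on without us."""
        if remote.commits != self.commits[:len(remote.commits)]:
            raise RuntimeError("conflict: the remote has changes we have not pulled")
        remote.commits = list(self.commits)
        remote.files = dict(self.files)


station = ToyRepo()            # the police station, i.e. the shared repository
notepad = station.clone()      # our local copy

notepad.add("orange_hair.txt", "single orange hair")
notepad.add("deposit_slip.txt", "data deposit slip signed IOU")
pending = notepad.status()     # what we would see before committing
print(pending)

notepad.commit("Found evidence at crime scene")
notepad.push(station)
print(station.commits[-1][0])
```

The check inside `push` is a very crude stand-in for what happens when somebody else has pushed first. Real Git is far more careful about comparing histories, so treat this purely as a picture of the workflow.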
While we're here, we also notice that the repository is slightly ahead of us. They seem to have a piece of evidence we don't have: a single blonde hair. What's going on here? The commit message from the repository reads: found blonde hair in alleyway at the scene of the crime. So that gives us some context that was only available in that message. It would be silly to save this yellow hair as yellow hair found at the scene of the crime in an alleyway. That doesn't really make sense. So we pull from the repository, and that gives us that copy of the yellow hair as well.
In the alleyway near the scene of the crime we find a trail of yellow and orange hairs leading back to a very grumpy looking orange cat. This isn't really relevant to our investigation at all, so we should remove these pieces of evidence. If we check the status of our changes since we last pulled, we can see that we've removed one orange hair and one yellow hair from our investigation. We make a commit tagged with a friendly message for later: remove cat hair from the case. We return to the repository with our findings and we push that new commit, remove cat hair from case. This is a request to remove the hairs from evidence, as they're not relevant to the case.
But there seems to be a conflict. While we have worked out that these hairs are not important and belong to a cat, another detective has pushed some notes with the commit: orange and blonde hair belongs to the mayor. This is our conflict, and conflicts must be resolved. Only one can be correct, and we must agree this with the other detective. In this case, it seems that the other detective was wrong and didn't notice that these hairs belong to the cat. In fact, they were trying to frame the mayor for murder. We made frequent use of the repository, writing good notes and commit messages.
Our work has been reproducible, and it's very easy for us to verify that we actually did our due diligence in this study, where that second detective didn't. Now they're headed to the data service prison archive, and we've also proved the innocence of the city mayor. Well done, team. So I hope some of that sank in. I'm aware I've gone through two very sort of ridiculous ways of walking through what Git is and a lot of this strange terminology. It should feel very new to you. It should sound very weird, but it will get better with practice. So up next, let's do a quick demo of how we'd write an essay using GitHub and GitHub Desktop. So if you have installed that, I'll go through some of these exercises without any context, just so we can sort of try them out first and get used to GitHub Desktop. So you should all be seeing GitHub Desktop now. I'm just going to stop sharing just to make sure that I'm sharing the whole screen, actually. Okay, so you should be seeing GitHub Desktop. It should look something like this. So the first thing you're probably going to want to do is create a new repository. And in order to do this, in this top area you should have something that says current repository, or if it's empty, it might already be prompting you to create a new repository. And if you click there, you can select other repositories you've already downloaded, though if this is your first time using it, that won't be there. So what you want is this little button that says add, and in there you can either clone an existing repository, create a new repository or add an existing repository. So we're going to create a new repository, and we can just call this something like demo or test or whatever you want. I'll call it demo. And that's just going to create one in my documents and GitHub folder. So that's the default location for GitHub Desktop. So I've now created a repository.
So if I go to my documents folder and my GitHub folder, we should see we now have a folder called demo, and in it, it's just got some Git descriptions, but there's not actually anything in it. So at the moment, this repository only exists on my local machine, right? It doesn't exist on my GitHub account, which I'll also bring up. So this is my GitHub profile, and I can see all my repositories here. So you can see the demo repo I made last night when I was testing this demo, but this one doesn't exist yet. So what we've done here is we've created a repo, but we haven't actually written any commits or done anything. So all this sort of creating and navigating between repos will be in this section here, but you'll always be prompted in this button here, or this blue button here, to publish, push, pull, fetch, all these things that will help you keep up to date. So at the moment we want to publish, and it will ask us if we want to name the repo the same thing online, if we want to write a description, if we want it to be public or private. I'm gonna make it public by unticking that, just so if you want to, you can go see that. So now that has been pushed. So all the file contents are still the same, but if I refresh my profile here, I now have a demo repo that was updated 18 seconds ago, and that contains the exact same files that we have in that repo. We have the same commit history from when we created it. And GitHub Desktop will do a lot of the weird stuff for you, so it's a bit easier. So that's how we create a new repository, and we'll try that again in about five minutes. So if that didn't work that time, don't worry about it. Next we're gonna make a new file. So if you just add any new file to this, we can do again a notes.txt, but you're welcome to try something else. I'm just gonna say "hello GitHub" in that notes.txt, and if we go back to GitHub Desktop, we'll see that one file has changed. It's telling us that already.
That file has got this little green plus here that's saying this is a new file, not a modified file. Then over here, it'll tell us what's changed in that file since it last existed. So all it's saying is, with the plus and with the green color, "hello GitHub" was written here, nothing was modified, nothing was removed. It's just all additions. And then down here, we can create something. So it says "create notes.txt" because it's trying to help us write a commit message, but we can write one ourselves as well. So "create notes.txt and write hello GitHub". And then we can commit that to main. So that just means we're writing that commit message. So now we're one commit ahead of where the repo on the cloud is. So when we push this, and again, we can either click push up here or push in this blue button, that will go to GitHub. So if I refresh this page, we now have that new file there, notes.txt. We can see the last commit that affected this file, "create notes and write hello GitHub". We've got our whole project history, and obviously the file has the same contents as it did locally. So it's all the same, it says "hello GitHub". Now if we modify it, what if I say "bye bye GitHub" and save? This time we see some differences. So we've got like a little, oops, we've got like a little yellow pip here instead of that green plus, showing that it's been modified. The same file has been modified, and we can see that the words "hello GitHub" were removed and the words "bye bye GitHub" were added. So that's how changes show up, and it's also given us that handy commit message, "update notes.txt", which I'm just gonna take. I think that's fine as a commit message. So now our history's got those three commits: the initial commit, we created notes.txt and wrote hello GitHub, and we updated notes. That's not been pushed yet, so we can see that here. It's prompting me to push. It's saying I'm one ahead of online.
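The removed and added lines that GitHub Desktop shows for notes.txt can be reproduced with Python's standard `difflib` module. This is only a sketch of the idea of a line-by-line diff, not the tool GitHub Desktop uses internally, and the `fromfile` and `tofile` labels are made up for the example.

```python
import difflib

# notes.txt as it was at the last commit, and as it is after our edit.
before = ["hello GitHub\n"]
after = ["bye bye GitHub\n"]

diff_lines = list(
    difflib.unified_diff(
        before,
        after,
        fromfile="notes.txt (last commit)",
        tofile="notes.txt (working copy)",
    )
)

# Lines starting with "-" were removed, lines starting with "+" were added.
print("".join(diff_lines), end="")
```

Because the comparison is made line by line, changing a single word shows up as one line removed and one line added, which is what we saw in the demo.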
So if I refresh here, nothing has changed yet because I haven't done that push. So I'm ahead of the repo online. But when I push that, that's all gonna be available up there as well. "Bye bye GitHub". Cool. Now the next thing is, I might be on a new computer and I might not have access to demo. So if you delete that folder when the files are all closed, that folder's obviously not on my local machine now, but it still exists on the cloud, and any of you should be able to clone it, actually, in Git. So if you want to download the code, you'll always do it with the green Code button, and there you can download it as a zip. You can even click Open with GitHub Desktop, or you can just download the code. So if we click, I don't know if this will work. Let's try it. It needs to clone again because I've deleted it, but if it was on a new machine, that would just open up for you. So clone again. All those files are back. I've got my demo folder. It still says "bye bye GitHub". It's like it was never lost. It's basically a place to store your backups in that regard. Okay. So that's pretty much all the pieces we're gonna use. So now we're gonna actually try and write a quick English essay. More than 20, I think we should be all right. So I'm not gonna use the current repo anymore. I'm just getting an internet connection warning, so I'll just give it a couple of seconds. Right. So now I'm gonna create another new repo. So create new repository. I'm gonna call it Shawshank essay, but you can name it after any book or movie you like, or any book or movie you don't like, that you'd like to write an essay about. I would suggest you tick "initialize this repository with a README", and I'll show you what that does later. That's probably all we need to do for now. Click create repository. So now you should have your repository in there, and you should be able to create your new one. I'd call it Shawshank essay, but you can call it whatever you like.
If you go back to your profile page, you won't see your repository yet, because we haven't published this. Remember, it's just on our local machine. If we click publish repository, we get the choice. Do we wanna rename it when we publish it? Do we want it to be private? Again, I'm gonna make it public. That way you can all see it if you want to, and I'll send that in the code, in the chat, sorry. So now I refresh this and I have Shawshank essay, updated 15 seconds ago. I'll send that in the chat just so you can all see it is available for everybody. And this is a new file we didn't see before, README.md. A lot of good repositories online will have a README, and it's supposed to contain the instructions to sort of get started with this. So this is the file that will render down here. It says Shawshank essay. So this is where we could write some tips for people that are coming here. Open the notes.txt, read this essay, run this Python script, all that kind of stuff generally belongs in a README. Okay, so let's start by adding something to that README. So I've got my Shawshank essay with a README, and I'm gonna say "read essay.txt for Shawshank content". And we'll see here, we've edited the README file. We've added that line we just saw. "Update README" sounds like a good message, so I'll push that. And again, we can see that history locally, but this history is also replicated on GitHub. Next I'm gonna do what we did before and create a notes file. So let's call it notes.txt. And in here I'm gonna write something like just introduction, paragraph one, paragraph two, and conclusion. That's sort of the structure I want. So we see we've got a new file, notes.txt, denoted with this little green plus, and we can see all those lines I added. "Create notes.txt" sounds pretty good to me, so I'll push that as well. And again, maybe I'll flesh this out a bit. Maybe I'll say paragraph about birds, paragraph about time progressing.
And, you know, this could just be a quick way to make some notes, do some extra work while I'm on a bus or wherever I am. So "add some notes on paragraphs". And at this point, I'm pretty happy with my notes. So copy that file, and I'm gonna copy it out to an essay.txt file as well. So now, although this file currently holds my notes, it's going to hold my essay one day. So "create new essay file, replicating note structure", something like that. And then now I can start writing. Always push as well, sorry. So now instead of "introduction", maybe I'll just start actually writing the introduction. So let's call it introduction, you know, "freedom is represented in many ways in The Shawshank Redemption". And I could say here "add introduction to first draft", push it, and we can do the same with all these paragraphs. So paragraph one, "birds can fly away from prison", and paragraph about time progressing, "prison lasts a long time". So again, we can see we got rid of those lines entirely and added paragraphs, birds can fly, new lines, "add paragraphs one and two". And then for the conclusion, I'll write something like "Andy is not free, he is in prison". So I could say "add conclusion first draft". Again, commit with that message, push it to origin. Cool, I've got some good questions. Actually, let's go through those questions now. Most people would say use MS Word to write documents and essays. Is it no good to use that? So I would say a lot of files tend to not track changes very well. So Jupyter notebook files don't track changes well. I don't think Microsoft documents track changes well either, as it's not really storing plain text. What I would say is, you know, a lot of modern office software, LibreOffice, Microsoft Office or Google Docs, does have a sort of version control system built in. I would probably maybe use GitHub in those examples to track, like, versions that you're happy with.
You know, keep them in Word and don't worry about tracking line by line changes like this, but you still see points of "I've finished this chapter", "I've finished this draft" that you can jump back to, basically. And I believe that's the same for any file, like GitHub will track any files you give it. The one caveat I'd say with that is that this is a free service. So GitHub generally gives you, I think, two gigabytes of file storage per repository. And I don't think any single file is allowed to be over 100 megabytes, which is quite hard to do if you're writing a document, but quite easy to do if you're trying to store huge amounts of data. So OpenOffice should be fine. But again, you might not be able to, you know, click it and try and view it in GitHub. You might not be able to actually read it here. It might not render those particular files, but I've personally not tried it with a doc. Is GitHub suitable for hosting sensitive data within a team, or should I hold the shared code with the data stored locally? So what I would suggest is that you don't store data on GitHub, especially if it's sensitive. Even if it's private and within your team, I'd try and come up with a way to access that data outside of that. But I would use GitHub to document how you access that data within the team. So you might have like a little, you know, 10-point bullet point list of how to get data access. And I would put whatever exploration you do there. So if you have, you know, a piece of analysis that uses that data, I would put that on GitHub, but I'd make sure that you're not rendering the actual analysis of that data. So, you know, you might have certain lines of code that run, but don't store the outputs of those lines of code with the inputs that are sensitive, if that makes sense. But yeah, I thought I'd answer those questions just while we're actually here. Okay, so at this point, we've got our essay, where am I at? Right, so I think we've got a first draft at this anyway.
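One way to act on the file size caveat is to scan a project folder before committing. The sketch below is a minimal Python example, assuming the 100 megabyte per-file figure mentioned in the talk; check GitHub's current documentation for the exact limits. The `oversized_files` function and `LIMIT_BYTES` name are invented for this illustration.

```python
from pathlib import Path

# Per-file figure mentioned in the talk (an assumption here, not an official value).
LIMIT_BYTES = 100 * 1024 * 1024


def oversized_files(root, limit=LIMIT_BYTES):
    """Return (relative path, size in bytes) for files larger than `limit`.

    Anything inside a .git folder is skipped, since that is Git's own storage.
    """
    root = Path(root)
    found = []
    for path in root.rglob("*"):
        relative = path.relative_to(root)
        if ".git" in relative.parts:
            continue
        if path.is_file():
            size = path.stat().st_size
            if size > limit:
                found.append((relative.as_posix(), size))
    return sorted(found)
```

Calling `oversized_files(".")` from inside the repository folder lists anything that is probably better kept out of GitHub, which for sensitive or very large data is usually the right choice anyway.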
I probably should have called it a first draft. Let's call it a first draft. Actually, let's get rid of these paragraph titles because they don't really need to exist. And I'll say "first draft finished". So again, instead of, you know, calling the file first draft, we just have a point in time that's referred to as first draft, where we know that the essay file was our first draft. And I can proofread it. So if there's any other typos, so I can see here, for example, I've spelled Andy wrong, we could run our spell check. Right, let's just call this, you know, "spell check of first draft". So again, there's no need for a file called "run with spell check", or for you to remember in your head that you've run a spell check on certain sections. You've just got that in your history. And you can see this history is building up: our initial commit, we wrote a README, we created some notes, added some notes on a paragraph, created an essay file, added an introduction, wrote two paragraphs, added a conclusion, finished the first draft and spell checked. So we've got this like real accountability with this project that we're building up. Okay, next one. So at this point, I have my first draft and I could send it to somebody else to review. So again, I've literally got a link to this file that I can send to anybody. People can go straight to this link and they can, what's called, fork this project up here, though you don't need to do that. They could fork that project and make what's called a pull request with some changes. But what I'm just gonna do is actually edit the file on GitHub and pretend that I'm the teacher marking it. So I might say something on GitHub like "remove this title", "remove this title". And I might say here, "you've spelled Andy wrong", right? You've spelt Andy wrong. And that Andy typo hasn't been fixed yet because I haven't pushed that commit. So these two things could be happening at the exact same time.
And that would be a big problem if we were working in Google Docs, right? If we were just overwriting each other's messages. If it was Microsoft Word, we might be sending these files in emails, and I might make fixes over here while my teacher might make fixes over there. But these can happen basically asynchronously. So that teacher commits that. I'm gonna push my fixes at the same time, see what happens. So we get an alert, new commits on remote. Desktop is unable to push these commits because there are commits on the remote that are not present locally. So that just means someone's pushed code to GitHub that isn't me, and I've not got it yet. So I can fetch. So this is just where we check what those changes are and if there's any conflicts. There we go. And you can see up here, there's a little arrow. So there's one commit I need to push and one commit I need to pull. And we actually have a conflict. So this is something we haven't seen yet. We can actually fix this in notes. It might be easier to read. So if we look in, sorry, in essay. So in essay, we've got this new strange notation going on. So this is my version where I fixed the typo, and this is my teacher's version where my teacher has left a note to remove the title and that I've spelled Andy wrong. So what I need to do is agree with my teacher. I'd probably be sat next to them two years ago. You'd be sat next to them and you'd say, okay, we definitely need this piece of code, but we definitely need this text because that's new and that works with this API, and all this stuff. So what you need to do is remove all the lines that contain any of this weirdness, and you need to decide, do I wanna keep my teacher's note? Or do I wanna keep my fixes? So I've already fixed the spell check, and I can remove the title like this, and then I can just delete that altogether, and I can save that and make another commit, you know, "fulfill teacher's review", I suppose. And I can commit that and, again, push.
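The strange notation in the conflicted file is Git's conflict markers: a line starting with `<<<<<<<`, then our version, a `=======` line, the other version, and a closing `>>>>>>>` line. As a rough sketch of what resolving by hand amounts to, the Python below keeps one side of every conflict and drops the marker lines. The `resolve_conflicts` function, the essay text, the typo and the labels after the markers are all illustrative assumptions, not taken from the demo.

```python
def resolve_conflicts(text, keep="ours"):
    """Return `text` with every conflict block replaced by one side.

    keep="ours" keeps the lines between <<<<<<< and =======,
    keep="theirs" keeps the lines between ======= and >>>>>>>.
    """
    if keep not in ("ours", "theirs"):
        raise ValueError("keep must be 'ours' or 'theirs'")
    resolved = []
    side = None  # None outside a conflict, otherwise "ours" or "theirs"
    for line in text.splitlines(keepends=True):
        if side is None and line.startswith("<<<<<<<"):
            side = "ours"
        elif side == "ours" and line.startswith("======="):
            side = "theirs"
        elif side == "theirs" and line.startswith(">>>>>>>"):
            side = None
        elif side is None or side == keep:
            resolved.append(line)
    return "".join(resolved)


conflicted = (
    "Freedom is represented in many ways in The Shawshank Redemption.\n"
    "<<<<<<< HEAD\n"
    "Andy is not free, he is in prison.\n"
    "=======\n"
    "Adny is not free, he is in prison. (You've spelled Andy wrong)\n"
    ">>>>>>> teacher\n"
)

print(resolve_conflicts(conflicted, keep="ours"), end="")
```

In practice the right answer is often a mix of both sides, which is why this step is normally done by a person reading the file rather than by a script.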
So that's dealt with the conflict that we had with the teacher. Historically, that would be a piece of paper, and we would have to, you know, make all those edits off the piece of paper, maybe wait for the piece of paper or wait for comments online. Cool. And we can see the conflict there, so we have actually left that one in. We could at this point just decide that we're really annoyed and we never wanted our teacher to review our work anyway, and we can actually revert back to before that review ever happened. So in the history, we can say revert changes. We can revert each commit one at a time. So I could say revert "fulfill teacher's review". Oh, do I need to do both? Push that. And then I can revert my teacher's review as well, merge files. You can also jump back to a particular point. So I could say here, you know, create branch from the spell check, no teacher. Okay, I think I messed it up a little bit, but you can jump back to previous points or revert particular changes. So you can see there, I've just literally undone my review from my teacher's notes with that commit. And anything else? Yes, at this point, again, if I did have a twin brother, all my work is backed up on the cloud. So if this computer broke, if anything went wrong, that's okay. I've got that history showing everything I've done so far and why I've done it, with friendly messages. You can see I've definitely spent 15 minutes on this project because you can see me committing like every minute. So you could see that somebody worked on something for months and months through that. So again, we've got more accountability there. But yeah, I'll leave it there. So I'll jump back to the slides for a second very quickly, and then I'll get to questions. So that's not where I want to be. Okay, oh, one second, sorry. I don't want to start from the beginning. Okay, so did we fix what went wrong?
It was a bit more complicated, but we did solve the things that went wrong without version control. So we overwrote files but could at any point revert to previous versions of these files. The context of a change is stored in that friendly commit message instead of in the file names. We could still delete work if we wanted to. And if something happens to my computer, we only lose the most recent set of changes instead of all the work. That's just effectively taking good backups. We could easily collaborate with others, as we saw with the teacher. We can both write notes at the same time, though we might have to agree on how we deal with a conflict after that. And our work is well documented, reproducible, accountable, and we have evidence that we did this work, right? So some time for some quick Git tips. I'm aware that it's 2.57. So I will have time at the end, or you might have to sit around if you want your question answered. Some Git tips. The most common question I hear is, how do I know when I should be writing those commit messages? I would see it as very similar to saving a project, but the only difference is that the purpose of committing is to have a good save point to kind of jump back to, if that makes sense. So we might save a file if we were gonna, like, quickly go for lunch or something, but you might not want to ever revert to that mid-sentence paragraph, or you might not ever want to revert back to a piece of analysis that wasn't working or some Python code that was buggy, unless you needed to back that up. So it's sort of, with context, would you want to go back to that point in time? Next, I suggest writing READMEs for your projects. It'll be helpful for you when you forget what you did six months later, but it'll be instrumental to anybody else that ever looks at your project.
And then finally, I'd suggest you try and visualize your project as sort of a tree of changes, and with that sort of tree metaphor, a lot of these words like pulling and pushing and merging begin to make a bit more sense as well. But there's a lot to do to sort of get to that mental model of this kind of stuff. So that's all from me today. If you liked what you saw and you're curious how to continue using Git, I suggest you just get comfortable doing what we just did there: put some files on GitHub, push and pull those changes frequently. That's only the first step on this Git and GitHub ladder. There's a lot more cool advanced stuff you can do, and I've got some examples of that in the resources at the end of this presentation. The bar is really, really low here with reproducibility work. So if you have anything in the space of reproducibility, any publicly available code, any README describing how you access the data for your project, to me you're in the top percent of academics. The fact that you're here is a very good thing. Git is way more powerful than what I just showed, so there's some example exercises here, some interactive tutorials that I recommend, and another talk on reproducibility from the UK Data Service as well. There'll probably be a series of reproducibility things upcoming as well, so make sure you're on the UK Data Service mailing lists and all that kind of stuff. There's some more difficult resources there, and there's the paper that goes into the details on replicating psychology, if you're interested too. So there we are. Any questions? And if not, though I see there are some questions, I'll leave a slide of just my contact details. So thank you very much for coming. I'm Joseph Allen. You can message me on Twitter at Joseph Allen 1234, or at Manchester University on my email, joseph.allen at manchester.ac.uk.