So welcome back, also if you're watching this on Moodle. Oh wait, I just see that something went wrong with the mood box thing, but I'll fix that as well so that it goes behind my webcam, because now, if there would be too many people with a mood, then all of a sudden I would have to duck down to be below the text.
Anyway, journal impact factor. The impact factor of an academic journal is a measure reflecting the yearly average number of citations to recent articles published in that journal. So it is the number of citations received in that year by articles published in the journal during the two preceding years, divided by the total number of articles published in the journal during the two preceding years. That's the way that you would calculate the journal impact factor. But because of this, if you are a new journal, you only get an impact factor after two years of being indexed. So that is one of the drawbacks of the journal impact factor.
There is a lot of controversy around impact factors, because they are not always a reliable instrument. "So let's hit him with the text." Yeah, I'm going to duck a little bit. Because if you make a list like the one up here, and you give people a way to put themselves on the list, then people will start to game the system. Just by making an ordered list online, numbered 1, 2, 3, 4, 5, people will start trying to game the system. And journals do that. One of the things that journals do is publish a large percentage of review articles, because review articles are generally cited more than research articles. Novel research doesn't rack up a lot of citations in general, but an article which very nicely describes all of the literature which is available will of course get cited much more, because it gives an overview of the whole field.
Then there's also the fact that journals decline to publish articles which are unlikely to be cited, which of course is bad for science in general, right? Because if you have good results, then you want to have your results published in a nice journal. But if you're working in a field with only five other people, the journal will say: well, this is a really good finding, but we're not going to publish it, because we won't get the citations and it will hurt our impact factor.
And there's also the timing of publication. Say you submit your article in January to a journal, let's say Nature or Science, and Nature or Science says: well, this is a top-of-the-line article, this will be cited hundreds of times. You go through the review period and then you get accepted at the end of the year, say in November. What they will do is only put your article online on the first of January, because the date on which an article is published counts heavily. If the article is published in November, then its first year is only two months long, so there are only two months for this article to get citations. Whereas if they publish it in January, it has a whole year, 12 months, to rack up citations.
And not only that, but there are also coercive citations, where an editor or a reviewer forces an author to add citations to the editor's or reviewer's own work, citations which are not relevant or not relevant enough. And that happens a lot as well. So there are a lot of controversies surrounding the journal impact factor. And all of these things also kind of hold for author-level metrics, so author impact factors. Let me actually just do this, and then I can move myself down so that you guys have a more difficult time hitting me with the text and everything looks good again. So journal impact factors are nice, but there are issues with them.
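Before moving on, the impact factor calculation described earlier can be written out as a small sketch. The function name and all of the numbers are made up for illustration; they do not belong to any real journal.

```python
def journal_impact_factor(citations_this_year: int, articles_prev_two_years: int) -> float:
    """Citations received this year by articles from the two preceding years,
    divided by the number of articles published in those two years."""
    if articles_prev_two_years <= 0:
        # A new journal has no two-year window yet, so no impact factor.
        raise ValueError("no articles in the two preceding years")
    return citations_this_year / articles_prev_two_years


# Hypothetical journal: 120 articles in 2019 and 130 articles in 2020,
# which together were cited 500 times during 2021.
articles_2019 = 120
articles_2020 = 130
citations_in_2021 = 500

jif_2021 = journal_impact_factor(citations_in_2021, articles_2019 + articles_2020)
print(f"Impact factor for 2021: {jif_2021:.2f}")
```

Anything that raises the numerator (more review articles, well-timed publication dates) or shrinks the denominator (declining rarely cited work) pushes this number up, which is why it is so easy to game.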
So when we look at author-level metrics, there is the average number of citations per article, which is a very simple metric. But in general, it's not the most accurate metric. What is more accurate is the h-index. The h-index attempts to measure both the productivity and the citation impact of the publications of an author. It was invented in 2005 by a physicist called Jorge E. Hirsch. It was published in PNAS and it is one of the most cited papers ever, I think. So it's a paper about citations which gets a lot of citations, which is interesting.
What you do is you order your papers by their number of citations, called f, from the largest to the lowest value. So if you have like 50 publications, then you just order them based on the number of citations. And then you look for the last position at which f is greater than or equal to the position. That position is h: you have h papers which each have at least h citations. So if you have 15 papers, then the h-index is the last point in the ordered list at which the citation count is still at least as high as the number of papers counted so far.
"I didn't know citations are such a grind." They are, in a way. "What if I just have one paper or one citation?" So if you have one paper with one citation, then your h-index is one. If you would publish another paper, then your h-index would still be one, right? Because you only have one paper with one citation and one paper with zero citations. When your first paper gets another citation, so it's two on the first paper and zero on the second paper, your h-index will still be one. Only once your second paper gets cited can your h-index start climbing: as soon as your second paper also hits two citations, you have two papers with at least two citations each, so h will be two.
And not only that, but there's the simple h-index, and then you have the five-year h-index, which only goes back over the last five years. So all publications older than five years are ignored in the five-year h-index.
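The ordering procedure above can be sketched in a few lines. This is a minimal illustration: the function names are hypothetical, the citation counts are invented, and the five-year variant follows the simple description given here (ignore older publications), which may differ in detail from how a particular service computes it.

```python
def h_index(citations):
    """Largest h such that h papers each have at least h citations."""
    ordered = sorted(citations, reverse=True)  # f, from largest to lowest
    h = 0
    for position, f in enumerate(ordered, start=1):
        if f >= position:
            h = position  # last position where f >= position
        else:
            break
    return h


def h_index_since(papers, first_year):
    """h-index over (year, citations) pairs, ignoring papers before first_year."""
    return h_index([cites for year, cites in papers if year >= first_year])


# The small cases from the discussion above.
print(h_index([1]))     # one paper, one citation
print(h_index([2, 0]))  # first paper cited twice, second never
print(h_index([2, 2]))  # both papers cited twice

# Hypothetical author: (publication year, citation count) per paper.
papers = [(2010, 50), (2011, 40), (2012, 30), (2018, 3), (2020, 1)]
print(h_index([cites for _, cites in papers]))
print(h_index_since(papers, 2016))
```

Sorting first keeps the loop trivial: once a paper has fewer citations than its position in the list, no later paper can qualify either.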
Then there's the i10-index, which is the number of publications that have at least 10 citations. It's very simple and straightforward to calculate. The problem here is that it's only used by Google Scholar. So here you see the different citation indexes for me. I think this was at the beginning of 2017. You see that by then I had 585 citations in total, and in the last five years I had 543 citations. My h-index went down, right? So that means that I was a better scientist before compared to now, right? So the h-index varies, and the i10-index does the same thing. So it's a way of kind of showing the world what you have done.
I could actually show you guys the current one that I have. So let's go to Scholar, go to my profile, and then let me show you guys the Firefox window. So you can claim your own profile, and then, where is the nice bar graph? You can see that by now I have racked up 1363 citations. Since 2016 it's 864. My h-index is 16 if you count it across all time since I started publishing, and since 2016 my h-index is 14. The i10-index is 21 if you count all of my publications. If you count only the last five years, then my i10-index is only 16, right? So that means that I should publish.
"Does it count if I cite myself in my next paper?" Yes. "When was my first publication?" I can just sort by year and then go all the way to the bottom. I think it was in 2010. That's my first official publication. I did have this one, which is not really a publication, but during my master's I collaborated on this genome-wide identification of master regulators. And then this is from a symposium which I joined, so this is not a publication, it is from a symposium. And this one is a real publication, but I generally don't count it, since it was done during my master's and I'm only somewhere in the middle of the author list. So I'm not one of the first authors. But the first real paper that I published was the QTL paper, and the R/qtl paper, which is also from 2010.
And you can see that by now I have 63 publications in total, which also doesn't mean that much, but it's just a metric.
Alright, so Google Scholar. Google Scholar provides a simple way to broadly search for scholarly literature. You can search all scholarly literature from one convenient place. You can explore related works, citations, authors and publications. You can locate the complete document through your library or on the web. You can keep up with recent developments, and you can check who is citing your publications. You can create an author profile like I just showed you. But Google Scholar also has a lot of disadvantages. It's almost like a social networking site in a way. And just like ResearchGate is getting more and more important nowadays, Google Scholar is kind of the same thing. But the problem with Google Scholar is that it includes a lot of dirty data. Google Scholar profiles may not last, and it only captures a narrow kind of scholarly impact. Like I said, it's not science unless it is in Web of Science.
So these people here, Emilio Delgado López-Cózar, Nicolás Robinson-García and Daniel Torres-Salinas, in 2012 or in 2013, showed how easily you can inflate your citation counts on Google Scholar. And the way that they did it is just by putting PDFs on a website, PDFs which contain nothing but citations to their own work. Google Scholar just takes the PDF, says, well, this looks like a citation, and then credits you with one citation. So by uploading documents that do nothing but cite your articles, those articles will just go through the roof in Google Scholar. And that is because Google just indexes what's there. They don't really care about the dirtiness of their data. Although nowadays, let me switch back, they have this star here.
And this star means that there might be something weird here, right? The star tells you something about the citation count: here they say that there are two different versions of this article. One is cited seven times and the other one is cited as well. So these articles were merged, and then the citation counts were merged as well. So it might be that it doesn't really reflect the official citation count. And that sometimes happens, especially when articles get retracted. So they are getting better at cleaning up the dirty data. But it's still not perfect, and you can still become the highest cited author ever. Just read this paper. It's a really good paper to read if you want to learn more about how citation indexes work and how you can game the system to make yourself look more important than you really are.
But in the end, Web of Science is what counts, right? Normally when I write a CV or something, I list not only the Google Scholar citations that I have, but also the citations from Web of Science, because those are reliable: they are more or less manually checked before they are entered into the database.
Alright, so a little word about ResearchGate. It's a social networking site for scientists and researchers. "But citations count only if I publish the paper where I cited someone." Yes. So a citation is when you publish in a journal which is indexed in Web of Science and you cite a paper which is in a journal which is indexed by Web of Science. Only then do the authors get credited with one citation. And it doesn't matter if you cite the same article five times in the same paper. In the end, it just counts as a single citation. So if you have blah, blah, blah, citation to Danny Arends, blah, blah, blah, citation to Danny Arends, and these two citations are to the same paper, then it only counts as one. Clear? "But the dirty citations, people didn't publish their work." No, for example, it might be something like a PhD thesis, right?
In a PhD thesis, there are citations as well. And PhD theses are generally not peer reviewed, but they are published by the university, right? So if you go to the University of Groningen, you can find my PhD thesis there. And Google also found my PhD thesis there. So all the work that I cited while writing the PhD thesis gets counted as a citation for these people. Well, actually, it's not a real citation, because of course the University of Groningen is not an authoritative source according to Web of Science. According to Google Scholar, I cited these people, which I did, because I did cite them in my thesis. But my thesis is not a peer-reviewed publication. So those don't count according to Web of Science, but Google Scholar counts them. So there's a big difference in citation counts between the two.
All right, so ResearchGate is a social networking site for scientists and researchers. It was founded in 2008 by Dr. Ijad Madisch. And it's located right here outside. If I walk out of the building here at the Invalidenstrasse, there's the ResearchGate building where they have their main headquarters. The features include that you can upload papers, data, chapters, negative results, which is very good, because negative results are very important, research proposals, methods, presentations and software source code. The main thing about ResearchGate is that there's this question and answer section. So if you have a question about, for example, PCR or DNA metabarcoding, you can ask that on ResearchGate and people within the field can answer your question. And they can also vote on your question, and you can vote on the answers. So it's kind of a combination between Reddit on the one side and Facebook on the other side. But it's really a nice tool if you want to find collaborators, so people who are working in more or less the same field as you and who you might want to work with. So that's why it's really handy.
But there again is a lot of criticism about ResearchGate. One criticism is the emailing of unsolicited invitations to co-authors of its users. So they will pretend that they are you. Imagine that I would write a paper together with Commando. Then as soon as I upload my paper to ResearchGate, they will start emailing Commando in my name. And that's not done. I don't think that they do that anymore, but they used to do that at the beginning to get scientists to sign up to their website. And that's a little bit tricky. I think GDPR-wise, nowadays in Europe, that would not be allowed anymore.
They have something like a ResearchGate score. I can show you my ResearchGate profile as well, just to show you guys how that looks. So this is my ResearchGate. And then I can go to my profile, right? And then it says here that my ResearchGate score is 33.79. And then I can go to things like my statistics to see what my statistics are, right? And then they have these nice graphs and all of these things. But the thing that I wanted to show you is, I think, under the scores. So you can see that in their score, multiple things count. It's not just the publications. It's also the number of questions that you asked, the number of answers that you've given and the number of followers that you racked up. So all of these things add up to your ResearchGate score, which of course is not really a fair way of doing it. And no one knows exactly how they calculate this ResearchGate score. It's kind of their trade secret. But they weigh multiple things in there.
And of course, take the questions and answers: if I would be very active on the site and would answer a lot of questions, then I would artificially inflate my score, which would make me seem like a more important scientist, which doesn't mean much, right? You can make a ResearchGate profile, have no publications whatsoever, and just start asking questions and answering questions, and that will improve your score. "So it's citation social media." Yeah, yeah, it's really a social media site. And nowadays you have literally hundreds of these sites, because after ResearchGate, Elsevier tried to do the same thing, and a whole bunch of other journals too. And you can see that I never go here, so I have 19 of these unread messages. But it is something that is there and, hey, you can use it to promote yourself and your work.
And there are profiles on the ResearchGate site which are not owned by real people, but which are created automatically. And sometimes even dead people have a ResearchGate profile, which is of course not really nice. And they are often criticized for failing to provide safeguards against the dark side of academic writing. So there are a lot of fake publishers that use ResearchGate to try to convince PhD students to publish in their non-existing journal, and they make money from that. Because normally when you publish something, you have to pay publication costs. And these journals are not real journals. They're more or less ghost journals, so they don't really exist, or at least not on paper. So they try to find people who are more or less at the end of their PhD and who just need one or two more publications to get a cumulative thesis. And then they try to have these people publish their work in this non-existing journal. And ResearchGate is really bad at separating real journals from fake journals. It's just not good.
Alright, so that's all that I wanted to say about social media or social networking for scientists. And of course using Twitch is just a different form, right? I have 77 followers on Twitch, which makes me a very, very famous scientist, or not, I don't know. But you could mention it, right? As soon as everyone in science starts talking on Twitch about science and bioinformatics and stuff, then of course the number of Twitch followers that you have starts counting, for some weird reason. But that's not the way that it should be. In the end it should be about the quality of your work, and the quality of your work is measured by the number of people in your field that agree with you, and people agreeing with you is measured, kind of, by citations. Or at least that's the metric that we use nowadays.
So about managing these things, right? Because if you're starting to write papers, then of course you have to make citations. Hey, you cannot just write a paragraph and come up with it just like that. No, you have to build up your argument, right? So you have to say: well, author A proved this, author B proved that. So if we take these two facts and combine them with the results that I just had, then we find out that this is the way that it is, right? So this is the truth. To manage scientific references, you can use a reference manager. A reference manager manages things like publications, reviews, bookmarks and notes, and the idea is that you collect, read and integrate all of these references into a manuscript. And normally that's very, very time consuming. When I first started out in like 2007, 2008, no one told me that there was something like Mendeley or EndNote. So I was just manually typing over all the information needed.
And then citations had to shift from one format to another, because you decided: oh, I don't want to submit to Nature, I want to submit to Science. Hey, the citation style of Nature is different from that of Science. So you had to go through the whole paper and change all of the citations. But reference managers make that a very easy job. "Me too at some point." Yeah, at some point everyone wants to publish an article, right? That's the reason why we're in science. And so the idea of a scientific reference manager is to not have this, so to not have a desktop which is filled with all kinds of icons, but to have everything structured in a really nice way.
And citations are things that you can identify in a unique way. For books, if you want to uniquely identify a certain book, there is the ISBN, the International Standard Book Number. Then things like specific volumes, articles or other identifiable parts of a periodical have a Serial Item and Contribution Identifier, or SICI. Most articles published nowadays get a DOI, a digital object identifier. And not only electronic documents can get a DOI, you can also get a DOI on things like a data set, right? So if I have done a lot of work in the lab, and I've collected millions of data points on hundreds of samples, then I can take my data set and ask for a DOI on that data set. So when someone else uses my data in the future, they can cite the data set, which means that you don't have to write a publication. No, you just have to take your Excel file or text file, or whatever you use to store your data, and make that publicly available. Besides that, biomedical research articles get a PubMed identifier. So when your article gets indexed by the PubMed database, PubMed puts a PubMed ID next to the DOI, so that you can uniquely identify it.
So a reference manager supports researchers in performing three basic steps. It helps you search for relevant literature. It helps you store relevant literature, and not only the literature, but also things like notes or PDFs or other things. And it allows you to insert citations and references in a chosen style when you write a certain text. I hope that I still have the example here.
There are a couple of citation managers. I think that there are two or three very big ones. The first one that I want to talk about is EndNote. EndNote is a commercial reference manager. It is available for Windows and Mac OS X. So that's the reason why I'm not using it, because I have a couple of machines that run Linux, and I cannot use it under Linux. EndNote groups references into libraries, and the file is called an .enl file. An .enl file has a corresponding data directory. So if you want to give your library to someone else, you have to send them the .enl file plus the data directory which belongs to it. References can be added manually, you can export them from the web, you can import them, or you can copy them from another EndNote library. And nowadays everything is in the cloud, so EndNote also has a cloud solution, which is a web-based implementation of EndNote. And here you have the integration with Web of Science, which is really nice.
Besides EndNote, you have Mendeley, and that is the reference manager that I am using most of the time. It is a free reference manager, and it's also an academic social network, more or less, just like ResearchGate in a way. In Mendeley you also have your own Mendeley profile with all your publications and links to Scopus and all of these things. So it's kind of the same as ResearchGate, without the question and answer section and without the mailing of people and these kinds of things. It's available on Windows, Mac OS X and Linux. Nowadays it's even available on your smartphone if you want.
And you can back up and synchronize across multiple computers via an online account. So everything works via the cloud. It has an integrated PDF viewer where you can put sticky notes, there is text highlighting, and you can do full-screen reading if you have an e-reader or Kindle or something like that. And the sticky notes and your highlights and stuff are saved with the paper. So when you share a paper or a citation that you have with someone else, they can also see your sticky notes and your highlights, which can be really, really handy. They have an app for phone and iPad, and it provides readership statistics about papers, authors and publications, which is also kind of nice, because it helps you to see which of your papers people are reading, right? Instead of just looking at the citation count, you can get an idea of which topics are interesting to the readers of your articles.
So, very quick examples. You have to create an account, and after you create an account you can download the software. When you start the software, you have to log into your account, and you will probably want to install the plugin to integrate it with Microsoft Word or some other text editor that you use. I think they also have a plugin for LibreOffice and these kinds of things. And you can add a bookmark in your favorite browser, and this will allow you to just click the bookmark when you are, for example, on a web page or on a PDF document. This will then automatically make a citation to the web page or to the document that you are currently viewing, which is sometimes really handy. I don't have it installed because I don't like things in my browser. Well, I do like things in my browser, but those things are like ad blockers and these kinds of things. So after you log in, it looks like this. And here you have the literature search and Mendeley Suggest.
So the literature search is more or less where you can search for literature. And Mendeley Suggest is kind of this social media thing where they look at what you've been reading in the last couple of months, and then they suggest articles which might be useful for you as well. So, for example, if I search for cow genetics, then you get a whole bunch of papers, and you can just click on the paper that you want, select it, and then say: save the reference to my library. And the nice thing is that it also gives you the abstract, right? So you can see if this is really the paper that you want to cite. It gives you an overview of what the paper found.
Once you have saved your reference, you might want to cite this paper. So when you go to Word, as soon as you have Mendeley installed, you get this references tab. And, hey, for example, you can write: we use the whole genome assembly of Bos taurus. And then you click the insert citation button. A new window opens up. You search for cow genome, you select the correct paper, and then it will add the citation for you, right? So it will say "we use ..." and then it will say Zimin et al. 2009. But this is a special field, so you cannot edit this field directly.
And then if you want to use a different citation style, hey, you can click on the top and you can say: well, I want to use the American Psychological Association citation style, or the Nature style, or the Science style, or the Frontiers in Genetics style. So every journal has a slightly different citation style, right? Sometimes it's Zimin et al. 2009. Sometimes it's just a number. Sometimes they mention two authors and then et al. It's different for every journal that you submit to. And if you want to add the bibliography, so the overview of all the citations, then you have this insert bibliography button, which allows you to insert the bibliography. So then it looks more or less like this.
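As a side note, the reason a reference manager can switch citation styles so easily is that it stores each reference as structured data and only renders it as text at the very end. A minimal sketch of that idea follows. The two style rules, the class and the function names are invented for illustration; this is not how Mendeley is implemented, and these are not real journal styles.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Reference:
    authors: Tuple[str, ...]  # family names, in author order
    year: int
    title: str


def cite_author_year(ref: Reference) -> str:
    """Author-year style: one author, two authors, or 'First et al.'."""
    if len(ref.authors) == 1:
        names = ref.authors[0]
    elif len(ref.authors) == 2:
        names = f"{ref.authors[0]} and {ref.authors[1]}"
    else:
        names = f"{ref.authors[0]} et al."
    return f"({names}, {ref.year})"


def cite_numeric(ref: Reference, library: List[Reference]) -> str:
    """Numeric style: the position of the reference in the bibliography."""
    return f"[{library.index(ref) + 1}]"


# Hypothetical library entries (placeholder authors and titles).
paper_a = Reference(("Smith", "Jones", "Lee"), 2009, "An example genome paper")
paper_b = Reference(("Miller", "Brown"), 2015, "Another example paper")
library = [paper_a, paper_b]

sentence = "We use the whole genome assembly"
print(sentence, cite_author_year(paper_a))
print(sentence, cite_numeric(paper_a, library))
print(sentence, cite_author_year(paper_b))
```

Because the stored `Reference` never changes, switching the whole manuscript to another style only means calling a different rendering function on the same data.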
And of course, you can then just select the style that you need from the drop-down list. For example, if I want to switch to the Genome Biology style, which is a slightly different style, it uses a number. And then the citation here gets changed as well. You can see that here there's the DOI in there, which is not in this other citation style. And here you don't have spaces and tabs. So it saves you a lot of space and a lot of time. So if you write an article and then you want to go from one journal to another, hey, for example, you think: well, I wanted to write it for Genome Biology. You send it to Genome Biology, and Genome Biology says: well, we don't want your article, but you can submit to Genome Research, which is our sister journal. Then you see that Genome Research has a slightly different citation style. But updating the citations is just a single click and selecting a new style. There are a lot of styles. There are even more styles than you can think of. But when you go to Mendeley itself, then you have to click the more styles button to download additional styles.
And this is an old one, because we already had the break, right? And there are 12 additional slides. So in theory, I was hoping to be here: version control. So, any questions about citation managers? The reason why I like Mendeley is because it's free and it saves stuff in the cloud, so it's available everywhere. And if you guys have questions or suggestions, then just let me know in the chat.
All right. So the last part of the lecture, the last 20 minutes or so, I wanted to spend going through version control systems, because version control systems really have saved my ass many, many times. And I'm still happy that my PhD supervisor, Piotr, forced me to use version control, because it's difficult to start getting used to version control.
It seems like a lot of additional work, every time that you change some code, to make a commit, so to make a bunch of changes and then save them somewhere. And the reason why you want to use version control, I think I have a slide about that, so I will talk about that later. But version control comes in two very different types. One of them is centralized version control, like Subversion. And the other one is distributed version control.
In centralized version control, the repository, so the code or the changes to the code, all lives on a single centralized server. So this is me on my computer. Everyone gets a working copy. You can commit changes from your working copy back to the repository, and you can update your working copy with the changes that other people have made to the repository. So if you're working together with like three people, everyone gets their own version. And then there's only one truth, right? Because the server repository here is the true version. That's the real version, right? So everyone updates at the beginning of the day, then people start working on their working copy, changing pieces of code, and every time that they change something fundamental, they commit back to the repository. But before you can push your commit to the repository, you have to do an update to get all the changes that the other people made. It might be that the guy working next to you committed five minutes earlier. So you have to bring in his changes from the server to your working copy, integrate those changes with what you have done, and then commit the changes back.
Distributed version control is relatively new. Well, not so new, but distributed version control works a little bit differently. You still have a server which has kind of the golden standard repository, but everything is split up, so that everyone has their own repository.
So instead of just having the working copy, you also get a copy of the repository. So you can commit and update locally, and only when you feel like, oh, now I've added a complete feature, then you push and pull to the server repository, right? So to the higher-level repository, which is online. So you have the ability to continue working on your local copy until you feel comfortable pushing everything to the repository. And everyone has their own version of the repository, and everyone has their own working copy. So it means that you don't have a single source of truth or a single golden standard, but everyone has their own repository, which can change and which will diverge over time as well. But by pushing and pulling, you can update the central repository with your changes.
So the purpose of a version control system is to enable multiple people to simultaneously work on a single project. For example, I am one of the programmers who works on GeneNetwork, and GeneNetwork is a group of people, we are between eight and 15 people. "When you work over the cloud?" No, no, no. Repositories do not get updated in real time, because that makes no sense. The idea is that when you update the repository, a lot of checks will start running on the repository. So it will do integration checks and it will run all the unit tests. And if any of these fail, then your push or your pull will be rejected, to make sure that the repository here is always in a working state, because people downloading the source code will download it from the repository. So the repository cannot be broken at any point in time. So the push and pull will run checks, or can run checks. So there's no updating in real time. No, you decide yourself which blocks of data you change.
So the purpose is to work together on a single project. However, it's also very useful when you are just one person with like six laptops, right?
I have a work computer, I have a computer at home, and I have a laptop which I take with me when traveling. On all of these I want to have the code up to date, but I also want to be able to work locally. Sometimes I'm on a plane flying across the US doing some work. The moment I land, I have Wi-Fi again, and then I bundle up the changes that I made during the flight and push them. I also use version control on my own projects when I'm the only one working on them. Why? Because sometimes you break things, and version control allows you to go back in time to find out where a certain thing broke.
One of the advantages of version control is that it integrates work done simultaneously. Simultaneously here does not mean in real time. It means over the span of, for example, two days. I can work on adding a new feature to GeneNetwork while someone else adds a different feature and a third person adds yet another feature. As soon as I think that my feature is more or less complete and ready for either production or testing, I push my code to the central repository. All kinds of checks are run by the central repository to make sure that I didn't break the code or put in an exploit, these kinds of things. Only then does the testing server pull the new version from the repository. So simultaneously means that if I work on a feature for three days, someone else can in the meantime also work on that feature or on a slightly different one. Only once we start merging things into the central repository do we need to fix conflicts between what we did.
The main reason you want to do this is that version control gives you access to historical versions of your project. We are working in science, and science should be reproducible, right? So I am writing code and then I'm publishing a paper. And then I continue working on my code, right?
Because I might be using the same code in a different project. But someone might come in two years' time and say: that analysis that you did in 2019, I want to redo it. What they can then do is take the version control software, roll the code back to the exact point in 2019 when I did the analysis, and reproduce everything that I did at that point in time. That means the path from the data to the graph in the paper is more or less fixed, and it is fixed at every point in time. It has to be, because a database might get updated in the meantime, or I might change my code slightly so that p-values start changing slightly. So to have reproducible research, you need everything from the raw data to the figure or the table that you present in your paper. The pipeline or code which couples these two together should be fixed and should be restorable. You should be able to go back in time, and then also forward again. So version control gives you historical versions of your project.
All right, some terminology for when we're talking about version control. A repository is a database containing all the changes. It is nothing more than changes to a version: it doesn't save files, it saves changes to files. When you first make a file, there's a new entry in the database. When you update the file, the edits are stored as changes to the file, not as a new file. The working copy is a personal copy of all the files, and people also use the word checkout for that. When I check out the repository, I get a working copy locally, which I can change and which I control, not someone else. A commit is a collection of edits on a working copy, and an update, also called a pull, is a collection of edits on the repository relative to your working copy. Right. So how does this look?
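To make the terminology concrete, here is a toy sketch in Python of a repository that stores changes rather than files. It is only an illustration under simplifying assumptions: the names Commit, Repository, commit and checkout are made up for this example, every edit is a whole-line replacement keyed by line number, and real tools such as Subversion or Git store their data quite differently.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Commit:
    """One entry in the repository: a message plus a set of line edits."""
    message: str
    edits: Dict[int, str]  # line number -> new text (hypothetical format)


@dataclass
class Repository:
    """Toy repository: an ordered list of changes, not a pile of files."""
    commits: List[Commit] = field(default_factory=list)

    def commit(self, message: str, edits: Dict[int, str]) -> int:
        """Store a collection of edits and return the new revision number."""
        self.commits.append(Commit(message, dict(edits)))
        return len(self.commits)

    def checkout(self, revision: Optional[int] = None) -> List[str]:
        """Rebuild the file as it looked at `revision` by replaying changes."""
        if revision is None:
            revision = len(self.commits)  # default: the latest version
        lines: Dict[int, str] = {}
        for entry in self.commits[:revision]:
            lines.update(entry.edits)
        return [lines[number] for number in sorted(lines)]


repo = Repository()
analysis_2019 = repo.commit("Add analysis script",
                            {0: "data = load()", 1: "p = test(data)"})
repo.commit("Use corrected test", {1: "p = corrected_test(data)"})

print(repo.checkout(analysis_2019))  # the code as it was for the paper
print(repo.checkout())               # the code as it is today
```

The point of the sketch is that going back in time is just replaying fewer changes, which is what makes an old analysis restorable.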
So I have my working copy here, where I make my edits, and this is the repository, which is the database of all edits and versions. I can commit my changes to the repository, and I can update, or pull, the changes from the repository into my working copy, updating it to the new version of the software.
There are two varieties, like I told you. In centralized version control there's just one repository, so there's one database maintaining everything. An example of this is Subversion, also called SVN. And then there are the distributed systems. These are a little bit more modern, they run a little bit faster, and they are less prone to errors, but they are a lot more complex to understand and work with because there are multiple repositories. You always have to realize: am I working on my own repository, or do I now want to make my repository equal to the repository online? Examples are Mercurial and Git.
Centralized version control we already saw. There's one central repository and each user gets their own working copy. As soon as you commit, all the other people see the changes when they update. There's no local repository which you can keep stable. When one of your colleagues commits a change, the repository gets updated, and before you can continue working you have to update, so you have to get those changes from the repository.
Distributed version control works by giving everyone their own repository and working copy. After a user commits, others have no access to the change yet, so you can make commits independently of other people. I can make changes and then more changes and more changes, and no one can see those changes until they are pushed to the central repository, because then they are visible to everyone. But they don't have to update.
They can say: no, I'm sticking with the version of the repository that I had. This Danny is a fool, he doesn't know how to program, so my version of the repository is the one that I want to work against. So when you update, you do not get the other person's changes unless you have first pulled those changes in from the central repository. The structure here is: you make a commit, you push the commit to the central repository, the others pull the changes from the central repository into their own repository, and then they update their working copy based on those changes.
A note about distributed systems: the commit and update commands only move changes between the working copy and the local repository. They do not affect any of the other repositories. The push and pull commands move changes between the local repository and the central repository, and they do not affect the working copy, so they do not affect how the code looks on your hard drive. A little note on Git: it doesn't separate pull and update. Well, it does separate them, but when you do git pull, it actually does a pull and an update. So it's a little bit of a misnomer, it's not named properly.
So version control lets multiple users simultaneously edit their own copies of a project.
Can you pull just a selected part of the code? No, no. When you pull, your repository is harmonized with the central repository, so you get all the changes that were made. And to make it even more difficult, it's not just a single version that you get, because a repository can have multiple branches, like different versions of the software. For example, I have a branch for Windows, one for Linux and one for Android. So for different operating systems there are different branches, and these branches are coupled together in a single repository. So it's everything or nothing? Yeah, no, because you can pull and nothing changes for you unless you update.
So you can pull in all the changes that people made, and unless you update your working copy, nothing changes in the code that you have written.
All right. Version control lets multiple people edit copies of a project and then merges them line by line. For each line in a text file or a code file, the new line is the original line if neither user edited it. So if no one touched the line of code, then this line of code is perfectly fine. If one of the users has edited the line, then the edited line is the new line, because edits take precedence over the original. That's because it's just a patch system. You have an empty file, then a patch comes in putting some text in there, then another patch changes part of the text, and then another patch changes a different part of the text.
However, conflicts occur when two users simultaneously change the same line of code. Say I change a line of code in my working copy and my friend in the US changes the same line. I make a commit where I change the line, so I commit it to my repository, and then I push it to the central repository. If someone else has touched the same line of code, they can still make a commit, updating their own repository. But when they then try to push their version to the central repository, it creates a conflict. They are not allowed to push unless they have pulled in the changes that I made, updated their working copy, fixed the conflict, and committed the fix. Then they push their commit plus the fix to the central repository. It's not really difficult, it's just a step-by-step procedure to fix these conflicts. And so manual intervention is required to resolve conflicts.
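The per-line rule just described can be sketched in a few lines of Python. This is a simplified illustration, not how Git or Subversion actually implement merging: the function name merge_lines is made up, and it assumes that all three versions have the same number of lines, so it ignores inserted and deleted lines.

```python
from typing import List, Optional, Tuple


def merge_lines(base: List[str], mine: List[str], theirs: List[str]
                ) -> Tuple[List[Optional[str]], List[int]]:
    """Merge two edited copies of `base` line by line.

    Returns the merged lines and the 1-based numbers of conflicting lines.
    A conflicting line is left as None until a human resolves it.
    """
    merged: List[Optional[str]] = []
    conflicts: List[int] = []
    for number, (b, m, t) in enumerate(zip(base, mine, theirs), start=1):
        if m == t:        # nobody edited, or both made the same edit
            merged.append(m)
        elif m == b:      # only the other user edited this line
            merged.append(t)
        elif t == b:      # only I edited this line
            merged.append(m)
        else:             # both edited it differently: manual fix needed
            merged.append(None)
            conflicts.append(number)
    return merged, conflicts


base = ["a = 1", "b = 2", "c = 3"]
mine = ["a = 10", "b = 2", "c = 3"]          # I changed line 1
theirs = ["a = 1", "b = 2", "c = 30"]        # my colleague changed line 3
clean_merge, clean_conflicts = merge_lines(base, mine, theirs)

theirs_clash = ["a = 11", "b = 2", "c = 3"]  # colleague also changed line 1
clash_merge, clash_conflicts = merge_lines(base, mine, theirs_clash)

print(clean_merge, clean_conflicts)
print(clash_merge, clash_conflicts)
```

In the first case both edits are kept automatically; in the second case line 1 is reported as a conflict and someone has to pick a version by hand.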
So as long as no one is touching the same lines of code, everything will be merged automatically. You just pull, make a new commit and push it online, and the repository will merge the changes from hundreds or thousands of users, unless some of these users start touching the same lines of code.
All right, so merging changes. In a centralized system, update changes your working copy by applying any edits that appear in the repository but have not yet been applied to the working copy. In a centralized version control system you can update at any moment, even if you have locally uncommitted changes. But if you update in a centralized system, you get all the changes that other people made, which might introduce conflicts into your working copy. Then you have to fix the conflicts and make a commit that resolves them. In distributed version control, if you have an uncommitted change in your working copy, you cannot run the update step, because you need a stable working copy before you can get the changes from someone else. So before you're allowed to update, you must commit any changes that you have made. After this you can run the update, which can then create conflicts, and then you have to merge the two sets of edits and commit the result. The tool just tells you: for this line of code you wrote this, the other guy wrote that, which one of the two is the truth? And then you say take my version or take his version, you delete the other version, and you commit the resulting code.
All right, so this is more or less how I work. When, for example, I'm working on my web server, which lives here, the first thing that I need to do is get a repository and a working copy of the project. I only have to do the first two steps once.
So I say git clone and then the address where my repository lives. I can then go into the directory, and from now on I have my own copy. Then, first, get any changes made by others: when I start working in the morning, I go to my own copy and I say git pull, and git pull is a pull and an update. So I pull in all the changes that everyone else made overnight. Then I start working, and working means I repeat the following steps. I make some local changes, so for example I add a feature, change some lines of code or fix a bug. Then I examine the changes. I can do that using git status, which lists which files have been changed. I can then do a git diff with the file name to show what has changed in that file. I can then add the changes that I made with add, so I can say git add file1, because it might be that I changed code in file one. Then I create a meaningful commit message, saying git commit -m "I added a new feature". Then I update my version with the changes pushed by others in the time that I was working, so I do a git pull, and then I publish my changes with a git push.
Every time I do a git pull, I might run into a conflict. Other people might have worked on file one as well, and as soon as they hit the same lines of code that I hit, then at the point where I do the git pull I have to make a merge and fix any conflicts that occur. Then I push my changes to the central repository, meaning that from now on other people, when they do a pull, might have to fix the conflicts that they introduced relative to my changes. And of course this is just something that goes round and round. I do this sometimes 20 times a day, sometimes 30 times a day, sometimes zero times a day. It depends on how much I'm coding.
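As a summary of that loop, here is a small Python sketch that only builds and prints the Git commands in the order described, without running them. The function name work_loop, the file name file1 and the commit message are placeholders for illustration; the Git subcommands themselves (status, diff, add, commit -m, pull, push) are the ones mentioned in the lecture.

```python
import shlex
from typing import List


def work_loop(files: List[str], message: str) -> List[List[str]]:
    """Return the commands for one pass through the edit/commit/push loop."""
    commands = [["git", "status"]]                         # which files changed?
    commands += [["git", "diff", name] for name in files]  # what changed in each?
    commands.append(["git", "add", *files])                # select the changes
    commands.append(["git", "commit", "-m", message])      # record them locally
    commands.append(["git", "pull"])                       # get others' changes
    commands.append(["git", "push"])                       # publish my commits
    return commands


# Start of the day: bring in whatever other people pushed overnight.
print(shlex.join(["git", "pull"]))

# After making some local edits to file1 (placeholder file name):
loop = work_loop(["file1"], "Added the implementation of feature X")
for command in loop:
    print(shlex.join(command))
```

If the pull step reports a conflict, the loop pauses there: fix the conflicting lines, commit the fix, and only then push.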
But the way that it works is: I do a git pull to get all the changes that other people made, and I go through this loop every time. And the other people do the same, so they just go through the loop too.
All right, so some best practices when you're using version control. Use a very descriptive commit message. It takes a moment to write a good commit message, and this is the thing that most often goes wrong. People say git commit -m "updates", and then no one knows what got updated. Commit messages are more or less your documentation of what you have changed in the code. A good commit message is "Added the implementation of feature X", or "Fixed bug number 16", or "Fixed the bug as reported in this forum post", these kinds of things. They have to be logical, concise and short. This is useful for someone who is examining the change, because it makes the purpose of the change clear.
Each commit should be a logical unit. You shouldn't change a hundred files and then just commit all of the changes in one go. No, commits should be small. Say I add a new file, and to import this file I have to change one or two other files: that is one commit. A commit should be as small as possible and should be a logical unit, so I'm either fixing a bug or adding a new feature. I don't add a new feature and fix a bug in the same commit. That's not how this works. You go step by step. So you should avoid indiscriminate commits: do not commit all your changes at once. If you have changed a hundred files, first separate out the logical units, like: I fixed the bug in file number one, I added a new feature in file number two, I changed the wording of the documentation in file number three. All of these are individual, separate commits.
And incorporate other people's changes frequently, since this prevents conflicts.
So every time that you start working, and while you are working, especially if you're working across different time zones, make sure that you get the changes from other people as often as possible.
Which also brings up the question: do you make branches for each individual update or fix as well? Yes and no, it depends. For my web server, I have a single branch called development, and the development branch accumulates changes relative to master. When I am happy and have a new version which I deem to be stable, this branch gets merged into master. So for my web server I only have two branches. But for my 3D engine I have four or five different branches, because sometimes I'm working on a feature which I might not complete. For example, Android support for the 3D engine was its own separate branch, which got merged back when the Android support was completed. In the meantime I also made other changes, like bug fixes, which went into their own branches. So it really depends on what you are working on. And especially since, for example, the web server is available online and people can follow the repository, I try to never push directly to the master branch, because if I make five commits, people would get five emails a day about me changing something. That's why I use a development branch where I accumulate all kinds of bug fixes and new features, and then I merge it back into the main branch in one go. I also rebase, so that there's only one big commit that other people see and not the hundreds of little commits that I make. Does that answer your question? So it depends on the project. If I'm working with other people, I generally work in branches. If I work on my own, I sometimes work in branches, depending on the software and on what I do or don't want to add.
For example, my HU Berlin repository, which holds all the code that I wrote at the HU, has only one branch, just master, and all commits go directly to master, because I'm the only one working on it and I don't accept changes from the outside.
All right, another best practice is to coordinate with your co-workers. I used to work at the UMCG on MOLGENIS, and MOLGENIS had something like 17 programmers. When we came into work in the morning, we would just sit down and have a little chat: what are you doing today, and you, and you? That way we did not start doing the same thing or working on the same bug. That sometimes happened: someone was working at home and fixed a bug, and I was working and fixed the same bug, and we only found out when we tried to commit the changes, because then one of us was first.
Version control tools are line based. They work on text files. So you should never, ever add binary files like Word documents or PowerPoints. Images, well, images you might, but in general only put text files under version control. That is what it is there for.
Do not write excessively long lines, in commit messages or in code, but in commit messages it is especially important. As a general rule, try to keep each line to a maximum of 80 characters. This 80-character limit stems back to the 1970s, when we were all still working on something like DOS, because the screen would only show 80 characters and the rest would either go off screen or start wrapping around, which is really, really annoying.
Never, ever commit files which are generated, because they will change every time you run the generator. I put this in because MOLGENIS used to generate a lot of code. It is a software package which generates a web interface based on a database structure.
And I made this presentation years and years ago, when they were moving from SVN to Git, to explain to them what the differences were between centralized version control, which they were doing, and distributed version control, which was kind of the future at that point. So version control is intended for files that people edit. Generated files should never be under version control. And never commit binary files that are the result of compilation, because that's actually a double no-no: you're committing a binary file, something like an EXE or a DLL or an SO or a dylib, and it is generated. These files have no business being in version control, because they are not line based. In a binary file, everything is in theory on one line, so if you change a single byte, the whole file has changed. You're just making the repository gigabytes big in the end.
All right. Often your version control system can ignore certain files. In Git this is done with the .gitignore file, and you can just add the extensions that you would never want to commit. Let me see, I can show you an example here. I have a .gitignore file here. This is my web server.
From the chat: "I know this is in no way linked to biology, but we do add generated files to version control. Our CSS gets compiled from SCSS." Yeah, you shouldn't do that. You should just have the SCSS files under version control. Same for minified files, they shouldn't be in version control. Generated files have no business being in version control, because you can regenerate them at any time, and the output might differ depending on which tool you use. If you use a different minifier, the file would look completely different even though the content is the same. So generated files should not be there. The repository should only contain the text files that people edit.
But back to the .gitignore file. So I have a .gitignore file, right, and this ignores the websites that I'm hosting.
I still have crossland. But also the executable files, like the test executable. I don't want any of the certificates in there either. Of course you should never commit certificates or keys: if you have a public key and a private key, you should not commit the public key or the private key. And you should also not commit passwords. Because my server is running HTTPS, I have a server certificate and a server key, and of course I never, ever want to commit them by mistake, because that would blow up the security of my whole server. So I put them here to make sure that never happens. And these are my generated files. My generated files are .in files, which are input to the web server, plus error files, list files, log files and definition files. Never, ever commit those, and also don't commit any QR codes that the website generates, because those should not be in there either. And I never put my Let's Encrypt script in there, for security reasons. You can create these ignore files yourself. So do tell your version control system to ignore the kinds of files that it should ignore. Add a filter so that you cannot commit these types of files.
And never force it. If a version control system refuses to do a certain action, for example pushing to the remote repository, you should figure out what is wrong. You should never, ever force push your local version to the remote, because this breaks everything for everyone. So never, ever do that. Never force push in a version control system.
All right, I hope you guys are not too worried about the fact that we took 10 minutes longer. Just remember that the lectures used to be four hours instead of three, so you're already lucky. But today we talked about DNA metabarcoding.
We talked a little bit about PubMed, about MEDLINE, about Web of Science, Google Scholar, ResearchGate, the h-index, the i10-index and scientific reference managers, and about the fact that citations are there to keep people honest and to attribute who invented what. We talked about reference managers like EndNote and Mendeley. And I told you a little bit about version control and that you should use it. Of course, if you have any questions about version control or these kinds of things, just let me know. For tools like Git, Subversion and Mercurial there are a lot of tutorials online which you can read. I just wanted to give you a little idea of why you should use it, because reproducible research means you have to stabilize things and you have to be able to go back in time when you want to redo the same analysis. "It was a very interesting lecture. I have a new view now on citations. Thanks." I will stop the recording for the people that are watching.