So welcome everyone to this workshop on computational reproducibility, where we're going to focus a lot on why code doesn't reproduce. It's based on what we've been working on at the journal I'm editing, and the person who is actually going to do most of this talk is our computational reproducibility checker and journal assistant. So, yeah, next slide.

Today we're going to talk a little bit about our computational reproducibility assessment, or what we like to call our reproducibility checks. We're going to talk about common errors in computational reproducibility, and that's going to be based a lot on the errors we see when checking the submissions we receive, and we're going to look at some of the best practices we have discovered for how to make an analysis reproducible. Then we're going to have an exercise, which will be about an hour; you will have some time for a little coffee break during that as well. Our intention is for you to work in pairs, to team up and work hands-on with seeing if you can get things to reproduce. And then we're going to have a joint discussion about the challenges and what we learned from that part. So that's the full two hours.

Before we start with the presentation, I think it's good to have a quick overview of the exercise, so you have it in mind during the talk. After the presentation we want you to actually look at reproducibility for some submissions, some articles. So keep in mind that we want you to try and reproduce code from papers that have been accepted for publication in Meta-Psychology. We want you to note down what causes trouble in reproducing the paper, and also note down what you find that is actually correctly done and working well. And then finally, if you have time, compare the results from the code output to the reported results in the manuscript.

Okay, so the next question is: why do we even care about this? Why is this important? Well, as Nosek talked about during his keynote speech, transparency is a way out of the replication crisis. But transparency only allows us to assess the reliability of research findings; it doesn't automatically mean that everything will just start working. So we're doing a lot of work towards transparency and open science, but in the worst case that can mean that we're showing everyone how unreliable our science is. That can, as he talked about, reduce trust in science, or at best leave it somewhere in the middle. So it's also very important that we use this new transparency and openness to take the time and actually assess our research findings in more detail, now that we actually can.

So far, much of the focus has been given to replication. Replication is when you do the same type of analysis but on a new dataset: you repeat a study, collect new data, run everything through, and then you see if you get the same results. And as you understand, there are many reasons why a replication might fail. It could be a false positive, or it could be a problem with the analysis in the original study, and so on; there are many steps involved. But of course that's a very interesting end point, so it made sense to have a lot of focus on it. That's why we often call it the replication crisis rather than a reproducibility crisis. Also, quite a lot of focus has been on robustness.
Robustness is the idea that if you do a new analysis on the same data, you should get very similar results. This is where the whole garden of forking paths, p-hacking and all of that comes in. Look, these people have done this specific analysis in this paper, but when we do something slightly different the results are no longer significant, or when I apply Bayesian methods I come to a different conclusion. There are some cool projects there, like for example the Many Analysts project and similar, where we really go into detail looking at how much results vary, how robust our results are. And among the more extreme examples we have multiverse analysis, where you essentially try to run all defensible types of analysis on the same data. So we have the same data, but we do new analyses on it, and if the results are fairly similar, then we believe that our findings are robust.

But so far, I would say that less attention has been given to the most basic part of it: that our data analysis and reporting should be free from errors. And this can pretty much be summarized as computational reproducibility. Computational reproducibility is, strictly speaking, that assuming we have a data set that is correct, someone should be able to apply the same analysis to it and come to the same conclusion, reach the same results as before. Then we have computational reproducibility. It is not about checking that the data are correct per se, but data processing, cleaning and so on do fall under computational reproducibility. We're not looking at the veracity of the raw data, the primary data or the source data. So things like research fraud or research misconduct, where people have fabricated data, do not fall under this. However, things like screening for outliers or removal of outliers definitely fall under computational reproducibility, because if that is transparently done and someone can reproduce it, then we know that, well, you got this result because you removed two outliers, and that's not necessarily a problem, right?

And if we look at things like questionable research practices, where someone has just been data dredging and gone star hunting to try and find what's significant, these practices aren't necessarily so problematic if we can fully reproduce all the steps. It is problematic if someone has run tons of analyses and only reports the one that worked in the end, but if we know how they came to it, we can actually assess it. So having full computational reproducibility is also a way to tackle things like questionable research practices. We gain quite a bit just by checking that we can follow all the steps from a data set to what is reported.

And despite that, this is something fairly simple compared to going out and doing all new studies, I mean conducting new replications all the time. That is perhaps feasible in, say, social psychology or decision-making research, where you have simple vignette studies, the "Linda is a bank teller" research, trolley dilemmas or something like that; then it's no big deal, we can just repeat the experiment and see if we get the same results.
But if you have done a classroom intervention across 30 schools, for example, these are very expensive projects, or you might have been studying students who have learning disabilities and so on, which are populations that are hard to study, and this data is precious; then it's super important that all analysis on it actually turns out to be correct. So as we move into areas where data and experiments are very precious, which I would say is the case in education much more than in, let's say, social psychology, where a lot of the replication work started, then maybe we can't rely as much on replication, but we have to really make sure that what we actually reported works. This is where computational reproducibility checks come in, but at this point they are still very, very rare. Next slide, yeah.

And this is where we think we are being pioneering at the journal Meta-Psychology. Meta-Psychology is ranked number one on TOP Factor. I'm not sure if everyone knows about TOP Factor, but it's kind of like an alternative to the impact factor, where instead of citations we look at how transparent and open a journal's practices are. So we look at things like data sharing, pre-registration, reporting guidelines and everything like that. There are several hundred journals ranked there, most of the social science journals, but also journals like Science and Nature. Every journal gets audited by a team connected to the Center for Open Science; they go through all the journals and see what kind of open science practices they are enforcing. We are currently ranked number one on that, but there are some other good journals very close behind, journals that can be relevant for educational research as well. And one of the things I have recently started to look at is whether computational reproducibility checks are actually being done.

Meta-Psychology is a community-run journal. This means that we are not owned by an organization or a society, and we are also not owned by one of the bigger publishing houses; we are not connected to Elsevier or Taylor & Francis or Springer Nature or anything like that. Instead we are owned by our editorial board, and it's a journal that is connected to the community and to people who are invested in trying to make research more transparent and increase research quality in this area. We don't have any article processing charges and we are open access, so it's free to publish with us and it's free to read. In general we publish anything in psychology, and by "meta" we mean something very wide. Jokingly, we can say that we publish everything that most journals don't typically publish. The bread and butter of the most common journals, the typical empirical study, is what we don't publish, but we publish everything else: reviews, commentaries, replications, metascience, perspectives and so on. And actually, one of the things we are going to start publishing very soon is something called verification reports, where you can take a look at one study that is published. It doesn't have to be published in our journal; it can be published in another journal.
So let's say you find a very, very important study, for example a large-scale classroom intervention, published in, let's say, a top journal in the field, and it's highly cited; it's very important for the field to know that the data, the reported results and the analysis are correct. Then you can actually submit that as a verification report, which we would very much like to keep in the registered report format, and then you basically audit that study to see whether it is computationally reproducible. So this is actually a way to get a published article out of the kind of checking we are doing routinely; if that counted, you would have been a very, very highly published author by now. Of course, this is not something we are prepared to do for other journals' articles routinely, because this is really a task the journals should do themselves, but for important studies it is interesting to do these verifications for other journals as well.

And educational psychology is an area that we think is important. I mean, I'm editor-in-chief at this journal, and one of my main areas right now is educational psychology; my co-worker is presenting later today on Visible Learning, you know, John had this big project, and we are looking into whether that is reproducible and whether it is correct and so on. So educational psychology is something we are really interested in getting more into and publishing more of this meta content on, so this is a little bit of an advertisement for the journal as well.

But in this talk we are going to focus on our five years of experience with doing computational reproducibility checks at Meta-Psychology. And let's see, you have been working with us for two years now, right? A year and a half, maybe. Yeah, and you have been doing quite a lot of reproducibility checks during this time, because you kind of came in when we had a huge backlog and so on. So you are the person who has done most of these reproducibility checks at the journal in this time frame.

So I'm going to talk a bit about what I do at the journal, mostly what the process is like and what my experience has been so far. I think I've reproduced maybe 10-ish papers so far, maybe even more. Just an overview of the articles that we have reproduced: since the journal started, there have been 47 articles that were reproduced, and by this we mean we reproduced all papers that actually had something to reproduce. We also publish commentaries or theoretical papers that don't have analyses in them, but every paper that has something to reproduce, we have done a reproducibility check on. Last year we had 13 papers that had to be reproduced, and here is a bit of statistics on these. Papers that reproduced without any error, meaning I was able to find all of the data and materials easily, just click and run, and it would reproduce without errors: there was one study like that. There was another one that didn't have any errors either, but it was a large simulation study, so we had to change the number of iterations; so that one still didn't have errors, but we had to change it a bit to actually run it.
For errors in code, there were eight papers that had errors in code, like missing library calls or some kind of problems with knitting the markdowns. We even had environment-snapshot packages like renv that caused issues with loading some of the packages, and there were a lot of issues with deprecated functions. So there is almost always some kind of code issue when running things, especially if there are a lot of analyses and a lot of different code scripts. And then we also had errors in reported results, which means that the output didn't match what was reported in the paper; those were found in six papers. Those are usually really small and don't change the outcome or the interpretation of the paper: a wrong sample size, maybe a number up or down, a wrong mean; again, it's usually not a large difference. There are some plot output discrepancies, and usually it's when there are copy-paste errors or when a package updates and then changes the output. But all of these are errors that happen even though we don't have the problem of things being missing, and we do all of this in collaboration with the authors. So even when we have all the data and materials, we know what we need to do, and the authors are willing to cooperate, there are still issues and we still sometimes struggle to reproduce these papers, especially when the analyses were done a couple of years back; then a lot of things go wrong when trying to reproduce them a couple of years later.

Since, I believe, 2021, there has been a reproducibility checklist included in the journal's process of doing these checks. Before that it wasn't really standardized how reproducibility checks are done; it was usually some kind of document that the reproducibility checker would write out with all the errors they found, and then we switched to a more standardized checklist that is easier to follow and easier to do. It usually states who reproduced it and their contact info, in case anyone is interested, and we also state the ID of the submission. Then we check if we can actually load in the data the way the authors have presented it, and whether the data are archived in an archival format, so not, for instance, an SPSS file but an actual CSV file. Then we check if we can find the code to reproduce the figures, tables and in-text numbers. We check if there are any requirements for reproducing the paper that we didn't know of, for instance whether we needed stronger computers to run the simulations, or how long it actually takes to run. Then we state which software we used, and we also state our computing environment, because this is one of the biggest issues when it comes to reproducing (see the small sketch after this paragraph). We also describe whether we were able to confirm that everything reproduces, and then we comment on where we found issues, whether the authors actually fixed them, how they fixed them, and any other comments we have. These checklists are published alongside the peer review reports, so once the paper is published, anyone can go and see the checklist and see what the issues were. There is also a peer review report at the end stating whether everything in the paper reproduced, whether there were any issues along the way, and how the authors fixed them. It is necessary for authors to fix issues for the paper to actually be published, although sometimes these things cannot be fixed; for instance, with simulations that would take a really long time to run, we have to change the code. Those are minor issues, but usually the authors have to fix everything, and the code has to run and reproduce.
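One item on that checklist, the computing environment, can be recorded very simply if the analysis is done in R. Here is a minimal sketch, assuming an R-based project; the file name is just an example, not a requirement of our checklist.

```r
# Write the R version, operating system, and attached package versions to a
# text file that can be archived alongside the analysis, so a checker can
# try to match the original computing environment (file name is illustrative).
writeLines(capture.output(sessionInfo()), "session-info.txt")
```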
Our main concern is that at least once, before publishing the paper, we can reproduce the code and the results. We don't necessarily expect it to run for others who try to reproduce it later, because there are just so many things that go into it that it's sometimes really hard.

So, some of the common errors, some of which I have already mentioned. A common one is that people rename files and then don't change the name in the code, and then it's just a bit of an annoyance to find where the files are and try to fix it yourself or have the authors fix it. It's also common that functions don't work for various reasons: not calling in the libraries, the functions no longer exist, or they weren't used or saved correctly in the script, and then they don't work once we try to reproduce them. It's very common that people hard-code file paths; that is easy to change, but it means the analysis cannot be clicked and reproduced in one go. And a big issue is older software versions, especially with R, where updates cause issues.

The common reasons why functions don't work are usually incorrect input arguments, undefined or out-of-scope variables, and missing or outdated dependencies; this is the most common one. It's very common that authors don't load the libraries or state which libraries are needed. Deprecated functions are also a big issue in R. There is also improper syntax or using the wrong functions when authors translate the code from non-open-source software, for instance if the analysis was done in SPSS and they redo it in R so we can reproduce it (because we ask for analyses to be in an open-source format), and then some of the functions may not work properly. Then we have the hard-coded file paths, which is usually the most common reason; this is something that's very easy to fix, it just means the analysis doesn't reproduce immediately. And then we have the software versions: either not stating which version of the software was used, which causes trouble because someone else will commonly have a different version, and it's very hard to get people to use containers to reproduce things, especially if they're not very confident with coding; or, on the other hand, some packages may stop working or stop being maintained, especially rarely used and not commonly maintained packages. Those are the most common reasons for code not working.

Then we have the reasons for results not reproducing, in the sense that the reported numbers just don't match the code output. The most common ones are copy-paste errors, wrong rounding, and simulation-based analyses. The simulation-based ones are especially common when people do bootstrapping for confidence intervals: they often forget to set a seed for this. It's a very minor issue and usually doesn't change the conclusions, but it's something that doesn't reproduce because they don't set the seed. The same thing happens with simulated datasets, although it's a bit rarer that people don't set a seed when they simulate data; it's mostly the confidence intervals. A different thing is wrong rounding and copy-paste errors. This is the most common issue and it's very self-explanatory: it's just easier to copy-paste outputs, and when those outputs change we don't even know why. Sometimes the rounding is so off that it doesn't even seem like the mistake was in rounding; it's as if the number completely changed, and maybe the code changed, but copy-paste errors are very common. (A small sketch of how a few of these errors can be avoided in R follows below.)
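To make a few of these concrete, here is a minimal sketch in R with hypothetical file and variable names, showing explicit library calls at the top of the script, a fixed seed before bootstrapping, and a project-relative path via the here package (one common alternative to hard-coded paths; not the only option).

```r
library(boot)   # load every package the script needs explicitly, at the top
library(here)   # 'here' builds paths relative to the project root, not one user's machine

set.seed(2024)  # fix the random seed so bootstrapped CIs are identical on re-runs

# read the data with a project-relative path instead of something like
# "C:/Users/me/Desktop/data.csv" (file name is hypothetical)
dat <- read.csv(here("data", "example_data.csv"))

# bootstrap the mean of a hypothetical 'score' variable and get a percentile CI
boot_mean <- boot(dat$score, statistic = function(x, i) mean(x[i]), R = 2000)
boot.ci(boot_mean, type = "perc")
```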
Based on these, we have some best practices, and I think these best practices are more about making it easy to find out what the error is when something doesn't reproduce, rather than guaranteeing that it reproduces, because even with the best practices there can be issues; there were some really nice scripts that had a really, really silly problem and still didn't reproduce. So I guess these best practices are more about making the structure so easy to follow that you can see where the error is once it appears.

The first one is the importance of structure. It is not very common to see really nicely structured analyses and projects in general, and one resource that could be useful to psychologists is the Psych-DS data standard. It is a work in progress, but it's good for learning how to name data sets, how to structure folders, how to structure your projects to follow standardized naming conventions and so on. It's one example of how to create projects that are easy to follow and that have standardized names, so that we can see which data sets belong to which analyses and then work out where the code should be to reproduce things and why it doesn't work.

Another thing is file naming conventions. This is also not commonly practiced, and it's a big issue when trying to load in data, especially if files are renamed later on. This is where a lot of the issues occur: data files named differently than in the code, or named without following any convention, so there may be spaces between words that we cannot see, and then it's very hard to realize that this is what causes the error, and it takes a lot of time to find it. Following one of the file naming conventions, and being consistent in how files are named, is the best way to avoid this.

Probably the best way to avoid a lot of the copy-pasting errors, and in general to make a really nice, easy-to-follow workflow, is to use R Markdown or Quarto, some kind of reporting way of writing code, so that we can just click and run and then compare that output with what is reported; that should be the easiest way to get rid of these errors (a small sketch of what this looks like follows below). It is still not fail-proof, sadly; it can cause a lot of knitting errors. For instance, the papaja package, which is used to create reports following APA standards, is a really nice way to avoid copy-pasting errors because it can create APA-style tables for your analyses; however, it might not be the best thing to use if you have more complex analyses, and in that case you have to go and change things, add things, have different scripts that produce the more complex analyses, and then you still have issues that could lead to things not reproducing. Also, papaja requires LaTeX and other software as well, which can cause trouble depending on the computer that tries to reproduce it, so this can again lead to things not knitting properly, or errors, or some of the code not working because it doesn't integrate with your operating system. So all of these practices make it easy for you to go and see what the error could be and see how you actually got to your results from the analysis, but they don't really make the analysis fail-proof to reproduce.
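Purely as an illustration of that click-and-run idea, here is a minimal R Markdown fragment with hypothetical file and variable names; the numbers in the sentence are computed when the document is knitted, so nothing is copy-pasted into the manuscript.

````markdown
```{r}
dat <- read.csv("data/example_data.csv")  # hypothetical data file
```

The mean reaction time was `r round(mean(dat$rt), 2)` ms
(SD = `r round(sd(dat$rt), 2)`).
````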
This is where readme files and codebooks come in handy, and these are almost never available. I realize they take a long time to write properly, and most people think they're a waste of time, but it's so useful to have some things in your readme file: the computing environment, how long it takes to actually compute things (so we know if we need to run it on a server or on our local computer instead), a project description, and which software and which libraries you need to have installed before trying to run it. Also, a codebook that tells you which variables do what and describes them, so we can follow it from the data set through the analysis and actually follow what is being done. In case you don't use something like a markdown to create the results and instead have plain scripts, this makes it so much easier to navigate through the scripts and find which code chunk reproduces which result in the paper. Another thing that is very useful in these cases is commenting the code heavily, saying "this code chunk creates Figure 1 on page such-and-such of the manuscript", so you can go and find it more easily. Usually, just navigating through the scripts and the data sets is sometimes a nightmare, and this also causes issues in reproducing: if you have to go and fetch results from a list in a different object and compare them, like the confidence intervals are in this object but the beta coefficient is in another, and you have to fetch them on your own, that is where you lose a lot of time, it's very hard to follow, and this is where mistakes occur when you copy-paste them into your manuscript.

The last one is making containers, and this would probably be the best practice, because it would actually ensure that everything reproduces. It would be really good if people knew how to use Docker; I have my own issues with Docker. One other way is to use Code Ocean, which is like an online type of Docker, where you put in your code and it will always reproduce the results as they were: it saves the environment, it's online, and everyone can just log in, click on it, and it will always reproduce the same results, because it's just a frozen container of your analysis. This is probably the easiest way to make sure that your code will always reproduce. It does come with some limitations; for instance, I think researchers get around 10 hours a month free to run it. Is that true? I think around 10 hours comes for free if you're a researcher, and then you have to pay for more. Another thing that could be similar to this, but I've had issues with them, are R packages that save the versions of the packages you used. I have found that if such a package no longer exists, it can really break your whole script, so it sometimes causes more issues than it helps. So Docker and Code Ocean are a really good option; the R packages that do this, I'm not so sure they actually help over the course of time. (A minimal sketch of what a container recipe can look like follows below.)

And of course there are point-and-click alternatives to this. If you're not comfortable with coding, you can still have your results be easily reproduced if you use, for instance, JASP or jamovi, because those files can be saved together with the analysis, and you can follow which analyses were done and what was used to accomplish them. You can also see the data sets and how they were transformed, and you can save all of this in one file and trace it back; it's a really nice way of having transparent analyses if you're not comfortable with coding. But even though you have the data sets in the same file, the actual raw data sets should still be provided separately as CSV or something similar.
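As an illustration only, here is a minimal sketch of what such a container recipe could look like for an R-based analysis; the base image, the installed packages and the script name are assumptions for this example, not a prescription for how it has to be done.

```dockerfile
# Base image from the rocker project, pinned to a specific R version
FROM rocker/r-ver:4.2.2

# Install the packages the analysis depends on
RUN R -e "install.packages(c('boot', 'here'), repos = 'https://cloud.r-project.org')"

# Copy the project (data, scripts, manuscript files) into the image
COPY . /home/analysis
WORKDIR /home/analysis

# Running the container re-runs the analysis script (hypothetical name)
CMD ["Rscript", "analysis.R"]
```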
Okay, so these were the main issues that we come across, and we only come across them because we usually have the data and the materials and the help of the authors. If we had to reproduce things we saw in other journals, we might run into different issues. One thing that is a problem is when the data sets come from websites that the authors cannot legally redistribute; in that case, sometimes those links break, or those data sets get updated or changed, and then we can no longer reproduce the results they have. But otherwise these are the most common issues, and the practices above are ways to maybe combat these things.

So, we have a task for you; I have put a link in the chat. Oh, there's a question. Yeah, I believe Docker is more sustainable because it is used more widely in computer science. I don't know about Code Ocean; that's a really good question, because if it is on a website, it might just shut down at some point and everything will be gone along with it, so that's a bit more unsustainable, I would say. Docker, I think, will be around a bit longer, because so much depends on Docker, if I'm getting that right.

Okay, so in the chat I have put the link to the OSF project with three different papers: one had its analysis done in jamovi and two had theirs done in R, and you can try and reproduce them. I have put links in the readme files of each, with the preprint or the manuscript and the links to their data, files and materials, and I have also put in the checklist file, so you can try to go through the process and see if you can actually reproduce these results just based on what they offer, which is usually just the materials or whatever files they have. And you can note down which problems you run into; they could be different problems, and you might have trouble reproducing even the papers that we have published. Then after an hour you can come back and we can discuss what you found.