Welcome to our research symposium, titled "Building a Research Culture for Replicability and Reproducibility in the Social Sciences." My name is Aleksandar Bogdanoski and I'll be your moderator today. Today's presenters are Fernando Hoces de la Guardia, Nick Fox, Olivia Miske, Ana Trisovic and Jade Benjamin-Chung. A quick overview of the agenda: we'll hear four talks today, and we'll have some time for open Q&A at the end with all of the panelists. However, we've also asked the panelists to reserve a few minutes at the end of their block for quick follow-up questions. All four presentations today will address reproducibility and replicability, obviously, but they will approach the topic from several different perspectives. We'll start with Fernando, who will be talking about reproductions, or post-publication audits of computational reproducibility, and how they can be used in the classroom. We'll then hear from Olivia and Nick, who will be talking about crowdsourced post-publication replications and reproductions conducted as part of the Systematizing Confidence in Open Research and Evidence, or SCORE, project. Then Ana will talk about reproducibility from the perspective of data repositories and will share some tips and recommendations on how to make replication materials more reproducible, reusable and extensible. And finally, we will hear from Jade about internal replication, a tool labs can use to detect errors and publication bias before submitting work to a journal. Just a quick note on terminology: you will hear replicability and reproducibility used almost interchangeably today. A common definition of reproducibility is the ability to obtain consistent results using the same data and methods as the original study, whereas replicability means obtaining consistent results across studies answering the same question using different data and/or different methods. As I said, some of the presenters today will use these terms almost interchangeably. There's nothing wrong with this; it just demonstrates the diversity of approaches to reproducibility across different disciplines and scholarly traditions. Finally, before we get started: at any point today, feel free to post your questions in the Q&A box, which you can see at the bottom of the screen, and I will relay them to the presenter at the end of each block. As I said, at the end of each block we'll allocate a little bit of time for open discussion, so at that point please feel free to raise your hand if you'd like to ask your question live, and I will call on you to unmute and speak. In the meantime, please add the questions you would like to see answered, because this helps us stay on top of things and prioritize accordingly. So with that, I'm going to stop sharing my screen and introduce my colleague, Fernando Hoces de la Guardia, who is a project scientist at the Berkeley Initiative for Transparency in the Social Sciences, or BITSS, where he does research on computational reproducibility and also promotes the tools and practices of open science in the domain of policy analysis through the approach of open policy analysis. So Fernando, thank you for joining, and take it away. Thank you, Alex, for the very nice introduction. I will share my screen now. Let me...
I'll put my slides also in the chat in case you want to follow along. Can you see my slides advancing? Yep, looks great. Okay, great. So thank you again for the opportunity to present in this fantastic conference. As Alex said, I will be talking about our project on Accelerating Computational Reproducibility, or ACRE, using the Social Science Reproduction Platform, which is the main output of this project. As Alex said, we are both at the Berkeley Initiative for Transparency in the Social Sciences, or BITSS, where we basically work towards increasing the credibility of the social sciences and their connection with public policy. This is the core team behind ACRE. We have many others contributing, hopefully fully credited at that link, and we are part of a larger organization, the Center for Effective Global Action. The main motivation to talk about computational reproducibility here today is the standard motivation: almost 90% of the talks that I've seen around computational reproducibility start with the Claerbout principle. For those of you who have not heard it, this is basically the idea that we should not think of the paper as the main scholarly output. We should think of the paper as the advertisement of the research, and the entire scholarly output should be the entire computational environment. So instead of putting up the quote one more time, I decided to illustrate the Claerbout principle, which is basically the same idea: typically the manuscript is just the tip of the iceberg, and the scholarly output should be the entire iceberg. So with that in mind, we think there's a big lost opportunity, in the sense that every year, around the world, we see graduate students across fields doing empirical or applied work — taking an empirical labor economics or applied social psychology course, or some similar type of course — where a typical assignment would be to reproduce the results of a paper and possibly test the robustness of those results. So again, each year students around the world go through the motions of doing something like this. They will look at the iceberg; they will check whether there's anything like a reproduction package — is there data, is there code — and in answering that they are generating knowledge: whether or not this reproduction package exists. Then they will generate new knowledge in the sense that they will assess the degree of reproducibility of a part of a paper. They might improve it: they might fix file paths, they might update some libraries, they might write some code or other things. And moreover, they might also test the robustness of the results; in particular, they might confirm whether the results are robust to additional specifications. It's rare that this will be written up into a self-contained paper, but all these pieces are new knowledge that is generated all the time. Unfortunately, at the end of the semester they are buried in some type of presentation or something like that, and they're not preserved. And what we're trying to do with the Social Science Reproduction Platform is to provide an environment that allows students, or reproducers around the world, to record this knowledge. So the framework for how we think about this is that we want to move a little bit away from binary judgments. We understand that it's easy for this exercise to gravitate towards adversarial exchanges.
And we understand also that early-career researchers have incentives to emphasize unsuccessful reproductions, but also that senior researchers, the original authors, have a position of power to deter these types of reproductions. And on top of that, the media focuses a lot on eye-catching headlines, like "things reproduce" or "things don't reproduce." So we don't want to say this. We do not want to say paper X is reproducible or irreproducible. What we do want to say is something more nuanced, along the lines of: result Y in paper X has a high or low level of reproducibility based on several attempts. And moreover, we do want to say when somebody improves the reproducibility of a paper. We want to keep track of that, and we want to allow people to get credit when they do that. So the way we approach this is that we follow this idea, which comes from the SCORE project, of breaking a paper into claims. On top of that, we suggest that each claim is going to be supported by specific display items — tables or figures — and each display item is going to contain several specifications. So we're going to ask reproducers to identify which ones are the key specifications that support the claim they're interested in. And in doing this exercise, the main challenge is going to be around standardization, because this exercise is conducted around the world and across fields. But as Alex was saying, these concepts differ; people use different words to refer to similar things. So basically, together with the platform, we created a set of guides to try to standardize this exercise as much as possible, and we hope that the proposed framework will be helpful for students across fields. So with that, now I would like to show you the result of our proposal for how to conduct these reproductions in a standardized way. For that, I invite you to go to the Social Science Reproduction Platform, socialsciencereproduction.org, or you can follow along with the screenshots that I will be showing you here. I'll just walk you through how an example reproduction could take place. The reproduction exercise goes in stages. We start from selecting the paper all the way through checking robustness, and I'll just give you a brief introduction to how these stages work. So in the first stage, the student or reproducer will either be assigned a paper or will select a paper. We want to distinguish this stage, in which the reproducer identifies whether there are any reproduction materials — what we will call a reproduction package — in order to move forward. At this stage, we recommend that reproducers not even read the paper. They should just check: are there any reproduction materials, yes or no? They also record metadata on the paper, usually the digital object identifier or a unique stable URL of the paper. And with that, we recommend that they move forward. So they will go to the platform, and in this first stage they will find something like this: the option to enter a DOI, or digital object identifier. Once they enter that, the platform will query Crossref or a bibliographic service for the additional metadata and will pre-populate it. Then they will be asked if there is a reproduction package available. If there is, they can move forward.
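As a quick aside before we get to the case where there is no package: for readers curious what such a DOI lookup might involve, here is a minimal sketch in R of resolving a DOI to basic bibliographic metadata through the public Crossref REST API. This is only an illustration of the idea, not the platform's actual implementation, and the DOI in the usage comment is a placeholder.

# Minimal sketch: resolve a DOI to basic metadata via the Crossref REST API.
# Assumes the httr and jsonlite packages are installed; the example DOI is a placeholder.
library(httr)
library(jsonlite)

lookup_doi <- function(doi) {
  resp <- GET(paste0("https://api.crossref.org/works/", URLencode(doi, reserved = TRUE)))
  stop_for_status(resp)
  msg <- fromJSON(content(resp, as = "text", encoding = "UTF-8"))$message
  list(
    title   = msg$title[[1]],
    journal = if (length(msg$`container-title`) > 0) msg$`container-title`[[1]] else NA,
    year    = msg$issued$`date-parts`[[1]][1],
    authors = paste(msg$author$family, collapse = ", ")
  )
}

# Example usage with a placeholder DOI:
# lookup_doi("10.1000/xyz123")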
If there is not, we ask them if they have contacted the authors and if they intend to recreate the reproduction package from scratch. If that's the case, they can move forward through this path, but they can also abandon this paper while leaving a small record — in the sense that they have created knowledge, right? They have learned that a paper might not have a reproduction package, and they have recorded this so that other people do not have to run into the same issue again. So once they have identified a paper that contains reproduction materials, then they can start reading the paper. And this is the part where they will identify the claim — the piece of the paper that they want to reproduce. The claim could be in the abstract, could be in the introduction, but it also could be in the results section; it could be described somewhere deep in the paper. Here the student will identify, according to them, the key part of the paper — in this case, the section enclosed in red — and then they will identify which display item, in this case table two, supports this claim. In this case, table two contains the main specification in column two, and there's an alternative specification in column one. Then they will go to the platform and record, in their own words, what the claim is — what I think SCORE would call a claim extraction — and in addition to describing the claim, they will identify the estimates and the display items behind testing it. So after selecting the paper, they move to scoping, and after scoping they move into assessment. In the assessment, the idea is that now you have read the paper and you're going to start looking under the water, right? You're going to start looking at the entire iceberg that I was showing you. Here the reproducer will go and look into the reproduction materials — in this case, a zip file that they open, and it's lots of folders that contain several files. If they're lucky, the original authors left a readme file that will make their job much easier, because the readme file will usually tell them which key files they should be looking at. In this case, there is a readme file that points to the script that you need to reproduce table two, but once you start looking into the script you will notice that there are several files that it depends on in a way that is not easy to see. So in the platform we will ask you to record how the inputs and outputs relate for the particular script that you're tracking, and then the platform will generate something like this: a reproduction diagram that allows you to see how the different pieces interact in order to reproduce a given display item. And once you have a better idea of how this iceberg looks under the water, you will try to run it. And once you run it, you might run into trouble. It might run all the way from raw data, and depending on the results of that exercise you will assess the computational reproducibility of this specific display item. For the assessment, we have created a 10-level assessment scale, which is subjective, where we have prioritized certain pieces over others.
We think that having raw data, for example, should be what allows you to access the highest levels of reproducibility, and then basically you will be able to record your assessment for a given part of the paper. Not only that, but then you can move on and record improvements. You can also record robustness checks. You can do different types of additions to the exercise, and we want to allow the reproducer to record as much of this as possible. So for improvements, the idea is that reproducers could make an improvement at the paper level — they can improve the readme file, for example — or they could improve the reproducibility behind a specific display item. In doing this exercise, they might also have learned that a good improvement X could be done in the future, but they ran out of time; they can still leave a record as a paper trail for people who come after them. And for robustness, they will be able to increase the number of feasible robustness checks, but they could also justify the reasonableness of a specific test. This will require a little more in-depth knowledge of what's going on in the paper, to justify how good a robustness test is. And there is a connection between the degree of robustness checks that you can conduct and the reproducibility, in the sense that if you have a low level of reproducibility, level one, you will not be able to conduct any robustness checks. As you increase in levels, the set of feasible robustness checks will increase until, by level 10, it encompasses the entire paper and more tests can be conducted. We also see this as the next frontier: after accelerating the computational reproducibility of entire fields, we can move into more systematically testing the robustness of entire fields or subfields of the literature. So that basically defines the entire exercise of doing a reproduction. And this is something that we have seen as a common exercise that happens across different courses and disciplines. And now the reproducer will be done and will complete the reproduction. This is something I want to spend some time on, because I think it's where the value of this exercise comes to light: once you complete the reproduction and submit it, you can make it public — completely public — or you can make it partially public and choose a temporarily anonymous setting, and other options. But once you make it completely public, this is a public object that you can cite, that you can share, that you can exchange to improve upon. So what you will see with a completed reproduction is, at the top, a brief summary. Basically it will tell you the name of the reproducer, when it was done, the paper that it refers to, how many claims they went after, whether they did any robustness checks, their own definition of the claim and things like that, and links to the original reproduction package and to the revised reproduction package. But this is just a summary, and down here you will be able to see the entire reproduction exercise in view-only mode, and you can share it with stable links. And this is where things get pretty exciting, in the sense that you can share it with original authors for feedback, you can share it with your instructors to grade it, and you can share it with other researchers to discuss whether or not your assessments are appropriate.
If you are a graduate student or an undergraduate, you can put it in your CV — maybe when applying to grad school, for example — to demonstrate that you have dived in depth into a paper, or have more in-depth knowledge of certain research, or have gotten your hands dirty with the data and code behind a particular paper. You can discuss it — we have a forum to discuss it — and we hope that you will be able to cite it; we're working to have a digital object identifier associated with each of these reproductions so you can be cited for all this knowledge that you will be generating. So this will be the result of this exercise, and if you're interested and want to use it in some way as part of your class, you can just go to the Social Science Reproduction Platform — I put the link in there — and use it as part of your class or as part of an independent study. As I said, we have created extensive guidance behind this, in an effort to standardize this exercise as much as possible and guide reproducers and instructors going through these exercises, and we have also built a Discourse forum that allows people to discuss and exchange ideas around all these exercises. So I think I went a little bit too fast and I have two minutes left, but maybe we can leave it open for more questions. Thank you. Excellent, thank you Fernando. Well, we actually have a couple of questions coming in. First from David, who's asking about rewards and incentives to do reproductions, specifically for universities and teachers who are interested in encouraging their students, or also for the reproducers themselves. Yeah, so that's one part of the question, and then the second part of the question is whether journals, grant bodies and universities should be offering explicit rewards and prizes for people who are participating — and I believe Anna's question in the chat is related. Yeah, in terms of rewards, basically I think before having this type of standardized way of reporting these exercises, the only reward that you could have was to write a paper and try to get it published, or try to post it in some service — and I forgot to mention the Replication Wiki, which basically attempts to host papers around reproductions — so those would be the paths that you had before. Now we're very excited that if you do a reproduction using the platform, it becomes a citable object, a shareable object that people can refer to when improving on previous reproductions. We hope that this will increase the incentives for doing reproductions and reporting on them, because it basically leaves a paper trail — a digital trail of the work that was done. Okay, the other question. The other question is whether journals, grant bodies and universities should be offering explicit rewards and prizes for reproducers, and then Anna's asking if we have any kind of partnerships or collaborations with journals around this.
Yeah, so one of the PIs behind this project is Lars Vilhuber, who is the American Economic Association Data Editor, and basically the idea is that we want this framework to also be used by data editors in different disciplines and different journals. We see these guides and this platform as a complement to the work that data editors are doing, in the sense that it could be the case that a data editor might not have the resources to check the reproducibility of all the papers that are submitted to a journal, but, for example, they could request authors to submit the scoping section for their paper, and then maybe they could crowdsource the ex-post, after-publication verification exercise, and they can either pay for that or give prizes. Basically, this infrastructure will facilitate that type of exchange. Okay, let's see, we have a comment in the chat. Yeah, Richard is suggesting that we should get reproducers to use ORCID so that they can be recognized on their research portfolios, and also incorporate component DOIs so that the reproductions are related to the original work which is being reproduced — and I think this is something that is in the works. Yeah, yeah. If there aren't any questions, I have a question for you: effectively, what are the minimum skills, knowledge of reproducibility, or experience with conducting replications that are necessary for people to do a reproduction on their own? Yeah, so that's a good question. Basically, the minimum skills — we think that this type of exercise can be conducted by undergraduates, in the sense that you don't need to understand in depth the methodology of a paper, and you don't have to have any specific coding skills, but you need to be able to read the abstract or the introduction of a paper and identify the main finding or claim that is being advanced. If you are able to identify that, that gives you your target to conduct your reproduction around, and with that there should basically be no requirements. You could think that there might be software requirements, in the sense that to conduct certain reproductions you might need to have some software, but the whole point of this exercise is that if somebody cannot run a reproduction because they do not have the specific software, that's important information. It could be that all the scripts are there but they are in a proprietary statistical software; we are not asking people to purchase it, we're asking people to say "I was not able to reproduce it with the current computer that I have" — and that is information to record. In terms of requirements, it's basically somebody who's willing to put in the hours to go and identify the claim and look under the iceberg. Excellent, and I guess we have a little bit more time to answer one more question, from Sophia, who's asking: how does this work with closed-access articles? Yes, yeah, excellent question. So our scale, our 10-point scale — we also present a modified 10-point scale for articles where you cannot access the reproduction materials, and basically the idea is that if there's confidential data, or you cannot access any part of the reproduction package, what we will ask of the reproducer is to report the instructions that are available to obtain those materials.
So it could be a case where the data is confidential and they cannot get it, but there should be a record that says something like: in order to obtain this data you need to contact this person in this ministry, these are the waiting times, these are the fees, this is the metadata, this is the file size or number of variables, and things like that that you will find once you get the data. Excellent, well, thanks for answering all the questions and for staying on time. Next up we have Olivia Miske and Nick Fox. Olivia is a research coordinator and Nick is a project scientist at the Center for Open Science, and today they will be talking about the SCORE project. Nick and Olivia, thanks for joining, and take it away. Great, thank you so much. Can you see these slides coming up okay? Perfect. Awesome, all right, great. Well, thank you so much. I'm super excited to be speaking with you all. Fernando gave a great first talk, and so I'll try to do my best following up. We came up with a very creative name: "Building a Research Culture for Replicability and Reproducibility in the Social Sciences: One View from the SCORE Program." And so I really gravitated towards this idea of building a research culture, right? So, like any good psychologist, I'm going to start this off talking about a French sociologist. Pierre Bourdieu, a French sociologist, wrote this book called The Field of Cultural Production. It's actually a book that I keep on my desk, and it's interesting, right? He was looking at the space of the humanities and culture, from art and literature, and he talked about this idea of reconstructing the space of possibles, where works don't become works of art just by themselves — they become works of art through the collective belief that acknowledges them as works of art, and that can change over time. Which I think is an interesting lens through which to view replicability and reproducibility, kind of where we are now, 10 years after all the precipitating factors of the current crisis, quote unquote, trying to go forward. Let's see. Okay, so the talk is called one view from the SCORE program. So what is the SCORE program? It stands for Systematizing Confidence in Open Research and Evidence. It's a DARPA-funded program, and it is very big, and part of its goal is to look at how to gauge the credibility of published literature. Like I said, it's very big; there are multiple different parts. The Center for Open Science takes up the part on the left side — extracting claims, which came up in Fernando's talk; this is something that we do. We also facilitate replications and reproductions, and those lead to outcomes. And those are used as a kind of ground truth for other groups who are trying to predict the credibility of papers. So we're going to talk from this side, the replication and reproduction side — really the process of the replications and the reproductions. There was a really great talk last week with Fiona Fidler and other speakers about the other side of this project, the forecasting of scientific outcomes, and the video of it is now available as of this morning. So if you weren't there last week, please do check it out. It was very, very good; I watched it earlier today. Something that's really cool about the SCORE program is that it covers a lot of the social-behavioral sciences. It's not just psychology, it's not just one field — it's actually many fields. It's sociology, it's political science, it's management, it's econ and finance.
And as Alex mentioned earlier — and I think my slides are freezing, sorry about this — so as Alex mentioned in the introduction to this symposium, these different areas have different experiences of replication and reproduction, and they also use different language, right? So in psychology, typically a replication means new data, new analysis, and an analysis plan similar to what is described in the paper, whereas in econ and finance a replication can mean taking the same data and the same analytic code to reproduce the values in a paper. And that's okay there, right? There are different experiences in these fields. Bourdieu might call this a habitus based on the discipline, right? The upbringing that you have in a field kind of cements the decision-making processes within the social structure that you're brought up in. The process of doing these replications and reproductions in the SCORE program is very complex. Even at a very high level, there are many moving parts, there are many pieces. Again, Fernando mentioned something similar that BITSS does in identifying original materials from a study, which even gives you this landscape of what kind of projects you can do. Contacting original authors, multiple types of ethical review, power analyses that need to be considered, a pre-registration and revision process, a peer review process — there's a lot. We'll talk about some of these, highlighted here. And I'm going to pass off the first two to my colleague Olivia to talk about. Great, thanks Nick. So I'm going to cover, like Nick just said, the first two parts of our process that he highlighted, and then I'll pass it back to him to dive into talking more about our collaborators and our pre-registration review process. So a preliminary step of our reproduction and replication process is assessing whether original data, code and other materials are available. And there are obviously practical reasons for doing this: reproductions require original data — ideally original data and original analytic code. Original materials aren't necessarily required for replications, but the more original materials available, the better. This also lets us take a look into how researchers share their data and materials, both how they've shared publicly online and how they've shared when contacted for materials. Next slide. And so we started assessing data and code availability across papers within our sample, to get a sense of which papers have data and code available, and how many, and to facilitate our reproduction process, which requires original data and code. This is still an ongoing process, so we haven't assessed all of the papers within our sample yet. To do this, we start by looking through the original papers themselves and looking for links to original data and code. If we don't find any there, we then conduct a web search for data and code, looking at various sources: we do a general Google search and then look through common data repositories like ICPSR or OSF. We also check a few other places, like author websites and lab websites, which sometimes have original data on them. And finally, if we still don't locate original materials, we reach out to the original authors directly and ask for their data and code. As we do this, we assign what we call process reproducibility scores, which are meant to indicate whether data and code are available and also to gauge how easy it is to access the original materials.
So for example, whether materials were linked in the paper, versus stored publicly online, versus not publicly available but the authors sent them to us after we contacted them. This allows us to take a look into how data availability and sharing rates differ across disciplines, across journals, and over time. Next slide. And so what we see so far within our sample is that there are differences between different social-behavioral disciplines in how frequently they share their data, as shown here. And again, keep in mind this is just showing some preliminary trends, as this is still an ongoing process and we haven't assessed all of the papers in our sample yet. But looking across disciplines within our sample, so far it looks like roughly more than half of the papers in economics and political science contain publicly available data, compared to a much smaller share within psychology and sociology — somewhere around 10 to 13% had original data and code available. And there were little to no papers with publicly available data from education and marketing, or behavior and other fields not shown here — though again, we've only assessed a small number of papers from these fields so far. But we do know that some disciplines are ahead of others in terms of data sharing, and some of these differences stem from data sharing requirements set by certain journals in different fields. Another important thing to consider when looking at data availability across fields is that some disciplines rely on existing data sources more than others, and sharing or pointing towards these types of sources may look different than sharing newly collected data from a lab, for instance. And of course, different disciplines also use different types of data sources, and some data is much easier to share than others for ethical or legal reasons. So, for instance, sensitive human subjects data, or data protected under copyright or blocked by a paywall, sometimes can't be shared, or is really hard to share. So of course we have to keep this in mind when looking at data sharing within different disciplines. Something else worth highlighting that we're seeing within our sample is a trend of data availability increasing over time, which makes sense and is encouraging to see. And it will be interesting to look at and compare how data sharing has shifted over time across different disciplines. So while this is encouraging, we still have a long way to go, of course, and overall there's not a lot of data availability within our sample, which is part of our motivation for reaching out to original authors to ask for their materials — which is what I'm going to talk about next. So, next slide. Now I'm going to shift into providing an overview of our approach towards communicating with original authors, and highlight some of our intentions and goals regarding author outreach. During SCORE, we try to involve and seek feedback from original authors at multiple stages of the process, and I just want to highlight some of our main motivations and some of the benefits of doing so. One reason is to communicate our intentions and plans for the replication or reproduction before the project is underway, so the authors are informed and have an idea of what's to come. Then there are, of course, practical reasons, like receiving the original materials necessary to conduct reproductions and good-faith replications. Another big motivation is to receive crucial feedback on our replication protocols.
And we might also be able to get insight into detailed information about the original study that the original paper doesn't specify. Additionally, communicating with original authors might also help build norms around assessing the reliability of, and supporting, prior work. So the more that this work is done and talked about, the more it's normalized, and potentially even taken into practice within the research community. And finally, this may also help foster collaborative relationships. For example, right now we actually have one research lab collaborating with the original authors of their replication and working on a publication together. Next slide. I'm trying. Great, thanks. So now I'm going to provide a quick overview of our approach towards communicating with original authors. As I mentioned, we try to keep authors in the loop throughout the process by reaching out at several key stages of the project, with the goal of maintaining author awareness and facilitating several opportunities to provide feedback at different stages. So our first outreach occurred once a claim was selected from a paper, so that we could receive feedback on the claim selected. Then, if their paper was randomly sampled to be replicated or reproduced and sourced to a collaborator, we contacted them again at that point to let them know and to ask for original materials. And then we contacted them again once the replication protocol was complete, to give them an opportunity to provide feedback on the protocol — and Nick will dive more into that process in a little bit. Next slide. Throughout this process, we were pleasantly surprised to find that many authors were cooperative and helped with the process by providing feedback or sharing materials. Some were also very enthusiastic about the project and were more than happy to be involved, which was really refreshing to see. We were of course happy to see responses like these when we received them, and it was very encouraging to see lots of interest and even excitement in support of this work. But unsurprisingly, many emails also went unanswered, which was to be expected, as academics are quite busy; roughly 50% of our emails were responded to. And obviously not all of the responses were like these — overall, the responses were pretty mixed. Of course, not everyone was excited to potentially have their research replicated. Some were skeptical, and a few even told us we wouldn't be able to replicate their findings before knowing how the replication would be carried out, or even whether their study had been selected for replication. But overall, we were encouraged by responses like these and are very grateful to all of the original authors who took the time to send materials, provide feedback and participate in the process. Next slide. And it seems as though enthusiasm for replication is growing and it's becoming more of a norm in practice — or at the very least, it's becoming more of a normal concept for researchers to think about as it gets brought up more and more. So as I mentioned earlier, communicating with original authors is not only good practice for improving the quality of our replications, and necessary for practical purposes like receiving original materials, but it may also help foster norms around replication within the research community. And with that, I'm going to hand it back over to Nick to talk more about our collaborators. Awesome, thank you so much, Olivia.
So if anyone caught it in one of the earlier slides, one of the steps that I had laid out rather coldly said "sourcing replicators and reproducers." And I think it's phrased that way partly because it fit in the box, but really what we're doing is building and fostering a community, right? It's hard to do the kind of project that the SCORE program aims to do without a lot of community support, and I think every actor working in the SCORE program would wholeheartedly agree. And so building and fostering that community is really important, not just for the outcome of the SCORE program, but really to build this research culture that more wholly embraces replication and reproduction. There we go. So, mirroring the areas of study in the social-behavioral sciences, our collaborators look similar. We have a lot of psychologists, but we also have sociologists, political scientists, management scientists, people who work in economics and education, and they also have very different interests in the project, right? So you have people who want to do a fully fledged replication or reproduction, but maybe they don't have the lab to run participants, or maybe they don't have the time to put into it. And so there are many places in a culture of replication and reproduction where people can help: pre-registration editors and reviewers are very common, and people are very interested in working on this; claim extraction, which is kind of upstream of what we're talking about here; people to help with understanding what generalizability means in this context; even people who are interested in doing the clerical work of making sure that materials to be submitted to local IRBs are complete, right? Something that feels like paperwork and desk work, that is maybe not as shiny or fancy as doing a replication or reproduction — there are people who are interested and who have the time and capability to do it, and that helps the process. And so seeing the group of peers and collaborators and community members, and all the different places where they can pitch in and lend a hand in doing these types of projects, is really important to being successful and to making that culture change. People also have many talents, which is extremely useful. So there are experts in certain types of methodology and domains, obviously, and in both statistical expertise and software expertise. This came up in the chat, I think during Fernando's talk: access to certain statistical software might be a barrier to doing certain types of projects. And so having a very large and diverse group allows you to find the people who can pitch in and do certain things. When the group is small, missing the one person who has that skill could be a problem, but when the group is very big and everyone's pitching in, that can help for sure. So this heterogeneity among collaborators — while there are some differences in language use and expertise — provides a means to reconstruct that space of possibles. We can actually take everyone and all the pieces that they can contribute and reconstruct the cultural environment that we see ourselves in. And I'll try to give an example. So one of the process pieces that we have is that our replicators and reproducers write preregistrations, and then we peer review these preregistrations; it's unblinded and it happens all in real time. We utilize Google documents, and so everyone is in that document at once, working on it.
So what happens is, it's community-driven; the COS SCORE team really just is the air traffic controller. We just make sure that everyone's staying where they need to be and doing what they need to do. The replication or reproduction team writes the pre-registration. We see it, we make sure all the pieces are in place, and we source that to an editor. An editor is a community member, and they then find the reviewers, who are also community members. We invite the original authors, like Olivia mentioned, and the replication and reproduction team is there too. Feedback is provided simultaneously by the reviewers, the editor, and the original authors. And a real key piece of this is that there's a seven-day turnaround — seven calendar days, one week, to have this review happen. And so what happens is that the reviewers leave their comments and the replication team gets a ping — if you've ever worked in a Google document, you know how many emails you can get from the changes. I'm on all of these Google documents, so I see lots of changes. And so those revisions happen in real time, and these conversations can get very long, as you can imagine. Once the review closes, the editor is in a position to make a final decision on whether the replication team needs to continue addressing comments or issues brought up by the reviewers, or whether it's clear to move forward. And the guiding star for that is that it's a good-faith, high-quality version of the project that's being proposed. So at the end of phase one of the SCORE program, which was November of last year, we reached out to our reviewers and asked about the seven-day turnaround. We said, you know, seven days — did that feel like you were being rushed? Did you feel like you were running out of time? And overall they felt that it wasn't overly burdensome. We didn't see many people saying "always" or "often"; most people said it was sometimes or rarely a burden, but often it wasn't. Which is interesting — if my slides progress; let's see, we might have to go back a few. Right, so I think this is a really interesting kind of unintended consequence. If we look at the SciRev project, which looks at the scientific review process, they find that the entire review process for an academic paper — so it's a little different from a pre-registration, but the review process for a paper — takes about 17 weeks on average, and there are differences across disciplines. And what we did — and we've done this hundreds of times now — is that we've been able to successfully get thorough reviews of protocols and methods, everything in a pre-registration, in seven days. We used a shared goal, we had a structured project timeline, and we dialed in incentives — this also came up in a Q&A earlier, so yes, we do have incentives for reviewers and editors. And what we've been able to do is challenge that academic peer review structure en route to building a culture of replication and reproducibility. And I think that is a really interesting signal of this reconstruction of the space of possibles. So I'm going to hand it off to Olivia for the key takeaways. Great, thanks. So to summarize, I'm just going to leave you with some key takeaways and the different lessons learned that we've found throughout this process. So of course, large-scale collaboration is challenging and costly from a coordination standpoint, but very worthwhile.
So as Nick mentioned, it's a very complex project with a lot going on — a lot of different types of projects that we're doing and so many collaborators. So it's very challenging coordination-wise, but it's been so beneficial. We wouldn't be able to accomplish everything that we're doing without the large number of collaborators that we have, from different perspectives, with collaborators contributing to different parts of the process as well, so that we're able to do so much more. Also, incorporating external feedback can be challenging, but it's so important for improving quality and facilitating understanding across different domains — as Nick just talked about with the pre-registration review. So we definitely try to get as much feedback as possible, from pre-registration editors and reviewers to contacting original authors. And the entire SCORE process also allows us to learn a lot more about how different researchers from different perspectives approach things like replication and reproduction, or things like generalizability and robustness. It lets us think about and learn more about how people approach inference, how they interpret different types of evidence, and what they think is important. And finally, clear communication and expectations are key for large-scale collaborations of this nature, which we've definitely seen. And clear communication — and more communication — around concepts like reproducibility and replicability can help normalize these concepts and build a larger culture around replicability. Next slide. And so that is all we have. Thank you so much for listening, wherever you're at or whatever time it is where you are. I also just want to give a quick shout-out and thanks to the other team members at COS who work on SCORE. We have a really great, hardworking team. And of course, these are just the individuals who work at COS on SCORE; we have so many collaborators from all over, and contractors working on this project. So thanks so much, and happy to answer any questions you might have. Excellent. Thanks so much, Nick and Olivia. We have a question in the Q&A box from an anonymous attendee who's asking if you could provide numbers on how many authors you've engaged with to request materials, and whether you observe any trends by discipline in terms of willingness to supply materials that may otherwise not be posted in a repository. Yeah, I can give a little bit of an answer. So we haven't looked at any sort of trends of who answers or who responds. We've reached out to over 3,000 authors at least once, and depending on randomization and how many projects are being done, some authors are reached out to more than once, but over 3,000 have been asked. So we have done a lot of author contact. Okay, thanks. Do we have any other questions? As I said, there's also the option of raising your hand if you want to ask your question live, beyond just posting it in the Q&A box. Okay. Well, I guess maybe we could save some of the questions for the end — hopefully we're going to have five to 10 minutes for general Q&A with all the panelists. So next up we will hear from Ana Trisovic, who is a Sloan Postdoctoral Fellow at the Institute for Quantitative Social Science at Harvard University, where she studies computational reproducibility as well as data provenance and data preservation. Ana, thanks for joining us, and the floor is yours. Thank you so much. Can you see my slides? Okay, one second. Yeah, we can see your slides — close the window. How about now?
Yep, looks great. Okay, perfect. Thank you. Oh. Yeah, sorry. So, yeah, my name is Ana, and this presentation is based on a recent paper that I co-authored with collaborators. I'm going to talk about evidence-based steps toward a culture of reproducibility and replicability — so again, not a very creative title. As we have seen in the previous presentations, there are a lot of different aspects and perspectives from which to approach this problem. The workflow that we are dealing with is something like this: we have researchers who have prepared their paper — they want to publish something, so they've done a study — and then the journal will come back to them and say, perfect, you can publish, but first you need to share your data and code, and here is a data repository that we collaborate with. And then at the end, the researchers share their data and code at the repository and also publish their paper. So this is the workflow we are dealing with. The presentation is going to be organized as follows: first, I'm going to introduce the research data repository; then I'm going to talk about a large-scale study on code quality and execution that I've done with collaborators; I'm going to present the results and discuss them; and in the end, we are going to see what each of these actors — researchers, repositories and journals — can do to facilitate this process. All right. So first, to kick off: who are we? What is our perspective? We are a data repository — Dataverse is one. In particular, Dataverse repositories and the Dataverse project, which is a free and open-source software platform to archive, share and cite research data. So the focus is on data sharing and making data available. The project provides data repository software that can be installed at institutions, and in some cases these installations support research communities for entire countries — this is, for example, Norway and the Netherlands. Currently there are 70 institutions around the globe that run Dataverse installations as their official data repositories, and here they are. However, I'm based at Harvard, and I'm going to talk mostly about the Harvard Dataverse repository. This is our landing page, and this is also the largest installation, actually, because the software itself is being developed here. All right. So what does data sharing look like? Essentially, anyone can share their data with a standalone or institutional account, and here on the right-hand side you can see what this form looks like at Harvard Dataverse. Also, individuals, institutes and journals may have their own Dataverse collections, which are collections of different datasets. Here you can see two journal collections: the Political Analysis collection at Harvard Dataverse and also the AJPS collection, among other journals that have their own collections. And they have their own curated datasets, as in the first of the slides that I showed you. In particular, they would require authors of papers to deposit data and code in these collections within a replication dataset. So what is a replication dataset? It is a bundle of data, code and other files needed to reproduce the published study. And here on the right-hand side we have an example of a replication dataset for one of these journals.
So we can see that there is some metadata, we can see some dataset metrics, and in some cases these replication datasets will have badges, if the journal collaborates with the Center for Open Science. And there is also a list of the actual files — code, documentation, and so on. So here's a quick summary: we've seen that Dataverse data repositories have versatile support for data sharing, and we've seen that research data and code are shared in replication datasets that often belong to a journal or institutional collection. All right. So now, here at Dataverse, we asked ourselves the following questions: how reusable are these replication datasets? Can people easily download them? Can they re-execute the code files? Can they reuse them? That's why we conducted a large-scale study on research code quality and re-execution. Now you might think: okay, code quality and re-execution — what does that have to do with reuse? Well, we expect that if code quality and re-execution rates are high, then we can assume that reuse is also easier, that the datasets will be reusable. However, if code quality and the re-execution rates of code files are low, then we can assume that it is much harder to reuse these files. Right. So this is the study that I've done with collaborators. How did the workflow go? It's important to note that these reproducibility, or re-execution, studies were all automated. In the first step, a replication dataset is retrieved from Harvard Dataverse onto AWS, the Amazon cloud. In the second step, we open the replication dataset and collect data on its content — the code files, the libraries used, and so on. Then, in the next step, we attempt to re-execute the code for an allocated amount of time: one hour per file and five hours in total. And finally, we send this result and the other collected data to a backend database for analysis. So what are our results — what did we find? First, we retrieved 2,109 publicly available replication datasets containing over 9,000 R files. It is important to note that we did this automatic re-execution analysis in the R programming language, because it is one of the most popular at Harvard Dataverse and also because it is free to use. Second, over 94% of the datasets belong to the social sciences, which I thought was important for this session. Another reason for that is that the Harvard Dataverse repository was initially created for the social sciences, though now it is more of a general-purpose repository — any scientist from any field can deposit their code. All right, so the datasets in this study had a median size of about three megabytes, each containing a median of eight files and typically fewer than 15. When it comes to documentation and making sense of these files, we see that most file names were 10 to 15 characters long, which is quite descriptive. We see that around 60% of the datasets contain some standalone documentation, such as a readme file or a codebook. And when analyzing the code itself, we see that comments comprise about 20% of the code, which is also quite positive, meaning that there is a lot of additional information in the dataset itself.
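To make the automated re-execution step described above a bit more concrete, here is a rough sketch in R of the general idea: run every R script in a replication dataset with a per-file time limit and record whether it finished or which error it raised. This is only an approximation of the idea, not the pipeline actually used in the study, and the directory in the usage comment is a placeholder.

# Rough sketch of the re-execution idea (not the study's actual pipeline):
# run every R script in a replication dataset with a per-file time limit
# and record whether it finished or which error it raised.
library(R.utils)  # provides withTimeout()

reexecute_dataset <- function(dir, per_file_secs = 3600) {
  scripts <- list.files(dir, pattern = "\\.[rR]$", full.names = TRUE)
  results <- lapply(scripts, function(f) {
    outcome <- tryCatch({
      withTimeout(source(f, chdir = TRUE), timeout = per_file_secs)
      "success"
    },
    TimeoutException = function(e) "timed out",
    error = function(e) paste("error:", conditionMessage(e)))
    data.frame(file = basename(f), outcome = outcome, stringsAsFactors = FALSE)
  })
  do.call(rbind, results)
}

# Example usage with a placeholder directory:
# reexecute_dataset("replication_package/")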
Right, so as I mentioned, this is a study in R, and there are some conventions in the open-source R community. One of the conventions we can see here on the right-hand side: an R package would have a DESCRIPTION file, a readme file, a license, a namespace, maybe a Dockerfile. So we asked this question: can we find any of these convention files in our examined datasets? In addition to these ones, there are others, such as R Markdown, which is a more descriptive form of R code, an R project file, and install.R, which is a more Pythonic approach to installing the external libraries used in R. So what are our results? We've seen that less than 1% of the studied datasets contained each of these files, except for a readme, which around 48% of the datasets contained. That just means that the community that does research and publishes on Dataverse currently does not use these convention files. All right. We could also see that the most-used libraries in research code were the ones used for data visualization, data wrangling, import and export, and statistical analysis, but we could also see which libraries were not used — and those are the libraries that are actually helpful for software development or for reproducibility. In particular, we could not see any libraries for code testing, provenance tracking, environment management or workflows, which signals that there is a lot of room for improvement in the social sciences, essentially. Okay, so now what happens when we re-execute the code? This is automatic re-execution of the code, and we had two phases in this part of the study. In the first phase, we would re-execute the original code. Then, in the second phase, we would conduct some automatic code cleaning — such as detecting and installing libraries and fixing or removing absolute file paths — and then we would re-execute the code again. So as you can see, some of the files ran out of the box, but we see that library errors were predominant. In the second phase, after the second re-execution, we see substantial improvement, as many of the errors were reduced. However, library errors were still present, and file path errors, output errors and missing files were also prominent. So we can conclude that many code errors could be avoided by capturing library dependencies and, of course, by testing code in a clean environment. We also see that journals with stricter data policies have higher rates of re-executable code. The average code re-execution rate for the journal collections was something around 47%, whereas the overall average was lower. In particular, journals such as Political Analysis, AJPS and CSRN have the highest re-execution rates, and these are the ones that also have the strictest policies — in particular, policies that require either reviewing the code or a verification process that reproduces the study itself. So, in a big summary: we see that few datasets use convention files, and we see no libraries for unit testing, provenance tracking or workflows. We see that simple and automatic code cleaning can result in potential improvements in code re-execution, and this automatic re-execution success correlates with the strictness of journals' data sharing policies.
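As an illustration of the kind of automatic code cleaning just described — detecting the libraries a script uses, installing the missing ones, and stripping a hard-coded absolute path — here is a minimal sketch in R. It is only an approximation of the cleaning performed in the study, and the absolute path shown is hypothetical.

# Minimal sketch of automatic code cleaning (an approximation, not the study's code).
clean_script <- function(path) {
  code <- readLines(path, warn = FALSE)

  # 1. Detect library()/require() calls and install any missing packages.
  pkgs <- unique(unlist(regmatches(
    code, gregexpr("(?<=library\\(|require\\()[A-Za-z0-9.]+", code, perl = TRUE)
  )))
  missing <- setdiff(pkgs, rownames(installed.packages()))
  if (length(missing) > 0) install.packages(missing)

  # 2. Replace a hypothetical hard-coded absolute path with a relative one.
  code <- gsub("C:/Users/author/project/", "", code, fixed = TRUE)

  writeLines(code, path)
}

After a pass like this, the cleaned script can be re-executed a second time, which mirrors the two-phase design described above.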
All right, so now, going back to the actors, the researchers, data repositories, and journals: they would all agree that they want open and reproducible files; they're all on the same page. So let's see what researchers can do to facilitate this process. First, capturing the libraries used and their versions is critical, so they should do that. Library versions should be captured at a minimum by using built-in R functions such as sessionInfo, by using a DESCRIPTION file or install.R, or by using standalone libraries that capture dependencies and their versions. When referring to data, code, and other files, use relative file paths: full file paths cause errors when the code is re-executed on another system. This is something where we've seen improvements; when we remove the full file paths, the code sometimes has other errors, but it also sometimes runs correctly. Then, workflow capture and management tools such as R Markdown and targets will help automate your code and specify the correct execution sequence. For a more advanced approach, use Docker to document your runtime environment in a machine-readable format and to ensure others can recreate your computing environment. In particular, there are specialized Docker containers for R, called Rocker, and I encourage you to check out that effort.

Now let's see what repositories can do, starting with some low-hanging fruit. Of course, having and maintaining good documentation on how to adequately deposit research code is really important, and this is how we've done that at Dataverse. Then, integrations with reproducibility platforms such as Code Ocean, Whole Tale, Jupyter Binder, and others will also facilitate environment capture and the re-execution of research code. Also, facilitating a discussion of this problem, potentially through an internal working group, will help identify community-wide problems, prioritize them, and implement solutions. At Dataverse we have created such a group, and I encourage you to check it out and potentially join some of our discussions.

And finally, what can journals do? As we've seen in the previous presentation, reproducing a study is the gold standard when it comes to publication and to ensuring that a study is reproducible. That is done, for example, by curators at the Odum Institute, which acts as a third-party service for reproducibility; they then deposit the data and code in Dataverse with the badges I showed at the beginning of the presentation. However, if this is not feasible, then a simple review of whether all the files are there is very helpful, and we've seen in this study that it does make a difference in code re-execution. Journals can also create a reproducibility checklist or template for authors; here is an example of one. And integration with reproducibility platforms is also possible to implement at the journal level. We've seen two examples: the journal eLife has a collaboration with Stencila, and arXiv has a collaboration with Papers with Code, meaning that one can instantly access code for any arXiv paper. Okay, so finally, we've seen evidence of both good and bad coding and dissemination practices. We've seen good documentation practices and code commenting, but convention files are rarely used. It is hard to re-execute old code and even harder to reuse it.
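Going back to the researcher-side recommendations for a moment, here is a minimal R sketch of two of them: capturing the runtime environment and using relative paths. The file names are hypothetical, and this is just one way to do it, not a prescribed workflow.

```r
# 1. Record the exact package versions used, alongside the code
#    (packages such as renv can capture dependencies more completely,
#    e.g. renv::snapshot(), but sessionInfo() is built into R)
writeLines(capture.output(sessionInfo()), "session_info.txt")

# 2. Refer to files with relative paths so the code runs from the root of the
#    replication package on any machine
# Avoid:  dat <- read.csv("/Users/me/projects/my_study/data/survey.csv")
dat <- read.csv(file.path("data", "survey.csv"))   # hypothetical file
```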
That said, we do see that curated replication datasets have higher re-execution rates. And it is excellent that we are talking about these problems right now, that there are projects working on this and tools being created. So I'm hopeful, and I believe that things are looking up when we talk about the culture of reproducibility and replicability. And of course, employing the proposed recommendations would further help researchers, repositories, and journals contribute to research transparency and reproducibility. Again, this presentation is based on the findings of the written paper, and I encourage you to check it out for the more specific details of the study. Thank you very much, and I'm happy to answer any questions.

Excellent, thank you so much, Anna. We have a question from Rose in the Q&A box. Her first question reads: why did you decide to use R, as opposed to Python, for your analysis?

Yeah, okay. We used R because there were just really, really many more datasets with R code. For Python there were far fewer; I think there were maybe around 20 datasets that used Python code at the time of the analysis. Interestingly, once I had created this automatic pipeline, I did the same study on Python, and that is published in another paper. Python did not come out looking great either, but R was really good for this because we could do a genuinely large-scale study and get better statistics on what's happening.

Excellent, thank you. While you were presenting, a quick question came in from Fernando: just to clarify, when you say you were successful at re-running the code, does that mean it simply ran, or that it was able to reproduce the same result?

Yeah, so this is a re-execution study, so we were just making sure that the code is not crashing, that it does not fail. However, we did look into a smaller sample of the datasets that were fully re-executable. For the datasets where all code files re-executed correctly, I had a look at what was happening there, and in many of those cases it was actually reproducing the study: the plot or the log file would be the same. So it's not a perfect signal that something is really reproducible, but I think it's a good head start to know that the code re-executes.

Excellent, sounds good. And we have another question from Rose: do you have any plans to attempt to automate the loading of the libraries, say at the time the code is uploaded?

The loading of the libraries, what does that mean? So, I think capturing the libraries would be a good way to do this. For example, I mentioned that we created documentation for depositing research code; we can see that every year we have more and more datasets containing research code, so there are instructions for researchers who have research code to deposit. When it comes to R, the instructions say that the researcher should include an additional file that captures the runtime environment. And because I'm a Python programmer myself, I wrote that having a file such as install.R that captures these dependencies would be best. It's something of an equivalent of requirements.txt.
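For readers less familiar with the install.R convention Anna mentions, here is a hypothetical example of what such a file might contain. The package names are placeholders, and pinning exact versions with a helper such as the remotes package is optional.

```r
# install.R -- installs everything the replication package needs
# (analogous to a Python requirements.txt). Package names are placeholders.
pkgs <- c("dplyr", "ggplot2", "haven")
install.packages(setdiff(pkgs, rownames(installed.packages())))

# If exact versions matter, something like the remotes package can pin them:
# remotes::install_version("dplyr", version = "1.1.4")
```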
So yeah, that is in the documentation for people who want to upload research code.

Thank you. And maybe, Rose, if you have any follow-up questions, you could follow up with Anna directly. There's a question from an anonymous attendee who's asking: given your suggestions for researchers, repositories, and journals, in your view, who is ultimately responsible for making sure that the data is reproducible? This is kind of a one-minute question.

Yeah, that's almost a philosophical question, I think. It's a good question. I think researchers themselves should make sure that their code is computationally reproducible. Journals too, probably. I feel that if each of these actors makes a bit more effort, then all together we will be much closer to the goal of computational reproducibility. So I think the responsibility is shared.

Yeah, it sounds like a little bit of effort can go a long way.

Yeah, everyone does a little bit: researchers do a little bit more, repositories do a little bit more, and journals also do a little bit more, and then I think we would have a much better chance of reaching the goal.

Okay, there is one more. Yes, I've got it: who debugged the code in the case of an initial failure? Were there constraints on what the person debugging was allowed to fix, or was there an open invitation to try anything to make the code executable?

Yeah, that's a good question. The thing is that this whole study was completely automated; everything happened on the Amazon cloud. When it comes to the debugging, there is a code-cleaning algorithm that looks for the used libraries and tries to install them, and that looks for fixed paths and tries to remove them. So there were some things, some of the more common errors, that we felt were low-hanging fruit and that we tried to fix automatically. But there wasn't a person looking at what was happening in each case. I believe all of the files could probably be made executable with manual debugging, but this was a very automated approach. And again, I think it is a good signal that even with automatic code cleaning and automatic re-execution, with just some small changes, things can really improve.

Thank you, that's good. Well, I think we're at time at this point. Maybe we could save this final question for the end, or Anna could type the answer. Thanks again, Anna, for a wonderful presentation and for answering all the questions. Next up we have Jade Benjamin-Chung, who is an assistant professor at Stanford University in the Department of Epidemiology and Public Health. Her primary research is on interventions to eradicate environmental disease, and today she will be talking about internal replication based on her own experience in her lab.

Thanks very much, Alex. Let me get my slides up here. All right, can you see them? Yep, looks great. All right, great. So as Alex mentioned, I'm going to be speaking about internal replication of computational workflows in scientific research.
I'm an epidemiologist, so my case study today comes from that field, but I hope I can persuade you that the methods I'm going to present are relevant to a wide range of disciplines with computational workflows. Here's the outline for my talk. I'll start by defining what I mean by internal replication, and, same disclaimer that Alex gave earlier, I'm using the term replication a little differently from some of the other presentations: what I really mean here is, can you get the exact same answer using the same data and potentially the same code. So I'll start with some definitions, then talk about how internal replication can reduce bias, present a case study of randomized trials on water and sanitation interventions to improve health, and close with some alternatives and implications.

Here's a schematic of how we typically do things in science. We define our hypothesis, collect some data, analyze it, and then we publish our results. Much of what this conference has focused on is additional tools to help improve the reproducibility of the published evidence base. We can register protocols and pre-analysis plans to reduce publication bias and confirmation bias. We can create a reproducible workflow such that a single analyst can re-run the code and get the exact same answer every time; that doesn't mean we've necessarily reduced errors or bias, though, and so I'm going to talk about some ways to do that. We can publish our code and data to increase transparency and support replications by other teams. And we can do what I'm referring to as external replication, just to contrast it with the internal replication process I'm presenting: this is where a team of scientists who were not initially involved in a study use the data, and sometimes the code, to try to reproduce or replicate the answer after publication has occurred. My focus today is on internal replication, which is a process we can go through prior to publishing our study results. The goal is to catch errors upstream, before we publish, to reduce confirmation bias and other forms of bias that arise in the analytic process, and hopefully to end up with a more accurate published literature and a more efficient scientific process.

So here's a schematic, and I'll cover this at a high level and then go into a more detailed example a bit later. There are collaborative steps and independent steps in internal replication. First, ideally you're working off of a pre-analysis plan. Once you've collected your data, you occasionally do need to make changes to that plan, and so the two analysts can get together, discuss the analysis plan, make changes to it, and document them. Then they move forward and begin creating their analysis datasets, working completely independently. Ideally they've agreed in advance on some naming conventions, file structure, et cetera, but beyond that they're not looking at each other's code and they're not having any detailed conversations about their process. They could use completely different processes, and even different software, if they desire. Then they come together, compare their datasets, and assess whether they are functionally the same, meaning: do they have essentially the same columns with the same values, so that they can perform the same type of analysis at the next step. And this is an iterative process.
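As a rough sketch of that "functionally the same" check (not the team's actual code; object and column names are made up), two independently built analysis datasets could be compared like this in R:

```r
# Two toy examples of what the independent analysts might produce
analyst_a_data <- data.frame(child_id = c(2, 1), weight_kg = c(10.1, 9.4))
analyst_b_data <- data.frame(weight_kg = c(9.4, 10.1), child_id = c(1, 2))

# Put both datasets in a comparable form: same row order, same column order
harmonize <- function(d) {
  d <- d[order(d$child_id), sort(names(d))]   # child_id is a hypothetical key
  rownames(d) <- NULL
  d
}

# TRUE if functionally identical, otherwise a description of the differences
all.equal(harmonize(analyst_a_data), harmonize(analyst_b_data))
```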
Usually they come together and find that there are some differences, discuss what could have led to those differences, resolve them, and repeat until they have functionally identical data. Then they move forward and independently conduct the simplest analyses, in our case unadjusted analyses; these could also be descriptive analyses. They perform those fully independently, without looking at each other's code, then come together and see if they got the same answer. In our experience, it really helped to pre-define a threshold. You could say, for example, that you want every point estimate, standard error, confidence interval, and p-value to differ between the two replicators by less than 0.001. We chose that number not because it's particularly meaningful, but because in a publication a difference smaller than that isn't going to matter in the manuscript itself. You repeat this process until all of your results are within that threshold, and then you can move on to your more complicated analyses, adjusted analyses, et cetera. On this slide I actually show tables and figures being created collaboratively, but there's no reason why you couldn't internally replicate those independently as well. I put these two clocks here to indicate that, in our experience, the data cleaning step was very time-consuming to replicate, because it involved the most subjective decisions, which really needed to be scrutinized and adjudicated between the two replicators. The adjusted and more complicated analyses, especially if they involve any kind of stochastic estimation, can also be time-consuming to replicate.

So now I'll talk about how this process can reduce bias, and I'll go over four main categories of biases and errors. Where we're coming from with this approach is the idea that these sorts of biases and errors are human nature. Rather than pretend they're not happening, we should embrace the fact that they are bound to occur in any study, we've all made mistakes, and so we need a system in place to help us reduce these biases and errors in a standardized, streamlined fashion. The first category: we may have a pre-analysis plan, ideally, but it can't possibly cover every single analysis decision we have to make. We may end up making what seem like relatively small decisions over the course of the analysis, but they have a tendency to confirm our biases. Ideally you would prespecify some kind of algorithm for selecting the variables that go into your adjusted statistical models, but perhaps there are exceptions to the rule and you have to make decisions around that. Maybe there are outliers, maybe there is missing data; in most cases you can't completely prespecify how you're going to handle that, you need to see the data you're working with. So again, these are decisions that need to be made. The internal replication process essentially brings all of these small decisions to light, because it's nearly impossible to get the same answer if you aren't making the same decisions.
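To make the threshold check described above concrete, here is a minimal sketch, with made-up column names and toy numbers rather than the team's actual code, of comparing two replicators' estimates against a 0.001 cutoff:

```r
compare_estimates <- function(res_a, res_b, threshold = 0.001) {
  stopifnot(identical(res_a$parameter, res_b$parameter))
  diffs <- abs(res_a$estimate - res_b$estimate)
  data.frame(parameter  = res_a$parameter,
             analyst_a  = res_a$estimate,
             analyst_b  = res_b$estimate,
             difference = diffs,
             replicated = diffs < threshold)
}

# Toy usage with hypothetical results tables from the two replicators
res_a <- data.frame(parameter = c("risk_difference", "std_error"),
                    estimate  = c(0.0312, 0.0104))
res_b <- data.frame(parameter = c("risk_difference", "std_error"),
                    estimate  = c(0.0309, 0.0104))
compare_estimates(res_a, res_b)
```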
And what we found was that by doing this, we basically created a log, and I'll show it to you, of all the things we had to decide on that came after the pre-analysis plan, and even after our update to the pre-analysis plan. The next category is even smaller judgment calls. These aren't necessarily about the analysis: what if we have multiple datasets we're merging and we can't get the IDs to match perfectly, or what if we thought we could analyze a variable on a continuous scale but for whatever reason we need to treat it as categorical, and we haven't prespecified how we're going to discretize it? Lots of smaller decisions like these are going to come up, and again, going through the replication process requires that they be discussed openly and decided upon transparently.

I'll also add that we use a process called blinding, or masking. In the clinical trials world we often think about this as offering a placebo to the control group so that participants are masked: they don't know which group they're in. We also know it's a good idea for the outcome assessor, the person measuring disease status, not to know whether somebody got the drug or the placebo. But we can take this further: our analysts can also be blinded, or masked, to the treatment assignment of the participants in the dataset. We can do this simply by scrambling the variable that contains the treatment assignment, just randomly shuffling it. There are lots of different ways to do that that accommodate stratified designs and matched designs and still allow you to perform your analysis as planned. And even if it's not a trial, you can still do this with whatever experimental condition variable you have. It's really just a matter of scrambling that variable and then performing your whole analysis, all of your internal replication steps, on the masked variable. Once you're done with that, and you most likely have a lot of null estimates, plus some estimates that appear non-null just by chance, you re-run all your code with the original treatment assignment variable and you get your answer. This can really help reduce confirmation bias, because you simply don't know what answer you're getting while you're in the process of coding.

Number three is really simple: we make mistakes, typing errors, it's just a fact. A typical paper can have hundreds, if not thousands, of lines of code, and it's really hard to prevent this. Even with pair programming, where someone is sitting next to you and reading your code, it's difficult for our brains to hold all the detail that goes into a typical paper. So this is, I would argue, one of the main reasons to do internal replication: it's the best way to catch these types of errors and fix them before they get into the scientific record, before they influence policy and decisions. And finally, using different analytic software can sometimes give you different answers, because different software may have different defaults. In our experience, we were using Stata for some things and R for other things, and minor differences in how they handled missing values and rounding did lead to some discrepancies.
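Going back to the treatment-scrambling idea for a moment, here is a minimal sketch of permuting the assignment within randomization blocks so the design is respected. The column names, seed, and toy data are made up; this is an illustration, not the trial's actual masking code.

```r
set.seed(12345)   # arbitrary seed for the scrambled assignment

# Toy data: a block (cluster) identifier and the real treatment arm
trial <- data.frame(block = rep(1:3, each = 4),
                    tr    = rep(c("control", "intervention"), 6))

# Shuffle the treatment labels within each block; analysts work with tr_masked
trial$tr_masked <- ave(trial$tr, trial$block, FUN = sample)

# All analysis and internal replication happen on tr_masked; only at the very
# end is everything re-run with the original tr column.
head(trial)
```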
And coming back to those software defaults, I would argue that it's a good thing to uncover this, because you may or may not agree with how the default was set, and maybe you need to be a little more careful around it; this is a process you can use to check that.

So now I'll talk about our case study: two randomized controlled trials called the WASH Benefits trials, sister trials conducted in rural Bangladesh and rural Kenya. They looked at water, sanitation, handwashing, and nutrition interventions delivered separately and in combination; in this figure, each of the rows indicates an intervention arm. It was a cluster-randomized design with large numbers of clusters and young children enrolled in utero and followed for the first two years of life. We tested three different hypotheses, again in two countries, with many, many different outcomes, and this was funded by the Gates Foundation. In our discipline, at least, these trial results were eagerly anticipated. I was on the analytic team for the trials, and we really wanted to make sure that our analysis process was as free of errors and bias as possible; that's what motivated us to come up with this framework.

I'll give you a schematic of how we went about it. We started with one country, Bangladesh, and the primary outcomes, just the core set of hypotheses, and we did this to establish our workflow. We did a complete internal replication with two independent analysts, developed our preferred structure for the analytic datasets, and replicated everything. That took quite a bit of time, I would say around double the person time it would have taken if we had not replicated. After we completed that, we used our replicated code to create a software package in R that the subsequent analyses could rely on, so that we didn't have to replicate every single step every single time: for the analytic pieces that were repeated, we could use our internal software. We then tested that on the Kenya primary outcome analysis and debugged the package that way. And then for the remaining analyses, all the secondary and tertiary outcome analyses, we used that package. Of course we still had to internally replicate the analysis dataset generation and some of the other steps, but the core hypothesis-testing steps could rely on what we had initially done for the Bangladesh primary outcomes. So there was a really large upfront investment, but then really large efficiency gains for the additional analyses that followed the same overall statistical analysis plan.

We registered the trials on clinicaltrials.gov. For anyone who's familiar with it, it's a great site, but it doesn't hold a ton of detail, so we also published a peer-reviewed article with the study protocol, the rationale for the study, much more detail about the study design, and the statistical analysis plan. But of course that still didn't have the full level of detail we needed for the pre-analysis plan. So once the data were collected and we knew about some relatively small changes that had occurred in the field, we worked on an update to that analysis plan, which we published and registered on the Open Science Framework.
And this, in combination with the published protocol and the registration, was the backbone that we used in our internal replication. So here's a peek under the hood. This was one of our many Google Docs, and it shows how we logged the many decisions, small and large, that went beyond what was in the pre-analysis plan. We had the field staff weigh in, and in some cases the country PI, and these decisions included things like how to categorize responses in the "other, specify" field, what to do if a child's birth dates were discrepant across different datasets, what to do about mismatched IDs, all these kinds of things. They actually can really matter: you would hope that they wouldn't drive a different estimate at the final stage, but you never know, sometimes they can. So we have all of these things logged in a conversational way in our Google Doc. We then used GitHub for version control, and these are, again peeking under the hood, some several-year-old commits at this point, but they show numerous examples of errors that we caught in this process and were able to resolve, again, before publication. They include things like having the wrong subset of the study in a covariate screening we did, different definitions of a categorical variable, results saved to the wrong object, et cetera. These are all things that we caught and fixed. We created an R Shiny application that allowed us to compare all of our results really efficiently. And here's a screenshot of the software package that we made; I have the reference to the paper at the bottom here. I'm not going to go into all the detail, but if you're interested, we came up with lots of programming tips to help increase the efficiency of the internal replication process, so you can refer to the paper. All of these resources are on our OSF page.

I'll close with a few thoughts on alternatives and implications. The choice of whether or not to replicate comes down to several things, and there are obviously pros and cons. I've made the case that the pros are that we can minimize errors and bias, but that doesn't mean there won't be any, especially if you're doing this with a team member who is in your same lab, has the same training, and codes in a similar way; we're all subject to groupthink at times. So again, it's not going to guarantee anything, but it will, I would argue, reduce both. Funders are increasingly interested in concrete steps that we as researchers can take, and this is something I've been including in my funding proposals as a way to increase the reproducibility of my work. Another pro is that it's a great bonding experience: you really get to know the person you're working with, because it's a pretty intense process. But as I mentioned, the con is that it does increase person time, so you do need dedicated resources for this. As for considerations on whether to do it: is this an exploratory study, or do we think that policy could be based on the findings? In the latter case, I would argue we should consider doing this. Do we need to replicate everything, or just the piece of the analysis that we're most concerned is error-prone? Could I replicate part of it myself?
In some smaller projects, I simply don't have the resources to do this by hiring another person, but maybe I can use two different software packages and see if I get the same answer. So there are shades of gray here; I've shown you a more involved version, but there are variations on it. And just to quickly compare this to pair programming, which is where one person is coding and a second analyst sits with them and reads and comments on their code as it's written: the challenge there is that, while it may help reduce some errors, there can be shared biases and reinforcement of certain kinds of judgment calls when people work together instead of independently. There's also a movement toward pre-publication code review: the American Journal of Political Science and some Nature journals are now doing this, where peer reviewers will actually replicate a study finding before it's published. I think this is a great thing, and I would argue it's complementary to internal replication; but of course, from the author's standpoint, it's ideal to avoid having your peer reviewer catch errors at that stage, which is another reason to consider using this. And finally, as I said, the National Institutes of Health, and potentially other funders, are really encouraging scientists to think about what they can do substantively to improve the reproducibility of their research, so I would encourage you to consider this as a tool in your funding applications. And with that, I'll thank my collaborators and funders and take any questions.

Excellent, thanks so much, Jade. I don't see any questions in the Q&A box. Fernando had a clarifying question about how you consolidated differences at each step of the process.

Yes. Yeah, I can share this slide; I moved quickly because I was conscious of time, but since we have a minute, let me show you. So we made this Shiny app. Basically, we created a naming convention for all of our results objects, and this is showing you the risk difference, confidence interval, t-statistic, and p-value. What we could do is filter on which outcome we are looking at, which measure, which analysis, which hypothesis. All of these were features of the paper; these things end up being consolidated into figures and tables, but at the end of the day we need to check every single number. This Shiny app allowed us to take the difference between, in this case, my results, I was one analyst, and those of another analyst named Andrew, and make sure that it was zero. And in the previous tab, we could filter on whether or not a result had replicated; this view is showing complete replication. This was really helpful: before we created it, it was really difficult to keep track of the hundreds and hundreds of numbers you're trying to match. The code for this app is up on our OSF page for anyone who's interested.

Thanks a lot. I don't see any questions in the chat right now, so I have a question of my own. Sorry, I'm also trying to formulate it. Maybe I can just ask it like this: how much longer does it take to complete a study with internal replication compared to without it? What is the price in time and effort?

It's a great question.
Yeah, so as I said, it approximately doubled the person time. The calendar time could be similar, if not a little more, because you're not just coding alone: you have to have all these conversations while you're coding, in between each step. But I think it really depends on the particular analysis, how complicated it is, and on whether the data you're getting is really clean and complete or not; if it's not, then I think it could really increase the amount of time it takes to complete your analysis.

Thanks, Jade. We have a question from an anonymous attendee who's asking: is there an opportunity for a crowdsourced approach to this kind of internal replication? Like some sort of hub where analysts can make themselves available as replicators for someone else seeking one?

That's a very interesting idea that I haven't thought of before, and I guess there's no reason why we couldn't do that. Normally the advantage of doing internal replication, as opposed to external, is that your team members are more familiar with the study, so there are efficiencies there: they usually know exactly how the data was generated and they're really familiar with the study team. But again, I don't see why you couldn't do that, especially if you didn't have anyone available on the team to help.

OK, excellent. And I guess one additional question that I had: what are your recommendations for open science entrepreneurs who are trying to get their collaborators on board with internal replication? Where should they start, and what resources should they consult before getting started?

Yeah, I'd point them to our paper; we lay out lots of tips and advice on how to go about it. But I think a good starting point is using it with students and trainees. If you have someone who's new to your lab and you want to get them up to speed on how you typically do things analytically, this is a great way for them to learn the ways of your lab, to get their hands dirty, so to speak, with a project, and to get a lot of one-on-one time with the other analyst. So I would say that would be a good place to start.

OK, excellent. So I guess we've arrived at the part of our panel where we open up the floor for general discussion. If you have any questions for the panelists at large, please feel free to either post them in the chat or just raise your hand and we can unmute you to take the floor. And if the panelists themselves want to reflect or ask questions of their peers, this is a good time.

Yeah, I would like to ask you all what you think about who owns this process. I'm wondering if you have any thoughts on who is ultimately responsible for ensuring reproducibility and robustness, and maybe, beyond that, whether the science itself reflects a true phenomenon or not.

Yeah, excellent, thanks for bringing this up. This, I think, came up during your presentation. So maybe let's try to answer going around in the same order as the presentations. Fernando, do you want to take a stab? Whose responsibility is it to make research reproducible and replicable?

Yeah, that's a very good question, and I think Nick probably has a better systems approach to this.
But I think it's something along the lines of all the different agents in the system: students, original authors, journal editors, funders. I do think that standardizing the exercise of doing a reproduction a little could help with that, along with large-scale initiatives like what the Center for Open Science is doing, and providing infrastructure like the repositories that Anna is involved with. So it's a systems issue.

I actually had a student ask me this question yesterday, so I've had 24 hours to think about it. What I told them was that I think it comes down to the journals, and maybe not the journals as specific companies, but that it falls into the peer review category. I honestly think there's an evolution that needs to happen in the peer review process. Now that more code is being made available, the number of times that peer reviewers even look at the code is, I think, questionably low. So: what is the role of peer review in a world where data sharing and code sharing are becoming more of an attainable goal, how does that fit in, and how do journals keep their peer reviewers up to that task? Because I think asking reviewers to do more work for free is also a bad idea. Thinking about that whole relationship is where I think this needs to go. But this also means that authors should be sharing data and code, and I don't know if we're there yet. So again, I'm rambling because I don't have a good answer, but I think that's where I would like to see it: in the peer review arena.

Thanks, Nick. Olivia or Jade, would you like to elaborate or chime in as well?

I have some thoughts. From a practical standpoint, I think the burden is on the authors to create great metadata and reproducible data, et cetera, and to package everything nicely so others can actually understand what they did. But in order to make that a fair ask, we need to change incentives at all levels. We need to change, like Nick is saying, how peer review is done; we need to change criteria for promotion; we need to change criteria for funding. It really needs to happen across the board. Currently, at least from my vantage point, I see a mix, with a lot of early-career researchers adopting these practices, but not all. It takes more time, and it actually opens people up to some vulnerability, since their code and their work can be scrutinized while other people are getting away with really high-impact publications that are never scrutinized. So to make an even playing field and to make this the norm, it is going to take change at every single level, and especially a change in incentives.

Couldn't agree more. We have a question from Sam Teplitsky in the chat, who's asking: for all the panelists, your work is dependent, some more than others, on different products, software, and infrastructure, not all of which is open. How do you think about supporting open tools in particular so that they're sustainable and reproducible long term? Who would like to try to answer this? Jade?

I can say something first, but I won't fully answer it. We try to use all open-source software in my lab for a variety of reasons, partly getting at this point. In our case, in health research, being able to publish data is not always possible because of human subjects protections.
There is some work being done to look into whether you can create synthetic datasets that will basically give you the same answer but protect patient privacy, again by scrambling variables and that kind of thing. I think that's a novel idea, and I'm not sure where it will go. But there are increasingly more open software and tools, at least in my field, so to me, being able to share protected data feels like potentially the bigger challenge.

Any other thoughts from the other panelists?

Yeah, I can add something. I think it's a really hard question, especially when thinking about the long term; what is long term? We have some communities, such as the Python community and the R community, that are very active: the communities are big, there are many people using these tools, and there is a lot of activity going on. So if we are choosing a tool or a programming language, I would just say: choose open-source, free tools that have a lot of contributors and a big community, because that means bugs are going to be fixed and the tool will probably have a longer lifetime than others. That's my thought about it.

Thanks, Anna. Fernando or Olivia, would you like to chime in on this? OK, well then, maybe one last question to wrap things up, since we're coming up on time: what is your vision for your projects? Everything we heard today is very exciting and very promising, and hopefully it contributes to this ideal of creating a culture of reproducibility, but where would you like to see your projects in the long run, say beyond two to three years from now? How would you like to see what you have created applied in the real world? And maybe we start in the reverse order this time around. Jade.

Yeah, sure. I would love for others to adopt internal replication. Obviously it's resource intensive, but that's the easy answer. I'm encouraging my students and my mentees to do it, I talk about it when I teach epidemiologists, and I talk about reproducibility as much as I can in order to instill the culture from day one of any class I'm teaching. So my vision is for this not to feel like a crisis five or ten years from now; it's just what we do, it's good science, it's best practice.

Thanks. Anna, 30 seconds or less?

30 seconds, OK. So maybe this automatic workflow for code re-execution and code cleaning can be employed as part of the reproducibility process. Instead of people starting from scratch and trying to re-execute from scratch, maybe first we do an automatic pass, automatic code cleaning, and then give it to the students who are re-executing the code, so they can see what happened in the automatic re-execution and how it can be improved. That was a little longer than 30 seconds.

Thank you. Nick or Olivia?

I'll just chime in quickly and say that basically there's a lot going on in SCORE, so there's a lot I could say about this.
But I would love to see some real use of the results we eventually turn out: people being able to dive in and use some of the trends we find to assess other papers for what might be reproducible or replicable, along with the tools the other teams are working on, AI tools as well as crowdsourcing tools, basically using all of this to assess credibility more easily and, especially, to communicate it out to the lay public.

Excellent, thank you. Fernando, can you just unmute yourself? Yeah, yeah. So, five years from now, I would love for hundreds of students around the world, if not thousands, to be using the reproduction platform. And not only that, but improving reproducibility to the point where it becomes more and more the norm, where we see very high levels of reproducibility and can move on to the next frontier of systematically assessing the robustness of the entire body of literature. I would really like that.

Excellent. Well, thank you. A big thank you to all of the panelists for joining us today and for presenting your work. Also a big thank you to the attendees for following along; I realize that it's pretty late, especially if you're joining from the East Coast or from Europe right now. And finally, thanks to the organizers for having us and including this session in the agenda. Thank you all. Thank you.