So, I just want to introduce our next speaker. This is Alexandru Marcoci, and he's talking about predicting the replicability of social and behavioral science claims from the COVID-19 preprint replication project with structured expert and novice groups. Thanks, Alex. Thanks, Nadia.

All right, thank you. So, today I'm presenting some ongoing work I'm doing with a number of collaborators. I tried to fit everybody on the screen there, some of whom are in the audience here today, and this is about predicting the replicability of social and behavioral science claims. Now, this work is motivated in part by the already established body of evidence that replication rates in the social sciences are relatively and worryingly low. Moreover, there is emerging research suggesting that not only are many papers not replicable, but non-replicated papers tend to propagate through the literature at a much higher rate than papers that do replicate. So on the right-hand side here, you have a figure from Serra-Garcia and Gneezy's 2021 paper, which tracked the studies included in three large replication projects in terms of their impact through the literature, and you can see that papers that failed to replicate in those three projects tend to be cited more than those that did replicate.

Now, in response to this growing body of evidence, one of the reactions has been to call for more replication studies, and you now see this enacted through different policies: more journals are asking for content types that relate directly to replications, there are more calls for papers, and there is more funding for doing replication work. However, it is really not possible to replicate everything that we might want to replicate. First, replication studies are very resource-intensive, expensive, and take a long time. Second, the volume of papers in need of replication is huge and overwhelming, and even if we tried to replicate only the papers that are, or could be, most impactful and whose impact we are most worried about, even those are too many to conduct high-powered replications of. Third, there are many situations where we would like to know how reliable the evidence is, especially in crisis situations, and we simply don't have the time to conduct replications before policymakers have to decide whether to incorporate recommendations from a certain paper into their policymaking.

In response to these concerns, one thing that would be very helpful is if we were able to generate accurate predictions, using experts, about the outcomes of replication studies. If we were able to accurately predict whether replications would be successful, that would help a lot with this project. The purpose would not be to replace actual replication studies, but expert predictions could at least help us better target where replication efforts are most needed, so that we can get more return for our money and provide accurate feedback quickly. Another beneficial consequence of being able to successfully predict the outcomes of replication studies is that we could strengthen existing academic quality control practices like peer review, because although peer review addresses many aspects of research quality, one key aspect it does not address is the replicability of research. Now, all of this would be excellent.
The problem we have is that we don't know who to go to, that is, who the experts in predicting replicability are, and what the best practices are for eliciting their judgments about the outcomes of replication studies. And we know from previous research that the people you would think are most likely to be experts at predicting something tend not to be.

Now, the study I'm going to present to you today was conducted as part of the DARPA SCORE program, and I think Tim Errington from COS is going to talk more about the SCORE program tomorrow, so I won't say too much about it. This study was done in partnership between the repliCATS project at the University of Melbourne and the Center for Open Science here. In this study we recruited two groups of participants. We recruited a group of experts: just short of 100 experts in the social sciences, people who had at least a prima facie claim to being good predictors of the outcomes of replication studies. They ranged from PhD students to full professors in relevant disciplines, and they are the kinds of people you would habitually call on to do peer review. And then we had a group of novices, who were undergraduate students at the University of North Carolina at Chapel Hill. Now, the two groups differed substantially from each other. The experts had all the traditional markers of expertise you can think of: they were older, had published more papers, had higher educational attainment, had more experience with open science methods, had more experience with replications, and scored higher on a quiz we designed to test their statistical and quantitative methods knowledge.

The Center for Open Science extracted 100 claims from COVID-19 preprints in the social sciences published between April and May 2020, mainly from three preprint servers, and coordinated a massive global effort to replicate these papers. In the end, we ended up with 35 replication studies from this corpus of 100 COVID-19 preprints, 20 of which were successful, a replication rate of 57%. On the right-hand side here, you see a brief overview of the papers and our participants' judgments. And I'll try to use this. So here you have all 35 papers we replicated, and here you have the likelihoods our participants provided for these papers to replicate. In blue you have the novice judgments and in yellow you have the expert judgments. The left-hand pane shows the papers that failed to replicate; the right-hand pane shows the papers that did replicate. The top pane shows replications for which we collected new data, and the bottom pane shows papers for which we identified secondary data to use for the replication.

To elicit the judgments, we used a Delphi-style method called the IDEA protocol. The protocol asks participants to first make a private judgment about the likelihood that a paper will replicate. We elicited judgments using a three-point method: we asked participants to provide their best estimate of how likely it is for a paper to replicate, and also to provide a lower estimate and an upper estimate. The width between the lower estimate and the upper estimate we interpreted as the confidence around the participant's best estimate.
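As a minimal sketch of this three-point format, assuming a 0-1 probability scale and illustrative field names (my own illustration, not the repliCATS implementation itself), a single private judgment might look like this:

```python
from dataclasses import dataclass

@dataclass
class ThreePointJudgment:
    """One participant's private assessment of a single claim, expressed
    as probabilities (0-1) that the claim would replicate."""
    lower: float  # lowest plausible probability of replication
    best: float   # best estimate of the probability of replication
    upper: float  # highest plausible probability of replication

    @property
    def interval_width(self) -> float:
        # Read as a proxy for confidence: a narrower interval means the
        # participant is more certain about their best estimate.
        return self.upper - self.lower

# Example: a participant who thinks replication is fairly likely
judgment = ThreePointJudgment(lower=0.5, best=0.7, upper=0.9)
print(round(judgment.interval_width, 2))  # 0.4; a wider interval signals less confidence
```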
After the first round, in which each participant made these assessments privately on their own, we assembled the participants into groups of between three and six people. And in those groups, anonymously, they exchanged reasons for their judgments and scrutinized each other's justifications for the judgments they were making. After the group discussion, they went back and made a second assessment, privately again. The idea is that in round two they were asked to take stock of the conversation and the interaction they had with their peers and see whether they wanted to revise their initial assessment in any way. They were free to decline and enter the same assessment in round two as in round one. At the end of this, we generated group judgments mathematically, post hoc, by which I mean the groups did not have to reach any kind of consensus in their discussion, and very few groups actually reached a consensus in their discussion.

What we found is that there is a substantive difference between the novices and the experts in terms of their behavior. Whereas the experts remained steadfast in their judgments, changed incredibly little between rounds, and took very little stock of the discussion they had with their peers, the novices moved a lot. In terms of their best estimates, the novices shifted by more than five percentage points compared to the experts. In terms of their confidence, measured by the interval width between the lower and upper bounds, there was no difference between the novices and the experts in round one, but in round two the novices became more confident in their judgments, their intervals narrowed, and they changed by significantly more than the experts did between rounds. However, despite the differences between the experts and the novices in terms of what we think of as traditional markers of expertise, and despite the differences in how they reacted to the group interaction, the group phase of our elicitation protocol, we failed to find any difference in accuracy between the two groups. In either round one or round two, on two different metrics of accuracy, experts and novices in our study were equally accurate. What is more, we found that their judgments were moderately correlated when we looked at the 35 papers for which we had replication results, but when we extended this analysis to the full 100-paper corpus, we found that the experts' and novices' judgments were in fact strongly correlated.
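The talk does not name the aggregation rule or the two accuracy metrics, so the following is only a rough sketch under my own assumptions: the post-hoc group judgment is taken to be the unweighted mean of the members' round-two best estimates, and it is scored with a Brier score and with simple classification accuracy at a 0.5 threshold, which are common choices in this literature but are not necessarily the study's reported methods:

```python
import numpy as np

def aggregate_group(best_estimates):
    """Illustrative post-hoc group judgment: the unweighted mean of the
    members' round-two best estimates (the study's actual aggregation
    rule may differ)."""
    return float(np.mean(best_estimates))

def brier_score(predictions, outcomes):
    """Mean squared error between predicted replication probabilities and
    binary outcomes (1 = replicated, 0 = did not replicate); lower is better."""
    p = np.asarray(predictions, dtype=float)
    o = np.asarray(outcomes, dtype=float)
    return float(np.mean((p - o) ** 2))

def classification_accuracy(predictions, outcomes, threshold=0.5):
    """Fraction of claims called correctly when a prediction above the
    threshold counts as 'will replicate'."""
    calls = np.asarray(predictions, dtype=float) > threshold
    return float(np.mean(calls == np.asarray(outcomes, dtype=bool)))

# Toy example: three claims, each judged by one small group
group_predictions = [
    aggregate_group([0.60, 0.70, 0.80]),  # group leans towards replication
    aggregate_group([0.20, 0.30, 0.25]),  # group leans against
    aggregate_group([0.55, 0.40, 0.50]),  # group is on the fence
]
outcomes = [1, 0, 0]  # hypothetical replication results
print(brier_score(group_predictions, outcomes))
print(classification_accuracy(group_predictions, outcomes))
```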
In line with our preregistered hypothesis, we found that novices are more responsive than experts to feedback in the group process, and I think this opens up an interesting area of research where we can try to identify how to use their responsiveness and malleability in group processes to improve their performance, and maybe, hopefully, reach a level of performance that would make us comfortable expanding the pool of experts we go to when we are trying to make predictions about research quality and replicability. However, against our preregistered hypothesis, we did not find experts to be more accurate than novices. We thought that, given their practice and their training, people with PhDs, from early career researchers to full professors, but even PhD students, would be far more accurate than second- and third-year undergraduates, and that just was not the case in our data set.

There are many reasons why that could be. We are looking at preprints, and there are sometimes significant differences between preprints and published papers, and maybe the proxies our experts were trained to use were not present in the preprints, so that threw them off somehow. We haven't analyzed that, and we don't know whether that's the case. It could also be that this was an emerging area of research: these are papers from April and May 2020 on COVID-19, and a lot of the data collection happened throughout the pandemic, which might have impacted their judgments in some way. There is also a further question here, which is: what would be a meaningful or practical difference between the experts' and novices' predictive ability that would make us think that novices, for instance undergraduate students, would be equally suitable as PhD students or early career researchers to go to and ask about the quality of a research paper? The manuscript is now out on MetaArXiv, and I encourage you to read it for further details. If you have any questions, I'm happy to answer them now, or please email me; I have my contact information here.

Thank you, Alex. We have some questions here at the front. One that just jumps immediately to mind: I'd love to see you try a large language model and see how it does. You'd have to tune a system prompt a little bit to get it to be decent, I suspect, but I would bet a small amount that it would be better than either your experts or your novices. It would be a wonderful experiment to try. Have you thought about that?

So that's an interesting question, and maybe Tim will say more about this tomorrow, but part of the SCORE program was to use artificial intelligence to learn from this and other data sets we generated throughout the program, so that we can train a system to do these assessments automatically. So this is, I think, still ongoing work. I don't think they're looking at a large language model; at least the last time I checked they weren't using large language models, but they are looking at AI systems to generate these predictions automatically based on the expert data sets that we've been producing.

The beauty of large language models is that they produce some amazing results with no work at all, or very little, so they're worth a try.

Hi, thanks for that paper. Nicole Nelson, University of Wisconsin-Madison. I have a question for you about your prediction curves. Some of them look rather normal and some of them look quite bimodal, and the bimodal distributions sometimes tracked across both experts and non-experts and sometimes didn't. So I wonder if you have thought about analyzing the data based on the shape of those curves, and whether you have any thoughts about the kinds of situations in which you get a highly bimodal distribution, which suggests that there must be something in the paper that's really divisive, basically, as to whether or not people think it will replicate. Thank you.

So that's a fascinating question. We haven't done that, so I don't have an answer to this question for this particular data set. We haven't done any qualitative analysis to try to understand why some papers divided our groups in that way, but you're correct to observe that this is a behavior we've noticed. In other work that we've done, we did try to do some qualitative analysis of the rationales our participants provided for their predictions, though not in this study. In that other study, we looked at the reasons people provided for their predictions. I think Martin was one of the people who did that analysis, for a paper that is just coming out in Royal Society Open Science. And we found identifiable markers in the rationales that predict the accuracy of our participants, things like the number of reasons they invoke, or certain aspects of the paper they attend to. There are things you can identify in the rationales that tell you how accurate their predictions are likely to be. But what leads to this divisive assessment of a paper? I don't know, and it's a good question. It's something that is worth pursuing.
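As a purely hypothetical illustration of the shape-based analysis raised in that question (not something done in this study), one could flag a divisive claim by checking whether a two-component Gaussian mixture fits the participants' best estimates markedly better than a single component:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def looks_bimodal(best_estimates, random_state=0):
    """Crude flag for a 'divisive' claim: True when a two-component
    Gaussian mixture fits the participants' best estimates better
    (lower BIC) than a single component does."""
    x = np.asarray(best_estimates, dtype=float).reshape(-1, 1)
    bic_one = GaussianMixture(n_components=1, random_state=random_state).fit(x).bic(x)
    bic_two = GaussianMixture(n_components=2, random_state=random_state).fit(x).bic(x)
    return bic_two < bic_one

# Toy examples: clustered judgments vs. judgments split into two camps
print(looks_bimodal([0.55, 0.58, 0.60, 0.62, 0.65, 0.60, 0.57, 0.63]))  # typically False
print(looks_bimodal([0.10, 0.15, 0.20, 0.80, 0.85, 0.90]))              # typically True
```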
I agree. Thank you, Alex. That's about it for questions. Thank you very much.