Last up, before lunch, we've got Tim Errington, who's going to talk about objective human and machine assessments of confidence in research claims.

Great, thank you. I get the honor of being the last one standing between you and lunch, and we're running over, so that's a really good honor to have. I'm going to try to stay on time as best I can. I'm going to give you an overview of a project that's been going on for the last couple of years, but before I get there, I want to present the motivation for it. These are not actual numbers, just made-up ones, but they show the present state we're in: how do I assess confidence in research? At the bottom of this table are the things I can do really quickly. I can use heuristics, and we've just been hearing how dangerous those are. They give me low accuracy, as we know, but I can do it really fast. At the top of the table are the things we know can give us higher accuracy about credibility, but they cost a lot, both in time and money. And so this is the challenge we have: we want accuracy and we want it at scale. That led to the motivating question: can we do that? Can we start to develop tools and methods that let us resolve this tension between wanting both of these things?

So this was the motivation for a program that DARPA funded that has just concluded. We spent the last four years trying to see if we could assess confidence in research evidence. I'm going to give you a bit of an overview just to give you context for the data that I'll show you quickly. There are a couple of different layers the program was built on. One is the evidence we're going to use: claims from the literature, and I'll go into that in a bit. But we also wanted to ground it in something, a ground truth, in this case replicability serving as our ground truth for credibility. Now that, as I just mentioned, can take a lot of work, and DARPA invested quite a bit to see how much evidence we could collect. But the goal was to see if we could scale. So another layer within the program was assigning confidence. Humans do this all the time when we read, so that layer was about assigning confidence scores on a scale of zero to one. And then the aim was to see if we could really push this at the algorithm level, using machines. Importantly, each of these levels depends on the others; they help validate each other, and they each have different issues of scalability. Reproducibility is highly accurate but very difficult to do; machines are the ones we're trying to push to see how well they can map onto it.

All right, here is our sampling strategy. We used the literature: a corpus in the social and behavioral sciences. We looked at 62 journals across these fields and built a corpus below that by random selection, which gave us this huge sample of about 500 papers per journal, roughly 27,000 empirical articles from the SBS literature. This was our basis. From that, we had to sample down, with stratified sampling, to a more tolerable number, since claim extraction itself is still a manual process, and I'll show you what I mean by that. But we have a data set of 3,900 papers that were the basis for all of these predictions.
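As a rough illustration of the kind of stratified down-sampling just described, here is a minimal sketch in Python; the DataFrame columns, file name, and choice of strata are assumptions for illustration, not the program's actual pipeline.

```python
# Minimal sketch of stratified down-sampling a corpus of papers, assuming a
# pandas DataFrame with one row per paper. The column names ("discipline",
# "journal") and the file name are illustrative assumptions.
import pandas as pd

def stratified_sample(corpus: pd.DataFrame, n_target: int,
                      strata=("discipline", "journal")) -> pd.DataFrame:
    """Sample roughly n_target papers, proportionally within each stratum."""
    frac = n_target / len(corpus)
    return corpus.groupby(list(strata)).sample(frac=frac, random_state=42)

# corpus = pd.read_csv("sbs_corpus.csv")          # ~27,000 empirical articles
# claim_pool = stratified_sample(corpus, 3_900)   # papers sent on to manual claim extraction
```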
All right, I just told you claim extraction is manual, and it is. Here are a couple of different ways we did it within the program. One is what we call a single-trace approach. I take a paper, and this is just an example: it's got a title, and I look in the abstract to find some key argument or claim being made, a key result. You can find evidence below that that supports it, and really I'm trying to tie it down to a single statistical element. You can see that we don't always capture everything, and that's on purpose. We're hunting to see if we can find a single claim that everyone in the program can anchor on, one that maps back up to what the paper is about but doesn't represent the entire paper. As a result of that, we tried another approach we call multi-trace, which is: let's look at all of that evidence. This is actually how we write papers, and it's a more accurate description of what the literature looks like. So let's see if we can look at all of the evidence in a paper and see how that maps. I'll show you a bit of both, but that's why I wanted to give you this context.

All right, beyond having those 3,900 papers, when we wanted to do the objective part, the ground truth of actually redoing some of this experimentation, we had to go down a little bit farther. So we did stratified sampling to get it down to a more tolerable number, 600 papers. I laugh just because I say that's tolerable even though I know it's not; it's a ridiculously high number to attempt to interrogate. And then we took an even smaller number, down to 200. These are the two places where we focused our empirical evidence.

All right, so the first thing I'm going to show you is outcome reproducibility. What I mean by this is: I take the same data that supported the original claim and analyze it the same way. Ideally that means analytical data plus code, but there are other ways to get there. This is what we found with the single-trace approach across disciplines. These are over 150 reproductions that we did, and we look at them in different ways: whether they precisely reproduced key aspects of the original paper, whether they didn't, or whether they were approximate, meaning within a 15% degree of error. And you can see, across the board, we don't see 100% reproducibility anywhere. There's another way to look at this data that I want to show you as well, which is how we did the reproduction. Like I just told you, sometimes you get what I'm calling here author data, analytical data and code. That's on the bottom, and this shows you our success rate: if somebody shares that, we have a much higher chance of reproducing it, compared to if all I have is their data and no code and I have to reconstruct the analytical strategy, or if all we have is the source data, where somebody went to two different places of information, compiled that data together to create an analytical data set, and then analyzed it. When we do the multi-trace approach, now we're looking at over 500 reproductions. Each dot here is now a claim, so papers with more claims are represented more than once. You see the same trends: there's variation across disciplines in how accurately we're able to reproduce things in this outcome sense, and, as before, it's much easier if you're working from author data and code than if you have to reconstruct any part of that pathway.
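To pin down the precise-versus-approximate distinction, here is a minimal sketch of how a reproduced estimate could be compared against the original one; the function and the way the 15% tolerance is applied (as relative error against the original value) are illustrative assumptions, not the program's actual scoring code.

```python
def reproduction_outcome(original: float, reproduced: float, tol: float = 0.15) -> str:
    """Classify an outcome reproduction as precise, approximate, or not reproduced.

    Assumes the tolerance is a relative error against the original value;
    the program's exact operationalization may differ.
    """
    if reproduced == original:
        return "precise"
    if original != 0 and abs(reproduced - original) / abs(original) <= tol:
        return "approximate"   # within a 15% degree of error
    return "not reproduced"

# Example: an original effect of r = 0.42 reproduced as r = 0.45
# print(reproduction_outcome(0.42, 0.45))  # -> "approximate"
```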
The program really anchored, though, on replication. Again, I'll use my definition here: by replication I mean using different data than was used in the original paper. Either I collect it again, or, say it's an economics paper that used existing data, I look for similar data that's out there, but I'm conducting the same or a similar analysis again. What did we find when we did this? The note here is that the yes/no, because replication is also very hard to define, is defined fairly rigidly: is the result statistically significant and in the same direction as the original finding? These are 150-plus replications that we did, and they're all sitting right around that 50% mark. We can also look at it by the two ways replication occurred in the way we did it: either I collect data again, send a survey out and collect new data in a similar sample, that's the new data collection you see, or I find similar data that wasn't used in that original cohort, that somebody else collected, so secondary or existing data. You see the same trend; the type of replication doesn't really matter. And if we look at all of our replications across all of these papers, at this point over 250 replications, you still see similar trends: some variability across disciplines, but by and large the replication success is floating between 40 and 60%. This is in line with what we've seen in the literature, and again it doesn't really matter whether you're going with new data or not.

All right, so that's a lot of information that we collected, and that we're going to be reusing too, but within the program this really led to the question: how do humans do? How do they compare? Now that I've told you it's not all one or the other, there's a lot of variability in here. These are some results that look at two different human approaches. The one you see there on the right, markets, uses prediction markets to basically have humans come together and use a marketplace environment to determine how replicable they think the research is going to be. The one in the middle uses what you heard yesterday if you were in Alex's talk, the IDEA protocol, basically having a type of peer review occur, with experts assessing whether they believe a result will replicate. And then a simple combination, just combining these two methods together. This graph shows their scores on that zero-to-one Y axis, and whether something replicated is marked in black or white. For the summaries you see there: accuracy is just, if a replication was successful, was their score above 0.5, and if it failed, was it below 0.5? A different one, AUC, the area under the curve, takes ranking into account. In both cases, you see that they're better than chance. There's still room for improvement, but it's better than chance.
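For anyone who wants those summaries made concrete, here is a minimal sketch of the binary replication criterion and of accuracy at a 0.5 threshold and AUC, computed with scikit-learn on made-up scores and outcomes rather than program data.

```python
# Minimal sketch, with made-up numbers: the binary replication criterion
# described above, plus accuracy at a 0.5 threshold and AUC for human scores.
import numpy as np
from sklearn.metrics import roc_auc_score

def replicated(p_value: float, original_effect: float, replication_effect: float,
               alpha: float = 0.05) -> bool:
    """Statistically significant and in the same direction as the original finding."""
    return p_value < alpha and np.sign(original_effect) == np.sign(replication_effect)

# Hypothetical expert confidence scores (0-1) and replication outcomes
# (1 = replicated under a criterion like the one above, 0 = failed).
scores   = np.array([0.82, 0.55, 0.61, 0.48, 0.90, 0.20])
outcomes = np.array([1,    0,    1,    1,    1,    0])

accuracy = np.mean((scores > 0.5) == outcomes)  # score above 0.5 for successes, below for failures
auc      = roc_auc_score(outcomes, scores)      # ranking-based: do successes outrank failures?
print(f"accuracy = {accuracy:.2f}, AUC = {auc:.2f}")
```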
There's another way to look at this, which is not just in those dichotomous terms but in terms of the probability of success: how do those scores relate to the probability of replication success? And you can see here that they both trend up, and that the combination of the two gets really close to the diagonal line we would hope for from this type of measure. So they're good. Humans are able to do some of this, and to do it at scale across disciplines.

Well, how do the machines do? I'm going to give you just a little bit of evidence here because I'm running out of time. There were three different approaches used within the program, and we also did a combination of all three, which you can see there on the bottom left. This is showing you just the expert scores, because that's where we had the most data, compared to the three different algorithmic approaches. There were two ways used to assess this: a simple correlation coefficient, r, and the root mean square error. What you see, especially in the combined one, is that there's some improvement. It suggests that, yes, there's a lot of noise in this system, but maybe the machines are able to tease apart a little bit of what the human teams are doing.
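As a reference for those two summaries, here is a minimal sketch of the correlation coefficient and root mean square error between expert and algorithm scores; the arrays are placeholders, not program data.

```python
# Minimal sketch of the expert-versus-algorithm comparison summaries:
# Pearson correlation (r) and root mean square error (RMSE). Placeholder data.
import numpy as np

expert_scores  = np.array([0.72, 0.40, 0.65, 0.55, 0.83])  # hypothetical expert confidence scores
machine_scores = np.array([0.68, 0.52, 0.60, 0.47, 0.79])  # hypothetical algorithm scores, same claims

r    = np.corrcoef(expert_scores, machine_scores)[0, 1]
rmse = np.sqrt(np.mean((expert_scores - machine_scores) ** 2))
print(f"r = {r:.2f}, RMSE = {rmse:.2f}")
```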
All right, now, because I want to make sure I have time for a couple of questions before we end: how do we take stock of where we are right now? Back to that motivating question: can we create these rapid tools? I think the evidence right now suggests, yeah, a little bit. There's some proof of concept. We're able to deploy this, and we're able to see some indication that it's better than chance, but we really need to keep investing in this. And that's part of what we're hoping to do, besides getting some of this information out into the wild: let's actually try this and get some user feedback. How would you feel if you submitted us a paper and we scored it? What would you think about it? And there's more than just the scores that I'm not showing you; there's a lot more about the rationale behind the score, the explainability, that's important as well. Also, the program was designed around really only a couple of algorithmic approaches; what if we do more? We've talked a lot about bias, and having more, different approaches, creating more of a marketplace, is a great way to get at bias, especially in AI systems. So as we release our data, expect to hear from us that we'll also be looking for people who are interested in trying to build algorithms. We also know that this concern about replicability and credibility is not unique to social and behavioral science; it applies to research at large. So can we keep pushing the boundaries and go into additional fields? And the last one: this is another tool, and as we just heard, when you create another metric there's danger in that. There's a lot to unpack beyond a number, so how do we convey this if it's actually going to become a deployable tool? The worst thing we can do is distill it down to a single number when we know research is much more complex than that. So we'll be beginning some responsible-use research on that. And with that, I want to thank everyone who did the work, which is not me; it's all the other teams that were involved in the program. These are the teams, and there were thousands of researchers who either helped with the replications and reproductions or were participants in the human research. Thank you for your time.

All right, almost exactly three minutes for questions.

Yeah, thank you for that interesting talk. I wondered if you'd been considering hybrid, human-and-machine-in-the-loop iterative approaches, because humans can annotate, get feedback from the machine, and then improve it.

Yeah, that's a great question. Again, I'm not the person to talk to, but Sarah is here; maybe she can raise her hand and you can talk to her, there she is. Sarah is trying that. So spot on, it's a really good question, and I think you're absolutely right: it's not just how those interact, there's also that explainability factor that's really important in this research, and this is a better way of getting at it. So, great question. Thanks.

Other questions? I'm not sure whether I got it, but the accuracy for the predictions looked binary to me; did you consider Brier scores or something like that? Sorry, consider what? The predictions, the accuracy, you measured in kind of a binary way, yes or no; did you consider Brier scores? Yeah, no, I don't think so, we didn't consider that. I don't know if any of the different approaches used them, but you're right, this was very much done in a dichotomous manner, so there's a limitation there.

Thanks, Tim. Were there any criteria that predicted good predictions? So if we were to use this and evaluate papers, are there any criteria you would suggest we use: sample size, randomization, anything like that? I think you're asking the million-dollar question. That's exactly the thing to do, and I think that's also the direction to go in. A lot of the evidence we have now allows us to begin to interrogate that, but as we keep moving forward and try different approaches, we can also start to understand more of the research by having tools that really extract more of the literature to get at this question. As of right now, though, I don't have a good answer for you. I hope maybe we'll get some insights, and even if I gave you insights now, I'm sure they'd be wrong, so we'll just keep working on it.

So lunch and the poster session are out in the Great Hall; please grab lunch and maybe take a look at the posters. Thank you, Tim.
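For reference, the Brier score raised in the second question is the mean squared difference between a probabilistic prediction and the 0/1 outcome, so it rewards calibration rather than only whether a score falls on the right side of 0.5; a minimal sketch with made-up numbers follows.

```python
import numpy as np

def brier_score(predictions: np.ndarray, outcomes: np.ndarray) -> float:
    """Mean squared difference between predicted probabilities and 0/1 outcomes (lower is better)."""
    return float(np.mean((predictions - outcomes) ** 2))

# print(brier_score(np.array([0.82, 0.35, 0.61]), np.array([1, 0, 1])))  # -> approx. 0.10
```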