Hello and welcome back, everyone. I'm really excited to kick off our next session here at the Metascience 2021 conference. Our topic today is forecasting scientific outcomes, and we have four amazing speakers we're going to hear from. Just to lay out the format for the audience so you know what to expect: we'll go through each of the four speakers first, for about 15 minutes each, to give their backgrounds — all very different approaches to tackling this really important and emerging area. We'll save the rest of the time for moderated Q&A, so as with all the other discussions and talks we've had today, please add your questions to the Q&A as they come up and I'll manage them at the end. Please add as much as you can there, and if we run out of time we can spill over into further discussion on Remo. Since we want to maximize our time, let's jump right in. Our very first speaker today is Eva Vivalt, who is an assistant professor of economics at the University of Toronto. Go ahead and feel free to share your screen.

Can you see this? Thanks so much for having me here — it's really a pleasure. This is a huge topic, and it's especially nice to be presenting with so many other people working in the same area. Today I'm focusing on using forecasts in research, and I'll be drawing on experiences with the Social Science Prediction Platform. In particular, I'll go through some reasons why somebody might want to collect forecasts of their research results, and tell you a bit more about how the Social Science Prediction Platform is actually being used by researchers.

Roughly speaking, I'm going to highlight five general kinds of reasons. This is not an exhaustive list, but I would argue that collecting forecasts of research results can help in evaluating the novelty and credibility of research results; can help, in the long run, to mitigate publication bias against null results; can help us answer substantive research questions, such as about belief updating; can provide priors for Bayesian analysis; and can even be useful in experimental design. Let me say a little more about each of these, but I also want to flag, if you're interested in this topic, that some of these points are taken from a Policy Forum piece in Science that DellaVigna, Pope, and I wrote a couple of years ago now — though I think these actually go beyond that too.

So, evaluating the novelty and credibility of research results. How many of us — probably all of us — have been in a seminar where you're presenting and somebody says, "yes, but we knew that already," or you're being refereed and you again get the critique, "well, what does this really add to the literature?" Gathering ex ante forecasts can really help in addressing this kind of critique, especially since people are, as we know, subject to hindsight bias: ex post, everything is obvious. So you really need to gather forecasts ex ante. And yet I would also argue there's a subtle trade-off between novelty and credibility that a lot of people don't recognize. By this I mean that a really surprising result might also be less likely to replicate in the future. I'm not taking a side here — it's a double-edged sword — I'm just pointing out that there is a bit of a tension.
So there are many, many potential uses of forecasts, and these are just two of them. One way people may like to use forecasts is to help mitigate publication bias. As we all know, significant results are much more likely to be published than null results, where the null is zero. Yet if you gathered ex ante forecasts and that zero result was surprising, that makes it a bit more interesting. And there are theoretical reasons to want to test against the current status-quo expert opinion, rather than always testing against zero or against whatever other alternative hypothesis you've got — why not use the forecast as the null? All that said, I want to caution that if everybody goes down this route, you could over time see publication bias slant increasingly toward surprising results rather than null results. But nonetheless, it might still be more intuitive to compare against the current status quo.

People have also been using forecasts to answer substantive research questions. Here's an example from Advani et al., which is on the platform: do people know what share of articles in economics are published on race-related research topics? They also cover political science and sociology. These are box plots showing the distribution of forecasts, and the black lines are the actual true values — they asked people to essentially predict summary statistics: the share of articles on race-related research topics, as they defined it. I've also got some work with Aidan Coville where we ask policymakers how they update their beliefs in response to new evidence, and for that you need priors — you can't measure belief updating without priors. Similarly, Bayesian analysis depends critically on priors; the question is always which priors, and having to defend that choice. I would argue that expert forecasts are one principled approach to selecting those priors. There's also an upcoming registered report at the Journal of Development Economics that compares Bayesian and frequentist evaluation methods, and the authors explicitly gathered priors for the Bayesian analysis.

Then, in terms of experimental design: imagine that you are the UK Nudge Unit and you want to study police retention, but you've got ten different treatments you could test and can only run, say, three of them — which treatment arms do you run? Or think of this in the context of replication: should you replicate that paper? And this is important even if you don't have many treatment arms. A common response is, "hey, I don't have that kind of freedom, I'm evaluating one thing — how is this useful to me?" I would still say this is useful to you, because there's no shortage of choices: which outcomes, how many survey rounds, which questions you ask, how many times you ask them, how many rounds of data you gather for your repeated measurements. And you can always think of priors as a useful input to your power calculations — see the sketch below. We're really lucky, because so many of the other people here today are talking about basically bullet point three: DARPA SCORE — everybody is here. So that's fantastic.
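To make that last point concrete, here is a minimal sketch, in Python with statsmodels, of how elicited forecasts of an effect size might feed a power calculation. The forecast values, the sample size, and the design (a two-arm comparison of means) are all invented for illustration — this is one way to use priors in power calculations, not the platform's prescribed method.

```python
import numpy as np
from statsmodels.stats.power import TTestIndPower

# Hypothetical elicited forecasts of a standardized effect size
# (e.g., Cohen's d) from several expert forecasters.
forecasts = np.array([0.05, 0.10, 0.15, 0.20, 0.35])

analysis = TTestIndPower()

# Conventional approach: sample size for a single assumed effect.
n_single = analysis.solve_power(effect_size=forecasts.mean(),
                                alpha=0.05, power=0.80)

# Prior-averaged approach: expected power at a given n, averaging
# over the forecast distribution rather than one point value.
n_per_arm = 500
expected_power = np.mean([analysis.power(effect_size=d,
                                         nobs1=n_per_arm, alpha=0.05)
                          for d in forecasts])

print(f"n per arm for d = {forecasts.mean():.2f}: {n_single:.0f}")
print(f"expected power at n = {n_per_arm}: {expected_power:.2f}")
```

Averaging power over the forecast distribution guards against designing the study around one optimistic point estimate.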
There has been really huge growth in people collecting forecasts for individual research projects — collecting forecasts for their own study. So Stefano DellaVigna and I created the Social Science Prediction Platform, a centralized platform for forecasting research results in the social sciences. What is being forecast? It can be summary statistics like the mean, or full distributions of priors from each forecaster. The forecasters can be other researchers, but they can also be policymakers or members of the general public — it's really left open to the researchers. As you can imagine, this is a pretty big endeavor; here's the team.

The platform offers several advantages. It's nice to be able to coordinate learning about forecasts by gathering them in one place like this. We can somewhat mitigate the public-goods problem: you don't want lots of people writing emails to hundreds of others trying to get them to forecast their own study — that would not be sustainable. So the platform coordinates a little, but it also allows us to look across different projects, and since it allows tracking of forecasters over time, it gives us a panel — so we can, for instance, identify superforecasters. The platform also provides third-party certification of when the forecasts were gathered and when they were made available. This can be useful if, for example, you want to credibly say, "look, I gathered the forecasts when I didn't even have data yet, when I had no way of biasing the questions that were asked." And you can even choose when the forecasts are made available to you: you can say they're made available only after you filed the pre-analysis plan, or something like that, so you're binding your hands and can argue that the analysis is blind to the forecasts. If you want — it's an option. And just by coordinating, the platform makes it easier for researchers to collect and use forecasts — we've got templates and such, and essentially a survey pool — and it makes it easier for forecasters to provide and learn from forecasts. I'll give you a little example of that. At the end of the Advani et al. study I mentioned earlier, on race-related papers, this was the distribution of the values people estimated for the share. You can see the true result as the black line and the response mean as the green line — and that's actually my own response you can see there.

To say a little more about types of priors: we've really been leaving this open to the researchers eliciting forecasts. They can be results from field experiments or lab experiments, or summary statistics — and this notably includes first stages, or even estimates of model accuracy. For first stages, think of something like the take-up rate of a program: you want to know the overall effect of the program, but that's obviously influenced by the take-up rate, so you might elicit that first as your first stage. Actually, if you look at what people are getting forecast on the Social Science Prediction Platform, surprisingly to me, people are often asking for summary statistics — important summary statistics — and that's even after separating out the first stages, which are even more popular. So it's about an equal share of summary stats and treatment effects.
So that's actually pretty interesting. And who are the forecasters? Well, we leave that open to the researchers using the platform. Oftentimes people want forecasts from other researchers, to see what the discipline thinks of a certain issue. Sometimes they're policymakers — in some of my work with Aidan Coville we've surveyed policymakers, and many others have as well. And sometimes you may want forecasts from people like your program beneficiaries, to see what they think and whether they can predict what's going to happen — not necessarily people in your treatment group or your control group, but people like them at least. You can certainly do that. In total we currently have over 2,000 registered users, but that's just the registered users; there are many more people who take one survey — they were invited to one particular study, say — and never fully sign up.

Of the researchers who created a profile, here's the distribution: you can see it's still dominated by economics. But we haven't really done much outreach to other disciplines apart from word of mouth and such, so there's certainly much more growth to be had there. It's also just interesting that, despite that, we've had a lot of interest — 40% participation from other disciplines. So if you're in another discipline not shown here: yes, you are welcome to use this. It's for any kind of social science. Within economics, the forecasters' fields are again a mix — mostly various kinds of applied micro, behavioral, development. Of course, we don't have these data on anonymous users.

In terms of incentives, again, we leave this up to the researchers who are eliciting forecasts — everything is really geared toward making this something researchers can use valuably in their own studies. So we allow both incentivized and unincentivized studies; if a study is incentivized, the researchers pay their own respondents. We're also working on building in more public recognition of good forecasters: in the near future there will be a leaderboard where you can see how well people have been doing.

Just to give a bit more background on how this typically works: a researcher designs a study and collects baseline data, if they're going to do that; uses that to inform their forecasting survey — oftentimes you want some baseline stats or something in your survey to make it more concrete — and then sends it to the online platform. The platform distributes the survey, and you as a researcher go off and gather your results data. Whenever you pre-specified that you would want the forecasting survey results back — it could be immediately, so you watch them come in as they arrive — those results are released back to you. And when you have study results, you upload them to the platform, and those are automatically released back to all the forecasters. This is also nice when, in some IRB application, you've said, "well, I'm going to share results back with the participants at the end of the study" — this helps with that, right?
This is the general flow, but it's just a typical flow — there are deviations. Say you wanted to use this for designing your experiment in the first place; then obviously you're going to want to get the forecasts first. And there are multiple distribution options: we've got our own pool, so you can get forecasts sort of passively from people who go to the website; there are personalized email links to target certain groups — you can make different subsets of these, have multiple versions, and definitely track where people are coming from; and you can of course also distribute through an anonymous link. So here is the site. I hope you check it out, and I hope it can improve social science in the long run. There's the URL and contact info for more.

Wonderful, thank you — that's great. I know there's more we could go into there, but as we said, we're going to hear from everyone else first. Everybody, keep thinking through the questions you have, because we'll come back together — and thank you also for forecasting the fact that, yes, we have other panelists here doing some of the exciting projects you mentioned, like SCORE. Next is Fiona Fidler, who is a professor at the University of Melbourne, split between the School of BioSciences and the School of Historical and Philosophical Studies. She is also joining us very early in the morning, so thank you — hope you got your coffee. Go ahead, take it away.

My coffee machine is broken. It's a disaster. And Eva's talk was so clear and well paced, and this one may not be, so we'll see how we go. I'm talking today about the repliCATS project — the CATS part of the acronym stands for Collaborative Assessment for Trustworthy Science. This is a big project; here's the wonderful team that is running it as part of the DARPA SCORE program, and in fact all of the talks that follow me in this session will also refer to the DARPA SCORE program, so I'll talk a little about what that is. I'll cover the structured elicitation protocol in repliCATS, our platform, our community, and the mathematical aggregation we're using. One feature of repliCATS is that while it is a structured deliberation process, not unlike a Delphi process, we rely not on behavioral consensus but on the mathematical aggregation of individual judgments. I'll talk a little about prediction accuracy in our pilot and preliminary results. And then I want to spend a couple of minutes on where this project is going beyond forecasting replicability, which has been the primary focus so far, and how we might use a structured elicitation protocol like this to reimagine how peer review works.

There are two new acronyms on every slide here — I'm sorry about that. DARPA is of course the Defense Advanced Research Projects Agency of the US Department of Defense, and SCORE stands for Systematizing Confidence in Open Research and Evidence. This is a very large program with three of what they call technical aspects. The first technical aspect is actually coordinating the ground truth: the replication studies. This is run by the Center for Open Science — they're doing a lot of other things for this project as well, but that's the part I'll refer to today. Technical aspect two is where the repliCATS project sits, and this is about eliciting expert predictions about the outcomes of hypothetical replication studies.
So UniMelb repliCATS — that's us. There was a second team in the first phase of this program, in this technical aspect, called Replication Markets, and Anna, who is speaking after me, will talk a little about that among other things. The third technical aspect is really the blue-sky part of this program: using machine learning and AI to develop algorithms that can make these kinds of predictions about replicability and other features of credibility — basically to assign confidence scores to the published literature. Sarah, who is talking after Anna, will cover that aspect of the program.

Most of what I'm going to talk about today is what we did in the repliCATS project in phase one, which ran until about the end of last year. In that first phase we elicited predictions about the replicability of 3,000 published articles, published in 62 journals across eight social science disciplines: business, criminology, economics, education, political science, psychology, public administration, and sociology. All of these articles were quantitative; we're not yet up to looking at qualitative research. We did this by relying on an amazingly diverse and dedicated pool of participants — now, in the second phase of the project, about 700 participants from more than 40 countries across these eight domains — and I can see in the attendee list that a number of them are here today, so thanks. We've run a combination of face-to-face workshops at conferences — which, of course, over the last 18 months has moved to online and remote elicitation — and we've had partnerships with various early-career-researcher journal clubs, most notably the ReproducibiliTea journal clubs.

For each of these 3,000 articles, we've had between four and six assessors assess the primary research claim of the article. This has given us a huge amount of data: thousands of individual quantitative judgments, along with explanations and justifications of those judgments. So we're eliciting quantitative judgments, but also the reasoning and justifications that sit alongside them, which I'll explain a little more.

Our elicitation method is not a prediction market — we'll be hearing about prediction markets shortly. Ours is a structured deliberation protocol, closely related to a Delphi process but with some clear differences. Participants first make private, first-round judgments about the claim they're assessing. They then share those judgments with the other members of their group and have a facilitated discussion. The group members are encouraged to interrogate each other's judgments and to share information. The point of the discussion is not to reach a consensus; it's to think about counterfactuals and explore new ideas. After the discussion, participants make a second round of updated private estimates, which we then mathematically aggregate using a number of different aggregation models. The important thing to remember is that this is not a consensus model — in fact, we encourage disagreement and new ideas in discussion.
We put all of this protocol together in a custom-built platform. On the left side you have information about the article, including a link to the full text and a summary of the results. In the middle pane, in the first round, participants answer questions; what you're seeing here is a view from the discussion round, where participants can see each other's interval judgments. The question in the middle of the screen is: what is the probability that direct replications of this study would find a statistically significant effect in the same direction as the original study? Here you have answers from five different participants, and you can see their interval judgments, so you've got information about how certain they are about that probability judgment. And in the final pane are their comments — the reasoning and the justifications for the quantitative judgments they've given.

Our platform is user-driven, interactive, and social. There's an opportunity for our assessors to get feedback and calibration: they get to compare how their judgments stack up against others'. We think this process is intrinsically rewarding, and we've also added some gamification — little badges that participants earn along the way, which they seem to like. This is a picture of what our community dashboard on the platform looks like. And importantly, the badges — here are some more of them — some count up how many claims or articles people have assessed, but we also award badges for the types of behaviors we want to encourage: participants can earn a badge by looking up terms in the glossary, or by interacting with other participants' comments and reasoning.

With the repliCATS community, we have aimed to recruit for diversity. As I said, we now have 650 active participants from 40 different countries. They are predominantly PhD students or early postdocs — early- to mid-career researchers — but we do also have some professors, across the whole spectrum of the disciplines we're interested in and even some people outside those disciplines. One thing we've noticed over the course of the workshops is that there's a real community capacity-building aspect to this: the process we've developed not only elicits predictions about replicability but functions essentially as deliberate practice for peer review. Early career researchers in particular have felt benefits from this and have returned to workshops a number of times for this kind of training in peer review — the ability to get feedback and to calibrate their assessments against other people's. As I've mentioned, the way we deal with these individual responses — the way we turn them into a decision — is by mathematical aggregation, not by any kind of behavioral consensus through discussion. So I want to talk now about some of the models we're using to do this. What's fed back to participants on the platform is a simple unweighted mean of their responses.
And that's sufficient, we think, for the feedback part of the process. But when we aggregate the final judgments, we have a whole suite of aggregation models we're looking at — 22 of them, in fact. You'll be relieved to hear I'm not going to talk about each of these in turn. But there are three basic categories of aggregation models we're working with. The first — the red box up the top here — is basically different versions of arithmetic means, weighted means, and medians. The final box down the bottom is Bayesian models. And the box in the middle is what I'll spend the most time on today: models that weight judgments by various proxies for performance.

Obviously, one of the very difficult things about forecasting is trying to judge beforehand who is going to be a good forecaster. We know from lots of previous research that traditional measures of expertise — number of publications, years of experience — turn out to be not very good indicators of prediction accuracy. So we've set about a project of trying to find other things that will be better proxies for performance. Some of the things we've been exploring in these models are: informativeness, which we operationalize by the interval width of participants' judgments, so we can weight more heavily judgments that express a lot of certainty and down-weight ones where the intervals are very wide; prior knowledge, where we give participants a quiz before they start these workshops, asking various statistical and methodological questions, and weight by performance on that quiz; extremity — how extreme or how asymmetric estimates are — where we've had a theory that more extreme or asymmetric estimates might be indicators of more information, a theory we've now been able to test; open-mindedness, meaning how much participants shift their judgments between round one and round two, since we think integrating new information is likely to be a proxy for good performance; engagement, such as the amount of time they spend on the platform or the number of words they write; and reasoning — because we are collecting those open-ended justifications, we can qualitatively analyze them and use them as a weighting proxy.

So here's an example of some of the things we've done in these aggregation models. As I said, we've weighted by interval width, or by asymmetry — whether someone's best estimate is in the center of the interval or off to the side. We can apply various transformations to means, and so on. We can weight by openness to changing your mind, or by prior knowledge — performance on that quiz. We ask people a comprehension question as part of the elicitation, and we can weight by that, or by their self-rated understanding of the paper.
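As a rough illustration of the general shape of these proxy-weighted models, here is a minimal Python sketch that aggregates round-two probability judgments with weights derived from interval width, so that narrower (more certain) intervals count more. The judgments and the specific weighting formula are invented for illustration; the actual repliCATS models are more involved.

```python
import numpy as np

# Hypothetical round-2 judgments from five assessors on one claim:
# (lower bound, best estimate, upper bound), all probabilities in [0, 1].
judgments = np.array([
    [0.40, 0.55, 0.80],
    [0.10, 0.25, 0.45],
    [0.50, 0.60, 0.65],
    [0.30, 0.50, 0.90],
    [0.45, 0.55, 0.70],
])
lower, best, upper = judgments.T

# Simple unweighted mean -- what the platform feeds back to assessors.
unweighted = best.mean()

# Informativeness weighting: narrower intervals get larger weights.
width = upper - lower
weights = (1.0 - width) / (1.0 - width).sum()
weighted = np.dot(weights, best)

print(f"unweighted mean:         {unweighted:.3f}")
print(f"interval-width weighted: {weighted:.3f}")
```

The same template works for any of the other proxies — quiz performance, engagement, or reasoning counts — by swapping in a different weight vector.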
So now I want to talk through, just quickly, some of our pilot and preliminary accuracy results — how well the elicitation model we're using is performing. The first results are from a pilot study: we took 25 studies — not from the SCORE program corpus, but 25 studies that had already been replicated in previous replication projects — and asked five of our participant groups to go through the regular repliCATS protocol to assess them. Our classification accuracy for these studies, using the simple unweighted mean, was 84%. By classification accuracy I mean: if participants said the replication study would be statistically significant in the same direction as the original study, and that turned out to be the case, that counts as an accurate classification. This was just a small pilot, and the results are up on MetaArXiv.

Turning now to the preliminary accuracy from our phase one SCORE results. What we see here is all of our 22 aggregation methods along the x-axis, and on the y-axis an accuracy measure called AUC — area under the curve — which is the primary accuracy measure being used in the SCORE program. There's quite a lot of variation between the 22 aggregation models. Some of the things we thought would do really well, like openness to changing your mind — how much people shift between round one and round two — turn out to be the worst possible aggregation models in this set. Our best models were the ones that relied on our reasoning analysis; I'll say more about how we did that analysis in a moment, but basically these two weighted participants' judgments by how many independent reasons and justifications they gave for their quantitative judgments. In that case, our AUC was 0.78 and classification accuracy was around 74%. Another accuracy measure we can look at is the Brier score, which is a proper scoring rule, and our Brier score here falls below 0.2.

This is another view of the same results. In this table we're comparing our predictions against 50 replication outcomes — the first 50 replication outcomes provided to us — and we were looking for a target AUC of 0.7. Here are our 22 aggregation methods again, and you can see that the first 18 of them meet that target of 0.7, and then there are a few that don't. Same as in the figure before, it's the ones that weight by reasoning that sit at the top. In the next table, we add some different kinds of replication studies: not just direct replications, but also some data-analytic replications — studies that use pre-existing data sets to address the same question — and some computational reproductions. When we expand the definition of replication to include those things, our accuracy drops off pretty dramatically, as does the accuracy of most other methods, I believe. But what we do see is that the models that weight by reasoning stay at the top of the list.
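For reference, before turning to the reasoning models, here is a minimal sketch of how the accuracy measures just mentioned — AUC, Brier score, and threshold classification accuracy — are typically computed, assuming Python with scikit-learn and entirely made-up outcomes and predictions.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss

# Hypothetical data: 1 = replication succeeded, 0 = failed,
# alongside aggregated predicted probabilities of success.
outcomes = np.array([1, 0, 1, 1, 0, 1, 0, 1])
predicted = np.array([0.80, 0.35, 0.65, 0.40, 0.60, 0.90, 0.20, 0.70])

# AUC: probability a randomly chosen success is ranked above a failure.
auc = roc_auc_score(outcomes, predicted)

# Brier score: mean squared error of the probabilities (lower is better).
brier = brier_score_loss(outcomes, predicted)

# Classification accuracy at the 0.5 threshold, as in the pilot study.
accuracy = np.mean((predicted > 0.5) == outcomes.astype(bool))

print(f"AUC: {auc:.2f}, Brier: {brier:.3f}, accuracy: {accuracy:.0%}")
```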
So now I want to talk a little about what these reasoning models are. To create these models, we take the open-ended comments — the justifications our participants provide on the platform or in their discussions — and qualitatively code them. Here's an example of a participant's comment and the reasoning codes we've created for it. Each of these boxes counts as an independent reason that this participant has given for their quantitative judgment about likely replicability. Sometimes they talk about other credibility aspects, not just replicability — they'll refer to generalizability or inference validity and so on — and these also get coded as independent justifications. In the end we can build up a reasoning code book — this is an early version of ours — which is quite a comprehensive list of the reasons participants give for these judgments. And we can use the number of these different categories of judgments to weight our aggregations.

So that's a summary of what we've done in phase one. We're now in phase two of the SCORE program, where we're moving beyond forecasting replicability and looking at other credibility dimensions. In this phase we're evaluating a suite of credibility signals, including transparency, robustness, validity, and generalizability, and gathering overall credibility ratings from participants about the paper as a whole. There's another difference in what we're doing in phase two as well: we're not just assessing replicability, we're looking at this suite of credibility signals, and we're also looking at an entire research paper more holistically — the whole network of claims within a paper — rather than having people assess the replicability of a single claim.

I just want to take this last minute to talk about how we might use a structured elicitation protocol like this in peer review. Peer review is essentially a group elicitation and decision exercise. At the moment, our traditional peer review protocols elicit unstructured narratives that are then subjectively aggregated by a single editor, often without transparency or oversight. There are some existing models of interactive review — we see these at Frontiers journals, and at other journals including eLife — but many of these interactive models, not all of them, rely on some kind of consensus-formation technique, and often this is consensus without a defined endpoint, which essentially means it's consensus by fatigue: whichever reviewer gives up first, the other one wins. With our protocol, we think we've overcome many of these obstacles. We have an inherently collaborative and intrinsically rewarding protocol — we often talk about external rewards for peer review, but this is a process that provides intrinsic rewards as well. It has inbuilt training and calibration for reviewers; it's not driven by behavioral consensus; it has a defined endpoint; we get out of it quantitative judgments and qualitative reasoning; it encourages reviewers to interrogate each other; and it offers transparency around the aggregation process. You can access a demo of our platform, if you're interested, on the resources page of our repliCATS site at the University of Melbourne. Okay, thanks very much.
Wonderful, thanks Fiona — really exciting to see the progress that's been occurring throughout this entire program we've been working on together. Alright, we're going to go to our next speaker to keep us moving along. Swinging the time zone entirely in the other direction — it's very late at night, so thank you — we have Anna Dreber, who is a professor of economics at the Stockholm School of Economics. Go right ahead.

Great, thanks so much. I'm super happy to be here — thanks to the organizers, and super interesting talks so far; I'm looking forward to yours, Sarah. I'm going to talk about work that is joint with lots of people. Here's the list of the co-authors who have been involved in the projects I'll talk about — they are many, and they've played super important roles. If there are a couple I can highlight, it would be Thomas Pfeiffer, who's a computational biologist at Massey University in New Zealand — it's early Friday morning for him and I think he's here — and Magnus Johannesson, who's an economist here in Stockholm, and I'm guessing he's sleeping at this hour.

We've been doing these projects looking at prediction markets in science: whether we can use prediction markets to predict replication outcomes in particular, but increasingly also novel hypotheses. So you might be wondering: what are prediction markets? Prediction markets are basically tools to aggregate information. If you want to understand what beliefs people have about something, prediction markets can be one such tool. They've been used quite extensively to predict outcomes related to politics, entertainment, sports, and other types of events. Here's an example from a political prediction market. In this particular market you could bet on who would win the 2008 US presidential election; the two main candidates are Obama and McCain. A contract on Obama is worth $1 if he becomes president and $0 if he does not. The price of this type of contract can be interpreted as the probability the market assigns to the event — the probability the market thinks Obama will win the election. If you think the probability is higher than this price — higher than 91.5% — you should buy this contract, go long in it. And if you think the probability is lower, you should go short: short sell it, or buy the opponent's contract.

The potentially good thing about prediction markets is that you're not asking people "who would you vote for?" or "who do you want to win?" but "who do you think will win?" So potentially you can aggregate a lot more information, because I'm not betting according to who I want to win but according to what I believe — and that's a function of what I hear, read, and see. So this is potentially a way to aggregate much more information than, say, an election poll. There are many studies comparing prediction markets to election polls, and the markets typically perform as well as or better than the polls — but not always. We're not interested in election polls, though; we're interested in using these types of markets to predict replication outcomes. So in these studies we basically linked prediction markets to replication projects, and I'll go into the details and tell you which ones in particular.
The main thing we're asking people to predict is a simple binary event: will the replication find an effect in the same direction as the original study that is statistically significant in a two-sided test — yes or no? We open these markets for 10 days to two weeks and invite people to participate. Participants are typically researchers — maybe some of you have participated, in fact. In these markets we've had between 30 and 200 participants each time. When you enter a market, let's say there are 20 or 25 studies you can bet on: you bet on whether a study will replicate, yes or no. You don't have to bet on all of them — you can self-select into the studies you're interested in. So 30 to 200 participants doesn't mean 200 people betting on each particular study; there are studies many people are interested in trying to predict, and studies that fewer people are, for various reasons. We're not asking people to put their own money on the table; we give them 50 to 100 US dollars each that they can then use to trade on these replications. We recruit researchers mainly through email lists, replication projects, and similar channels.

The main replication projects we've collaborated with in various ways are these. First, the big replication project in psychology, the Reproducibility Project: Psychology, where we ran prediction markets for a subset of the 100 replications — so we have replication outcomes, prediction market prices, and surveys (which I'll mention soon) for 41 of the RPP studies. We also added prediction markets to the Many Labs 2 and Many Labs 5 projects, which gives us another 24 plus 20 replication outcomes and predictions. We performed replications and also predicted replication outcomes in experimental economics: 18 studies from two top econ journals over the period 2011 to 2014, where we replicated the studies and tried to predict the replication outcomes. And we did a project, again joint with many people, where we tried to replicate social science studies published in Nature and Science between 2010 and 2015; there we replicated 21 studies and tried to predict the replication outcomes in various ways.

In these markets there is one central hypothesis for each study — the one being replicated, and the one we're trying to predict. Participants trade contracts that, similar to what I described earlier, pay $1 if the study replicates successfully according to the definition, which typically is that we find an effect in the same direction as the original study that is statistically significant. If the study does not replicate according to this definition, the contract pays $0. We interpret the price of this contract as the predicted probability of the outcome occurring — the predicted probability of the study replicating. Some details that may not be super interesting: we use a logarithmic market scoring rule, which means we are the counterparty to trade with — you don't have to find someone else to take the opposite side of the bet. We allow both going long and short selling: if you enter these markets and think the probability that a study will replicate is higher than the current price, you should go long in the contract and buy it; if you think the probability is lower than the current price, you can short sell the contract.
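As a sketch of how such an automated market maker works, here is Hanson's logarithmic market scoring rule in a few lines of Python. The liquidity parameter b and the trades are invented, and this shows the generic LMSR mechanism rather than the exact configuration used in these studies.

```python
import math

def lmsr_cost(q_yes, q_no, b=100.0):
    """Hanson's LMSR cost function; b controls market liquidity."""
    return b * math.log(math.exp(q_yes / b) + math.exp(q_no / b))

def lmsr_price(q_yes, q_no, b=100.0):
    """Instantaneous price of the YES ('will replicate') contract,
    interpretable as the market's probability of replication."""
    e_yes, e_no = math.exp(q_yes / b), math.exp(q_no / b)
    return e_yes / (e_yes + e_no)

# Market opens with no shares outstanding: the price starts at 0.50.
q_yes = q_no = 0.0
print(f"opening price: {lmsr_price(q_yes, q_no):.2f}")

# A trader who believes the study will replicate buys 60 YES shares.
# The market maker (the counterparty) charges the cost difference.
cost = lmsr_cost(q_yes + 60, q_no) - lmsr_cost(q_yes, q_no)
q_yes += 60
print(f"trade cost: ${cost:.2f}, new price: {lmsr_price(q_yes, q_no):.2f}")

# A skeptic effectively shorts by buying 30 NO shares,
# pushing the YES price back down.
cost = lmsr_cost(q_yes, q_no + 30) - lmsr_cost(q_yes, q_no)
q_no += 30
print(f"trade cost: ${cost:.2f}, new price: {lmsr_price(q_yes, q_no):.2f}")
```

The key property is that the market maker always quotes a price, so thin participation never leaves a trader without a counterparty — though, as comes up later with the SCORE markets, it cannot manufacture information that traders don't bring.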
Before participants bet in these markets, we typically give them replication reports — and the more replication projects we've done, the more information participants get. These replication reports detail what the original result was, what we're planning to do in the replication, and how the two differ. That's mainly when we've been doing the replications ourselves; in projects where others have done the replications, we've maybe given slightly less information. We start prices at 50, and prices can then go up toward 100 or down toward zero depending on how people are betting — depending on what people believe about the replication outcomes.

Here's an interface from our experimental economics markets. You have the studies, summarized by author names, journal, and year of publication. You see the current price, which varies between zero and 100: the higher the price, the higher the probability that the study will replicate successfully, according to the market. You can of course hold shares in these studies, and if you click on a particular one, you can see how prices have changed over time and then choose to go long or short in that study. Before participants enter the market, we ask them in a pre-market survey how likely they think it is that each hypothesis will replicate, and we define what we mean by a replication outcome. We also typically ask how well participants know the topic, so we know something about their level of expertise — a variable that, it turns out, typically doesn't help us at all in predicting results.

So what are the results? Here I'm showing you 123 replications for which we have replication outcomes, prediction market prices, and survey beliefs. I'm thinking of the work we're doing with Thomas Pfeiffer, Felix Holzmeister, Michael Gordon, and others, where we keep adding more data to these projects and updating these figures with more information. This is the current situation, and it's summarized in a forthcoming paper by Nosek et al. In the simplest type of analysis, we say that if the prediction market price is above 50, or the survey belief is above 50, the market or the survey thinks the study will replicate. The solid black dots are successful replications, meaning the replication finds an effect in the same direction as the original study that is, typically, significant at p < 0.05 in a two-sided test. Just looking at these figures, you can see there are more solid dots above 50 than below, and more non-solid dots below 50 than above. When we pool these prediction market studies in this type of analysis, we find there is some wisdom of crowds going on, but it's far from perfect. Still, there seems to be something systematic about results that replicate successfully versus those that fail to, and the markets are pretty good — though not perfectly good — at picking this up. In the Nosek et al. paper we also show how prediction markets perform relative to machine learning models trying to predict the same outcomes; these are from three machine learning papers that came out recently.
These results suggest that the machine learning models perform pretty well, but they did not perform better than the prediction markets — they performed equally well or worse. That's likely to change once there is more data, and probably more training data, and so on; I think Sarah can tell us more about that soon. When it comes to the DARPA SCORE markets, I played a super tiny role, so the person to talk to is Thomas Pfeiffer, who is in the audience and can maybe jump in if there are questions. But as Fiona was saying, there were many studies to be predicted here: we set up prediction markets for 3,000 studies, and accuracy is then evaluated on a subset of these, because not all 3,000 studies are actually replicated. What the results suggest so far is that performance is not very good. It seems to be better than random forecasts, but it's far from great, and it's less accurate than in the previous, smaller projects I just showed you. If anything, people are better at forecasting within their own field than in other fields. The biggest problem we experienced with these markets was that there was very little trading. In our previous projects, researchers participated for a limited amount of time, with a limited number of papers to predict; here there were many papers over a long period of time, and very few people stayed on and predicted papers repeatedly. So thin markets were the biggest problem in these studies.

Now we're moving on to other types of forecasting. We're still trying to predict replication outcomes, but also new hypotheses, and the focus in most of our current projects is more on effect sizes — not just the binary "will it replicate, yes or no, according to a specific definition." We're mainly looking at researchers forecasting, similar to the prediction markets I described earlier, as opposed to the DARPA SCORE prediction markets. We typically have small monetary incentives in these studies, but they are small, and there's not much evidence from our studies suggesting they actually matter. But being an economist, and talking to economists, you have to convince economists that what we do makes sense, and then small monetary incentives are better than no incentives, for sure. We also typically give participants in our new forecasting studies consortium co-authorship. When we ask participants to predict outcomes, that typically means they do a lot of work — many, many hours of going through materials, looking at exactly what the research design will be, what the instructions to participants are, and so on — and then try to predict what's going to happen in the studies. So we decided to use consortium co-authorship to reward these forecasters, and I think that makes sense given the amount of work they have to expend — though some people might disagree. In these projects overall, we typically find fairly strong positive correlations between beliefs — forecasts — and what we actually find, for direct replications, conceptual replications, and new hypotheses, though of course not perfect correlations between beliefs and actual results. This is also joint work with Eric Uhlmann, who's at INSEAD in Singapore — no idea whether he's awake at this hour; he should be, but I don't know. Okay, I'll skip going into the details of that, because I want to end by talking about our new project, which is about to be launched.
So perhaps, if you're interested, please participate. When we're thinking about replications — and I'm in the world where we try to do smaller replication projects, not like the RPP but more like the experimental economics replication project or the Nature and Science replication project — there are just so many potential replications to be run. So what should we actually be replicating? Instead of just using some decision rule like a time period or specific journals with some method, we're now thinking about how we can use markets to help us decide which replications to run. After going through all PNAS papers using MTurk experiments during a time period of a few years, we ended up with 41 PNAS papers that we can potentially replicate, in the sense that we have the software, instructions, and so on. We don't want to replicate all of them; we want to replicate 26 of them. Which 26 will that be? That will be a function of market prices. So these will not be prediction markets but decision markets: the final prices will affect what we actually replicate. All 41 studies have a positive probability of being picked for replication, but not all of them will be, and some will get extra weight in the lottery for being picked. Which ones will depend on our decision rule. We could use a decision rule where we're most interested in replications of studies whose market prices are closest to 50, because the information value of replicating those studies is highest. Or we could go for the studies most likely to be false positives according to the markets, or something else. This is very much work in progress, but the markets are hopefully opening on November 1, and we'll start advertising them on October 1. So please participate if you're interested — just send me an email. Or if you want to talk about any of this at some other point, or collaborate, or just say what you're annoyed about, send me an email. Thanks a lot — that's it.
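As a rough illustration of the kind of decision rule Anna describes, here is a minimal Python sketch that turns closing market prices into selection weights favoring studies whose prices are closest to 50, while keeping every study's selection probability positive. The prices, weights, and pool size are invented; the project's actual decision rule was still being finalized.

```python
import numpy as np

rng = np.random.default_rng(seed=1)

# Hypothetical closing prices (predicted replication probabilities,
# on a 0-100 scale) for a pool of candidate studies.
prices = np.array([12, 35, 48, 50, 55, 63, 71, 88, 92, 97])
n_to_replicate = 5

# Information-value rule: weight studies by closeness to 50, where
# the outcome is most uncertain, plus a floor so that every study
# keeps a positive probability of being selected.
base_weight = 0.05
info_value = 1.0 - np.abs(prices - 50) / 50.0
weights = base_weight + info_value
weights /= weights.sum()

chosen = sorted(rng.choice(len(prices), size=n_to_replicate,
                           replace=False, p=weights))
print("selected studies:", chosen, "with prices:", prices[chosen].tolist())
```

A false-positive-hunting rule would simply replace `info_value` with a weight that rises as the price falls.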
Wonderful, thank you so much. Moving along here — and then we'll open up for questions at the end. Our next and last speaker is Sarah Rajtmajer, who's an assistant professor in the College of Information Sciences and Technology at the Pennsylvania State University. Sarah, you should be able to share your screen whenever you want, so go right ahead.

All right, does that look okay? Thanks, everybody, and thank you, Tim — I'm really glad to be part of this discussion. I'll talk about some work that is also part of the DARPA SCORE effort. Fiona really nicely illustrated the three task areas and pinpointed where we sit: the third task area, which is looking to develop machine learning algorithms to score confidence in published claims. Our effort is called synthetic prediction markets: we use bot traders within artificial markets to determine experimental reproducibility. Our team is large — I've listed some of the key personnel on this slide — and we're spread across four universities: Penn State, Texas A&M, Old Dominion, and Rutgers. The team is mostly computer scientists, applied mathematicians, and economists. I also want to note that within the task areas Fiona outlined, we are one of three teams in the third task area: another team is led by Two Six Labs, and another by researchers at USC. Both are taking different approaches, both based on knowledge graphs, but I will focus on our approach — and the groundwork for it was actually nicely laid by Anna's talk.

We were very much motivated by the success of the prediction markets Anna described. They work well — or pretty well — to predict replication outcomes, but they're time- and resource-intensive, and they're subject to some limitations with respect to the scope of information available to the participants and, maybe, some biases the participants could bring. When we proposed this in 2018, in response to the DARPA solicitation, the papers trying machine learning models for this task did not yet exist. But we knew the machine learning model we wanted to create, and that DARPA was asking for, had to have some key characteristics. First, we knew we would not have a lot of training data, so to speak, so we did not want a data-hungry model — which we've sort of accomplished and are sort of still struggling with. Second, we wanted a generalizable machine learning approach: the scientific literature changes quickly, and we wanted to be able to score new claims that might look pretty different from the previous training data. And third, really critically, DARPA was asking for explainable approaches. There's a lot of work in explainable AI, some of it on our team, but we really wanted something that could output human-understandable explanations for the scores we come up with. As this project moves forward, I've realized that's even more important than I thought at the beginning, because when you look at the kind of nuance brought to the table by, for example, the repliCATS surveys, it's not just one score: it's generalizability and robustness and many other nuances that really matter beyond a single score. At this point, our explanations are our best way of trying to give insight into that in a purely machine learning model.

Our approach, again motivated by the success of the human-populated markets, was to develop a fully synthetic prediction market, and this is what we prototyped in phase one. The TA3 teams started a bit after the TA1 and TA2 teams, so we're about two years into the project, and the first year and a half was phase one. This is the pipeline — it's a little complicated. We start with a PDF and extract what we can from it; I would say half of our effort is in feature extraction, and half is in the actual AI. It's still a really important question for us whether we're extracting the right features, and I'll get to that on a later slide. Once we've extracted these features, we provide them as information to our agents — our bot traders. The agents use the information they have to decide whether they'd like to purchase contracts representing a "will reproduce" or "will not reproduce" asset. We iterate the market for some time, and I'll discuss that as well. Theoretically — in a purely theoretical world — the market converges, and at the close of the market we have a spot price for the "will reproduce" asset, which we take to be our confidence score. We also have a few different ways of assessing our confidence in that confidence score.
At the bottom of this schematic are orange arrows, and the orange arrows are the training process. In the beginning we did not have any data from repliCATS or Replication Markets — those were the TA2 teams — so we used known replication projects, like the ones Anna described. We also used other signals that were very imperfect but that we thought might carry some signal for reproducibility: papers that have been retracted, from the Retraction Watch database; papers that had been pre-registered; and a few other things. We used these to train our agents, basically using an evolutionary genetic algorithm in which agents that did poorly were pruned from the system.

Moving now to our hybrid prediction markets — this is something I'm really excited about in our approach. In the winter, we'll have human participants participating alongside our bot traders. I'm animating here with some human beings in the loop so you can see where they will be. There are a lot of open questions here, and I think a lot of interesting opportunities for fundamental and also very practical research. First, we want to see whether including humans in the loop improves the performance of our markets. We want to be able to train our markets with human participants occasionally but deploy them offline: one of DARPA's requirements for the TA3 teams is that we be able to assess a claim within 30 minutes. We can't, of course, actually run these human-participant hybrid markets in 30 minutes, but we pitched the idea that maybe we can run them every couple of months, or every now and then — the appropriate timing we don't actually know — and then deploy our system in a fully synthetic fashion offline. There are also a couple of challenges we hope we can address with this approach. One is a lack of agent participation — which does sound sort of mind-blowing, because you would think we have control of these agents, but the same issues that come up with lack of participation in human prediction markets come up here too, and I'll describe those a little more. We're also hoping this will give us some signal for unusual new claims: when we have a test point that's far from our training points, maybe bringing human experts in where we need them can help us kickstart the system. The larger questions for the AI community are: how can we capture the best of both worlds here? Is there any way our agents could capture some of the human intuition — the wisdom the experts bring to the markets — while retaining the agents' sort of unparalleled view over the whole literature? That could be a really wonderful combination. Also, just to make our lives more interesting, DARPA threw at us a few months ago the idea that maybe we could have a Turing test of sorts, which we found super exciting: ideally, our hybrid prediction markets would give us the opportunity to understand how human-like our agents are. The challenges we face, or need to address, are a few. We need to make sure that when we incorporate humans into our prediction markets, we can disentangle some of the effects that may occur due to the hybrid setting. We're also now determining what we should show the humans. We show our agents the features we've extracted from the papers —
It's not clear that we should show the human participants those features in that form, or just let them read the papers themselves. We're also worried about reintroducing some of the biases that we had designed the synthetic system to skirt, and about retaining the explainability of our assertions. Again, one of our main motivations for this approach was that we thought the trade logs could be wound back to serve as an explanation of the system outcome; when we introduce human participants, we have to think about how we would explain their interactions and their effects on our eventual scores.

The features are really important. Like I said, we have to make sure that what we're extracting from the papers and providing to our bot traders is the right information, and that it's meaningful. In the first phase of our work we started with a lot of features at the paper level: things like author-related features, bibliometric features, co-authorship networks, and so on. There were also statistical features, such as sample sizes and p-values, that we extracted. But the kinds of things that human experts would look for in the text are features that were not really represented in our system. What we're trying to do to sanity-check our features is to ask: if our system is performing well or not well, how much do we attribute that to the features and how much to the market, the AI? What is the source of success and failure? The way we've tried, in principle, to assess that is to have a red team. A sub-team within our team, at Texas A&M, uses the exact same features, and we don't really interface with them; we really wanted to keep them independent. So they share the features, but they assess all of the papers that we assess with our market using a different, baseline machine learning approach. It's a baseline, but it's actually a super sophisticated, interpretable machine learning approach; at the moment they're using something called a neural structured data regressor. We can say that at the moment the red team is outperforming us, as well as, in our latest evaluation, all of the other TA3 teams. So it's a really tough baseline, but very interesting for us, because we can really see whether it's the features or the market that's determining our performance.

We are starting to integrate more features, with a particular focus on claim-level features. You may have noticed when Fiona was discussing the TA2 effort that they're assessing what are called "bushel" claims: rather than looking at just the single primary claim of a research paper, the TA2 teams are now assessing the multiple claims in a single paper. If we, using our approach, are to evaluate multiple claims in a single paper and give them different scores, we really need rich features at the claim level, and not just features that represent the whole paper. So we're working to do that, and we've also started to collaborate with the other two TA3 teams, who have very sophisticated feature extractors for some of the language around the claims themselves.

The market itself is a synthetic market using a simple binary option, will reproduce or won't reproduce according to some established definition of what that means, and using a logarithmic scoring rule.
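A standard way to run a binary market with a logarithmic scoring rule is Hanson's logarithmic market scoring rule (LMSR); whether the team uses exactly this form is an assumption, but the sketch below shows the usual mechanics.

```python
import math

class BinaryLMSR:
    def __init__(self, liquidity=10.0):
        self.b = liquidity   # liquidity parameter: larger b means flatter prices
        self.q_yes = 0.0     # outstanding "will reproduce" shares
        self.q_no = 0.0      # outstanding "will not reproduce" shares

    def cost(self, q_yes, q_no):
        # LMSR cost function; trades are priced by differences in this value.
        return self.b * math.log(math.exp(q_yes / self.b) + math.exp(q_no / self.b))

    def price_yes(self):
        # Spot price of the will-reproduce asset, strictly between 0 and 1;
        # this is the quantity read off at market close as the confidence score.
        e_yes = math.exp(self.q_yes / self.b)
        e_no = math.exp(self.q_no / self.b)
        return e_yes / (e_yes + e_no)

    def buy(self, asset, shares):
        # Charge the buyer the change in the cost function.
        before = self.cost(self.q_yes, self.q_no)
        if asset == "yes":
            self.q_yes += shares
        else:
            self.q_no += shares
        return self.cost(self.q_yes, self.q_no) - before
```

The two complementary prices always sum to one, so the won't-reproduce price is just 1 minus price_yes().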
The agents themselves are located in high-dimensional feature space, and essentially they will buy a will-reproduce or won't-reproduce asset based on the location of the test claim and whether it falls in their region of feature space. At the moment we're using ellipsoids, and we are also looking to expand these regions to convex cones for more generalizability. What you see in the animation is that these regions of feature space where the agents will bid are time-varying and depend on the price: as the red regions grow, the blue regions shrink. Red is "will not reproduce," and as the market price for "will reproduce" goes down, the agents become more likely to buy the asset; the space around them in which they are willing to purchase increases. In a purely theoretical form, and this is the paper linked at the bottom of the slide, we have proven some mathematically interesting things about the market as modeled with an ODE. In practice we make some modifications, so it's not quite as neat and provably clean as the formulation in our paper. For example, we only allow an individual agent to purchase one kind of asset, either will-reproduce or won't-reproduce; they're specialized. And not all agents can bid in every round; they get called up randomly. These are things we're still experimenting with.

When we think about evaluating multiple claims per paper, on top of the question of which claim-specific features to extract, we also have to think about how we would evaluate those claims in the market. We could just take each claim and evaluate it separately, forgetting the dependencies among claims in the same paper, where one claim might depend in some way on another. But we're now looking at combinatorial markets and other ways that we could preserve some of the meaning and the dependencies among the claims in a given paper.

Our bots use an evolutionary algorithm during the training process: agents that perform well are kept and can replicate and evolve; agents that perform poorly are deleted.
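A sketch of the agent geometry described above: each agent sits at a point in feature space and bids only when the test claim falls inside its ellipsoidal region, whose size grows as its preferred asset gets cheaper. The exact price-to-radius coupling here is an assumption made for illustration.

```python
import numpy as np

class EllipsoidAgent:
    def __init__(self, center, shape_matrix, asset, base_radius=1.0):
        self.center = np.asarray(center, dtype=float)
        self.A = np.asarray(shape_matrix, dtype=float)  # positive definite: defines the ellipsoid
        self.asset = asset            # "will_reproduce" or "will_not_reproduce" (specialized)
        self.base_radius = base_radius

    def decide(self, features, price):
        x = np.asarray(features, dtype=float) - self.center
        dist = float(x @ self.A @ x)  # squared Mahalanobis-style distance to the claim
        # Cheaper asset -> larger bidding region. For a will-reproduce specialist
        # the relevant price is the spot price itself; for a will-not-reproduce
        # specialist it is 1 - price.
        p = price if self.asset == "will_reproduce" else 1.0 - price
        radius = self.base_radius * (1.0 - p)
        return self.asset if dist <= radius else None   # None means abstain
```

As the spot price for an agent's asset falls, its bidding region expands and it becomes more likely to buy, matching the growing and shrinking regions in the animation.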
So far, we have different ways of assessing ourselves. We initially assessed our market on the known replication projects, the RPP, the SSRP, the ERP, Many Labs and Many Labs 2, the projects that Anna mentioned: a total of 192 papers. What you'll notice is something very unique to our market, which I could try to argue is a good thing, but which is very nonstandard in machine learning: our algorithm only scored 35% of the papers. Essentially, agents can choose to purchase a will-reproduce or will-not-reproduce asset, but if they don't choose to participate, then we have no information. In that example, when we trained, and I think we split it 80/20 or something like that with five-fold cross-validation, what we found was that on approximately 65% of the papers our agents just didn't bid; they just didn't have enough information. However, on the papers where they did bid, we did really well: binarizing the outcomes, that is, whether the price was over 0.5 (will reproduce) or under 0.5 (will not reproduce), we had 90% accuracy. And we do see a sort of ray of hope. This is actually incorrect on the slide, it was a copy-paste error, but when we evaluate ourselves against the TA2 data, the system scores over 50% of the papers. The "system scored 68 of 192" shown here is just the number from above; I don't have the exact figure, but it's over 50%, so still only about half of the papers. We attribute this to the additional training data: more training data allows more agents to participate actively and our system to assess more test points, and we know that when our agents do participate, we do very well. But the question of how we can meaningfully incentivize agents to participate, or how we can allow our agents to generalize better, is open.

We have a UI, and we're doing some usability testing. The prototype system is essentially a dashboard: you can upload a PDF, our feature-extraction pipeline extracts features, those features get passed to our synthetic market, and the number of agents participating, the distribution of purchased shares, and the price of the will-reproduce asset over the iterations of the market are all displayed.

With respect to explainability, we do this at four levels. The first is just the value of the asset, but we also aggregate information about how many agents participated and where they were in feature space. As you might have noticed from the animation on the previous slide, agents are all located in different spots in feature space, so they care about different things. What we do is output, for the agents that did participate in the market, which features they cared about, how many shares they bought, and at what price. That is our way of explanation at this moment. We have been working with RAND and MITRE, as well as some DOD stakeholders, to see how useful the system is and what stakeholders really want. Early takeaways are that the explanations, the features we can point to behind the agent participation, are more useful than the scores themselves. I think that makes a lot of sense at this point, and so it remains a focus for us to better understand our explainability. As we move forward, we are also receiving some explanations from the repliCATS team, and what we would like to do is compare the kinds of explanations our system can output from the market to the kinds of explanations we get from TA2. I'm going to stop there. I have a couple of papers linked, and I want to thank you so much.
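The coverage-plus-accuracy bookkeeping described in that evaluation might be computed as in this small sketch; the data layout and example numbers are illustrative assumptions, not the project's figures.

```python
def evaluate(results):
    """results: list of (closing_price_or_None, replicated_bool) pairs."""
    scored = [(p, y) for p, y in results if p is not None]        # markets that bid
    coverage = len(scored) / len(results) if results else 0.0
    correct = sum((p > 0.5) == y for p, y in scored)              # binarize at 0.5
    accuracy = correct / len(scored) if scored else float("nan")
    return coverage, accuracy

# Example: the market abstained on two of five papers.
print(evaluate([(0.8, True), (None, False), (0.3, False), (None, True), (0.6, True)]))
# -> (0.6, 1.0): 60% coverage, 100% accuracy on the scored subset
```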
Wonderful, thank you, Sarah. And I'm just recognizing the time here; there are a couple of questions I've been itching to ask, and I know there's been a lot in the chat, which is great. We have about seven minutes left, so let me ask one last question before we go, and it goes a little bit farther out; I'm going to use Fiona's phrase, "blue sky." There are these amazing platforms each of you is developing, and ways you've been pushing the boundaries of what we know. I'm curious, in just a minute or a minute and a half each, which is about all the time we have: where do you see forecasting and these prediction markets, synthetic or human, landing and influencing the way we do science today and the way we consume it? Be as broad as you want; science is a very agnostic tool that everyone uses. So if you're okay with that super broad question, maybe we'll follow the same order, so I'm going to put you on the spot.

That's okay. So I guess I would say two quick things, and this is having, you know, not thought about it for very long. First, in terms of changing the way we do science, I would say a big thing is how this is going to be used in papers: how are editors going to think about it, and how are people going to refer to forecasts when they're reading a paper to try to figure out what the starting point is. The second thing I would say, apart from how it changes the interpretation of results: in the long, long run, and we're nowhere near that right now, I do hope that people start to look at forecasts more in the absence of proper data from, you know, RCTs or whatnot. I do think there are loads of situations where we don't have all of the information we would want to make a decision and we have very limited evidence, and it would be a shame to put absolutely zero weight on forecasts. So I hope that we learn enough about when people are getting it right to be able to start to use forecasts as a non-zero source of information. Yeah, it's like a tool, another thing in the toolbox that we should be using.

Fiona, do you want to give us your thoughts on this? Yeah, I mean, I feel like I sort of did already, talking about how something like the repliCATS protocol might be integrated into peer review, and we're actually starting our first kind of proof-of-concept trial of that with a journal. That's where we're going to attempt to use this in the wild, you know, for reviews in real time on real papers. So we'll see how that goes; I can perhaps report back on that next year or something. But aside from that, one of the things that I'm really excited about is actually something that Eva mentioned in her talk, a feature of the social science prediction platform she's been talking about: using these kinds of predictions to form judgments of prior probabilities, so that they become incorporated as a natural part of how we practice science and incorporated in the models that we produce.

Excellent, thank you. Do you want to take a shot? Yeah, so, a bit inspired by what Eva and Fiona have already said, and what Sarah said earlier: I guess you want to figure out how we can get as good predictions as possible, be it with surveys or markets or the repliCATS format or artificial markets, whatever it is, and then I think we should add them to the peer review process, similar to what Fiona is saying. I mean, I've been asking a few editors about having a prediction market as the fourth reviewer, because I think the peer review process is a small-n problem in a Tetlock sense, but no luck so far; maybe one day, who knows. And then, similar to what I was saying about decision markets as a tool to choose what to do: what should we replicate, what studies should we be running? If there is disagreement, can we use markets to help us decide where to go, something along those lines.

Excellent. Yeah, Sarah, I'm curious to hear your thoughts on this, especially since it's a broader tool that could be deployed quite fast, a lot faster than the human approaches.
I think there's a lot of promise, and obviously I'm someone who loves machine learning and algorithms and all of it, but I actually do worry that, taken out of context, we just don't have enough signal. None of these approaches are doing this really well right now, honestly, and I think the problem is so nuanced, so difficult, that whatever we output as a single score probably doesn't tell the whole story. I worry about outliers, and I worry about what even a well-performing algorithm leaves out: I'm so proud of our 90% on our little subset, but that still leaves 10% behind. This is research and these are researchers, so we want to be sure that what our systems output is equitable. So I think having it be one piece of information among others that we can use is critical.

Yeah, I think all of you are saying kind of the same thing: there's this really exciting potential, so start to get it into the wild, continue to test it, and see where the boundaries are. There were a lot of questions being asked in the Q&A, and I wish we had more time. I think this is a really exciting topic, and it'll be really exciting to see, next year and the year after, where all of this is going, because it sounds like there's some really exciting work that you're all doing. Since we're out of time, I'll finish by saying thank you again: thank you for your time and your insight, and thank you to the audience for the questions and engagement. Thanks, team.