Okay, I think we are live, so to speak. So welcome, everybody, to this discussion panel, Many Perspectives on Many-Analyst Studies. My name is Nate Breznau. I am a postdoctoral researcher in social science at the University of Bremen. I'm joined by Balazs, Rotem, Wilson and Eric. I'll let each of them introduce themselves the first time they speak, so they can say all the nice things about themselves. And I just want to give a quick introduction for any of you sitting in the audience who are not aware of what this many-analyst studies thing is. Basically put, it's a type of study where there are one or a couple of specific hypotheses being tested, usually with the same data set, and the data are given to many researchers, often referred to as many analysts, to see how each of them tests those hypotheses. There are a lot of variations on these studies that we're going to talk about, in terms of ranking other participants' models, prediction markets, and several other things. As of right now, and we'll share a link to this, Balazs has actually developed a list of all the known many-analyst studies, and you can also add to it if you like. There are not so many; there are maybe 10 or 12 out there, and we're not 100% sure we've got them all. But in the past five years especially, this has become a thing, if you will. And in many ways, if I'm not mistaken, it was really kicked off by this famous soccer study, or football in probably a majority of the world, which tested whether referees were biased in their giving of cards to darker-skinned players. Eric was actually part of that study, so I think a nice way to kick this off is to have Eric tell us a bit about how that study came about and what it was like. So if you'd like to kick us off, Eric.

Absolutely, my pleasure. Thanks, everyone, for being here, and welcome from wherever you are. I'm Eric Uhlmann. I'm an associate professor of organizational behavior here at INSEAD, and I was indeed one of the many, many authors and collaborators on the referee red cards paper. I should mention there are precursors, in the crowdsourcing literature, and our application to many analysts was a bit of a twist on past work. I also really want to credit Raphael Silberzahn, then a PhD student, who really drove the project forward, and Dan Martin and Brian Nosek, the other members of the core team, and then also, of course, our many, many collaborators, without whom none of it would have been possible. But the story of the referee red cards paper actually starts with an earlier small-team paper. This is an article that Raphael and I did when he was a student and I was a professor, where, at least we thought, we had found in a large sample from Germany that individuals who had a noble-sounding name, so in German their name means something like Duke or King or implies aristocracy, were more likely to hold managerial positions. We thought this was at least some evidence of some sort of status halo effect, for example, and we published it as a small-team paper. And then our data were requested by Uri Simonsohn, of course a very well-known meta-scientist and scientist.
And Uri did a reanalysis of our data. He essentially had a better way of collecting the data, collected much more data, and also had a better specification, a different analysis that, we ultimately agreed, was also better. And he found nothing, zilch; in fact, if anything, the effect was slightly in the other direction. So we ended up writing a collaborative commentary with Simonsohn, where we essentially replied to our own paper. That's quite an unusual experience, I think, but it also led to some quite good things. It led us to wonder what would happen if we took another data set we were planning on using for a paper and gave it to somebody else to analyze, or maybe many someone elses to analyze. So we ended up distributing this referee red card data set, which we had been planning to use for a project, to 29 teams of analysts around the world, who, at least initially, independently analyzed it without knowing what each other was doing. Later they were allowed to talk, but at the beginning they weren't. And then they turned in their specifications and results. And the finding was that different analysts, in good faith, without much of an incentive to find anything one way or the other, chose different specifications and got really quite different results. So this then became a bit of a line of research for us, and it's exciting to see other folks interested enough in this to go through the effort, both the folks here on the panel and those here in spirit, collaborators around the world, picking this up and looking at whether this is itself a robust phenomenon: the tendency for many analysts to find many different results.

Yeah, thanks, Eric. It is amazing what you and your colleagues have started; opened Pandora's box, you might say. And I would be interested to hear from Balazs at this point, who is doing some current work but has also published work on the idea of many-analyst studies and conducting them. So Balazs, would you like to come in?

Yes. Hello, everyone. I'm Balazs Aczel from Budapest, Hungary, and we do meta-research, metascience research. In fact, we just changed the name of the lab to the Metascience Lab last week, because it was called the Decision Lab until now; we realized we had moved into metascience so much that we had to rename ourselves. So our story is that, with E.J. Wagenmakers and others, we were developing guidance on transparent reporting, and at one point we got to the multiverse and multi-analyst approaches. And while I had some thoughts on how I would give advice on running a many-analyst project, for the multiverse I couldn't see how one can explain how to set up the reasonable or sensible options for each step. I assume the listeners are familiar with the multiverse; in short, one researcher takes not just one path in the analysis, but at each step of the analysis takes all the reasonable options. This can lead to a huge number of results: if you think there are five steps in the analysis and each time you have five options to choose from, there are over 3,000 combinations, so over 3,000 p-values at the end if you take the p-value. And it's pretty hard, I think, to evaluate the results in this way.
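A minimal sketch of the combinatorics Balazs describes: five analysis steps with five defensible options each multiply out to more than 3,000 distinct paths. The step and option names below are illustrative assumptions, not taken from any particular project.

```python
# Combinatorial explosion of a multiverse: 5 steps x 5 options each.
# All step and option names are hypothetical placeholders.
from itertools import product

analysis_steps = {
    "outlier_rule":   ["none", "sd2", "sd3", "iqr", "winsorize"],
    "transformation": ["raw", "log", "sqrt", "rank", "zscore"],
    "covariate_set":  ["none", "age", "age_gender", "minimal", "full"],
    "model":          ["ols", "logit", "mixed", "robust", "bayes"],
    "exclusions":     ["all", "complete_cases", "attention_check",
                       "native_speakers", "first_wave"],
}

specifications = list(product(*analysis_steps.values()))
print(len(specifications))  # 5**5 = 3125 analysis paths, i.e. over
                            # 3,000 p-values from a single data set
```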
And also, how can you argue that these would all be reasonable? So our question here was: what is a reasonable result here? And I think one definition can be that there would be a researcher taking that path, actually a researcher saying this is, for me, the most sensible one. So the many-analyst approach puts more focus on this dimension of the diversity: each analyst independently takes the path that he or she thinks is the most reasonable. And then, when we have the collection of results, these all represent someone's best guess of how to find an answer from that data to that research question. But then I also got fascinated by the history of this many-analyst approach; I don't know if everyone is familiar with it. In fact, it goes back to the 19th century: in 1857, the Royal Asiatic Society wanted to have an ancient Assyrian cuneiform inscription translated. And I think this example shows all the benefits of the many-analyst approach. There was a linguist who had a guess at what it could mean, but if you publish that, how can you tell it is the right translation? It could be anything; a non-scholar in that area would have no option except to accept what the expert says. So in fact, they asked four experts to independently translate the same script. Then they opened the envelopes at the same time and compared the four translations, and when they found that all four translations were almost identical, they could say, I would put it, that they had a trustworthy basis: it is highly unlikely that they would arrive at the same translation just by making it up or by making mistakes. So I think that's the whole idea of the many-analyst approach. If the answers converge, then you can get more confidence that what you have is not just an ad hoc interpretation of that data set. And when they diverge, we should be cautious interpreting the conclusions, because they might depend a lot on the identity of the analysts. And that's what I think we currently don't know about social science: how much our results depend on the analyst himself or herself.

OK, those are two really interesting topics that I definitely want to cover, particularly the multiverse perspective, but I think we'll pick that one up a little later. On this latter point, maybe I can go over to Rotem: in your study, did you expect convergence? Did you expect to find what Balazs is talking about? Because this is not what the outcomes of most of our studies have pointed to.

Yeah, so I'm Rotem Botvinik-Nezer. I'm a postdoc at Dartmouth College, in neuroscience, and I was part of the organizing team of NARPS, the Neuroimaging Analysis Replication and Prediction Study. What we did was collect a new fMRI data set with 109 participants and distribute it to 70 analysis teams, and we got their results for nine different hypotheses. And like Nate already said, we got really diverse answers from this. I think one of our reviewers actually said it best: he said that the results were both troubling and not surprising at the same time. So yes, I kind of expected it. And I couldn't bias the results, because I wasn't part of the analysis; the organizing team wasn't part of the analysis.
Well, I think just because we already know about a lot of things, like peeking, and because there are so many analytical choices you can make, not only in fMRI but in many kinds of complex data with a lot of dimensions, and fMRI is one of them, I think for anyone who's doing it, it's actually kind of obvious that there are a lot of points where you can make different decisions, and it makes sense that it would lead to different outcomes. Also, we saw Eric and colleagues' paper, and we knew that in other fields this is the case. So I wasn't too surprised, but then actually seeing it, just watching 70 teams, all of our teams made different choices. No two teams chose the same pipeline, the exact same path. It was crazy; I didn't expect that. I have to admit there aren't common standards that everyone follows, so I wasn't expecting all the teams to do the same thing, but I was expecting some clusters of teams to do very similar things. So I was surprised by that, and I was surprised by how different the results were, I have to say. And I think we can learn a lot from this. Like Balazs said, it's not the same as a multiverse analysis: a single analyst can just run many pipelines and see what happens, but that's not the same as making sure they are all reasonable and would have been chosen by researchers. And this way we can also see how the field really works, what's happening in the literature, because this is what people are using. I think we really need to find solutions and ways to make results robust to this analytical variability, and the multiverse, as Balazs said, is one of the promising approaches. And with more studies like the ones we did, like everyone here did, and other people, we can learn whether this is a problem in all fields, and where the problem is: whether it is in the research question, or in the methods, or in the nature of the study. And then we can try to find solutions; if we want to do a multiverse analysis, we need to point people to which pipelines they need to run, because it's not reasonable to just run the million different things you could do, and some of them don't make sense. So I think we can really use these kinds of studies to learn about the problem and point us to the solutions, which I hope we can start directing ourselves to, because, I don't know about all of you, but when I read a paper, I keep thinking, OK, but what if they had done something else? Would the results be the same? And I think it's very important to know that. So it was a great experience, I learned a lot from it, and I hope the neuroimaging field did too. And going back to your question, yeah, I was surprised and not surprised at the same time, actually.

Yeah. And maybe I'll send the next question over to Wilson, and it's actually somewhat the same question. Balazs said something which I believed going into the study we did, which was that we're going to find certain areas of agreement that will really tell us something about the hypothesis we're testing. Didn't happen at all. If we look across all the studies in the list in the Google Drive document, this also basically didn't happen. So then it starts to feel like, OK, it's not a huge N, right? This is maybe 10 or 12 studies that have done this.
But they are all somewhat similar in the sense of results going in every direction. So what can we really learn? Or are we yet to learn what we're really hoping to learn, that is, substantively?

Well, thanks, Nate, first for gathering us all. The first mover is always the most difficult one, and you pulled us all together individually; I think we're very grateful for that. I'm Wilson, I'm a researcher at NCR, and my research dives into the areas of negotiation, culture, morality, and, of course, metascience. To kick-start, I'll say that this field of many analysts is growing, and a good thing to note is that metascience and many-analyst papers are gaining more traction. There was a paper by Martin Schweinsberg, which includes Eric Uhlmann and myself; it was recently published at OBHDP and has been on their most-downloaded page. So this also shows that this field is gaining traction and this type of methodology and science is growing as the years go by, and it's something I would urge everyone to be involved in at some point in their career. In terms of the learning points, there are so many. But one of the key things is that we spoke a lot about the multiverse, and we should only do it if it's viable, because not everyone wants to run, let's say, a very big project, and we all have institutional rules and constraints that we have to abide by. So another thing you can do is pre-register certain robustness checks and, of course, run many sensitivity analyses that can help make our science better. We can seek opinions from expert leaders, statisticians, and our fellow peers on our statistics, research designs, and hypotheses, and reserve high confidence for consistent inferences. We can focus more on empirical manipulations in our analyses, as opposed to latent constructs that are sometimes not observable in our research, and we should also communicate a certain level of uncertainty, for example as a probability distribution. Additionally, there was a paper that we had at Behavioral and Brain Sciences; it's a commentary on Yarkoni's generalizability crisis. There is a high number of researcher degrees of freedom and, of course, a substantial number of ways a researcher can choose to analyze their data set. Strategies to draw inferences from a heterogeneous set of approaches include aggregation and parsing, which is something we want to talk about. In terms of aggregation, we are focusing on meta-analyzing the effect sizes obtained by different investigators: all of us have many analysts who analyzed the data, and if we look at them together, we can aggregate them, and of course parse them. Parsing involves a perspectivist approach, attempting to identify theoretically meaningful moderators that explain the variability, in the hope of creating better science. And that's somewhat like what the Auspurg and Brüderl paper has done with the original soccer study, right, Eric? I don't know if you want to talk about that a little bit, or about parsing.

Yeah, absolutely. So as Wilson notes, on the one hand you could say, well, let's just average across everybody's approach. We've done that in crowd projects, both with analyses, that is, specifications on a complex data set, and with experimental data, where different people design different experiments, all testing the same hypothesis.
One thing you could do is aggregate across them and take that as perhaps our most reliable estimate. But, as Wilson notes, you could also, in the perspectivist spirit of the great Bill McGuire, say, well, maybe we can find meaningful moderators that explain why some analyses or some designs or some approaches get a certain estimate and others get different ones. As Nate notes, that's been a bit frustrating so far: even if you gather a lot of potential explanatory variables and try to capture the dispersion in estimates across different approaches, the yield is generally small, usually a small minority of the variance. But I don't think that means we should give up, and there are all kinds of things we could use. It could be data preprocessing steps for some kinds of complex data. It could be the statistical approach. It could be covariates; we found some evidence in Silberzahn et al. that covariates matter. With Martin Schweinsberg and colleagues, Wilson among them, as Wilson mentioned, we have a paper finding that operationalizations of variables matter: how you choose to operationalize your IV and your DV does explain a significant amount of the variance. And another really cool one, potentially, or I think very likely, is what Auspurg and Brüderl point out: it might also be the way in which people are thinking about the research question. Are you asking whether there is a direct causal effect of player skin tone on referee red card decisions? Or are you trying to say that player skin tone explains unique variance over and above all the other variables? There are different ways in which you could be thinking about this.

So you can even have a branching of the garden of forking paths. There's a motivating issue: is there racial bias in society? Something very general. Then there's a conceptual research question: is there a relationship between player skin tone and referee decisions, in particular red cards? And from there, you can have a more empirical research question, something more specific, like: is there evidence of a direct causal effect of skin tone on red cards? And then you have the different variable operationalizations: how do you measure it, do you use yellow cards or red cards? Or how do you operationalize status, as in Schweinsberg et al.? Then you have to choose covariates; there are so many different kinds of choices. But maybe if we map out that full model of all the choice points and put them together, we can try to identify these meaningful moderators in the perspectivist spirit. And that would be more the parsing approach: try to find meaningful moderators rather than just aggregate across everything. But I think that's a really exciting future direction for this line of research.
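A rough sketch of the two strategies being contrasted here, aggregation versus parsing, using made-up team estimates. The numbers, the single coded moderator, and the fixed-effect pooling are illustrative assumptions, not the analysis from any of the papers discussed.

```python
# Aggregation vs. parsing over hypothetical per-team estimates.
import numpy as np
import statsmodels.api as sm

# Hypothetical effect estimates, standard errors, and one coded analytic
# choice per team (did the team adjust for covariates?). Purely invented.
est = np.array([0.31, 0.05, 0.22, -0.04, 0.40, 0.12, 0.27, 0.02])
se  = np.array([0.10, 0.12, 0.09,  0.11, 0.15, 0.08, 0.10, 0.13])
covariates_adjusted = np.array([1, 0, 1, 0, 1, 0, 1, 0])

# Aggregation: fixed-effect, inverse-variance weighted average of all teams.
w = 1 / se**2
pooled = np.sum(w * est) / np.sum(w)
pooled_se = np.sqrt(1 / np.sum(w))
print(f"pooled estimate: {pooled:.3f} (SE {pooled_se:.3f})")

# Parsing: a weighted meta-regression asking whether the coded choice
# (a candidate moderator) accounts for part of the between-team spread.
X = sm.add_constant(covariates_adjusted)
fit = sm.WLS(est, X, weights=w).fit()
print(fit.params)  # intercept + shift associated with covariate adjustment
```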
And so, maybe for any of the other panelists, the question that arises, at least for me, is: do we need the many analysts to tell us what the analysts' decisions might be, so that we could then simulate them in the multiverse? I mean, it's obviously a lot less resource intensive to just go to the multiverse in the first place. Maybe one of the other three of you has something to comment; Balazs, maybe?

Yes. So I think this raises an important question: do we really need many-analyst projects, and should all empirical projects have independent analysts? You also mentioned that the published multi-analyst projects don't show converging results. I think that right now the big question in this area of metascience is whether that is a general trend, or whether maybe those projects picked the kind of data sets that would demonstrate that analysts can diverge. So we don't know how much it represents the field, how much it is a general phenomenon in the social sciences. Right now, with the Center for Open Science, we are organizing a quite big project where we will have multi-analyst analyses of the data of one hundred already-published papers from all areas of the social sciences, with hundreds of analysts. So that's quite big, but basically you need this kind of information to have some intuition about the field. And what if we find divergent results? What would that show? I think we don't know that yet. But if we do, then I think we are in a very serious situation, because so far in the scientific reform movement, I think we believe we have identified certain issues, such as fraud, biases, questionable practices and so on, and the hidden promise is that if we get rid of them, then we can trust the results. But here comes the big issue that the many-analyst projects point to, what we call analytical sensitivity, or the robustness question: even if a researcher takes all the steps carefully, has a well-powered sample for that design, uses the most advanced statistical approach, does everything transparently and openly and pre-registered, and the peer reviewer says, yes, all the steps were legitimate, and so it gets published, still the conclusion can be quite arbitrary, because another analyst can take another legitimate path in what we call the analytical space and come to a different conclusion. Therefore, if it is a general phenomenon, we have to question the whole published empirical literature: even if papers give the impression of legitimate results and legitimate conclusions, those conclusions might have only a loose connection to the data that were gathered. It all depends on how much they are contingent on the person of the analyst. And the other issue that comes up is, as you say, why not just do a multiverse and let the analyst take all these steps? But, as the saying goes, most studies have the data analyzed by the most biased person, the author, who has investments in that theory or hypothesis and has a certain idea of what she wants to see. Therefore, by inviting independent analysts, I think we invite people who are less involved in that question and less biased; even if it's not a conscious bias, sometimes just being in the field, or having papers published with certain results, pushes you towards certain steps.
So I can't see how you can bypass this kind of bias by letting the original researcher take all the options; as a reader, you can still have doubts about whether these were somehow led by the analyst's interests. So whether we are dealing with a huge issue is still a question, but if it is, then we will have to have some general approach to deal with it. Otherwise, we are lying to ourselves that we believe in published conclusions when they might be extremely contingent on the concrete analytical path that the analyst took. And to your first question, which I didn't answer: yes, we developed guidance that tries to simplify the task for anyone who wants to run a multiverse analysis. I hope it will soon be published and made available; we give templates and very practical steps on how to do it. Sometimes just adding one or two other analysts to a project already gives you an idea about the robustness of the results. So we hope that it will become more common, and I think it could even be expected for the kinds of results which have strong or important effects on clinical practice or policy making.

OK, that's great to hear. And if you have a link to a preprint or anything, I'm sure the audience would definitely be interested in that. I want to come to the questions that have been posed in the Q&A, but before I do, I want to check whether Wilson or Rotem wanted to come in on that same topic first.

Yeah, I can. First of all, I agree with almost everything that's been said. And I think you also spoke to one of the first questions in the Q&A, because one thing that is really problematic about analytical variability is that even if you pre-register and you share your data and your code, it's still a problem, because someone still needs to take it and test another pipeline, or do something else, to make sure it's robust. So I think one problem here is that the great new practices that we're trying to implement are just not enough, if this is an actual problem. I believe it is, but we need more papers and projects to show that. Another thing I want to touch on: I'm trying to think about the future and what's practical, and it doesn't seem practical to expect every single study to have multiple analysts. So, trying to think about what should or could happen, maybe within each field we could have multi-analyst projects like the ones we've started to have, to quantify the problem and see whether we have it or not. And if we do, learn about it and try to see which steps differ most between people, or which drive the most variability. And then, based on this, we could start to focus on multiverse analysis or something else that uses this information to actually build new solutions, a new way to do research at the single-lab level. For example, in fMRI, each pipeline takes a lot of time; it's really demanding computationally. So I wouldn't expect every lab to be able to run 1,000 different pipelines. But if we can focus on, say, 10 different pipelines that could work, maybe we could build a tool that makes this efficient, and then people could actually do it in a single lab, at the single-study level.
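A toy sketch of the small, curated multiverse idea described here: a handful of pre-specified pipelines run on the same data, with the spread of estimates reported at the end. The simulated data and the formula variants are assumptions for illustration only, not NARPS pipelines or anyone's actual analysis plan.

```python
# Running a small, pre-agreed set of analysis pipelines on one data set
# and reporting the spread of the focal estimate. Everything is simulated.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
n = 500
df = pd.DataFrame({
    "x": rng.normal(size=n),              # focal predictor
    "age": rng.normal(40, 10, size=n),    # hypothetical covariate
    "group": rng.integers(0, 2, size=n),  # hypothetical covariate
})
df["y"] = 0.2 * df["x"] + 0.01 * df["age"] + rng.normal(size=n)

# A curated set of candidate pipelines agreed on in advance.
pipelines = {
    "bivariate":        "y ~ x",
    "adjust_age":       "y ~ x + age",
    "adjust_age_group": "y ~ x + age + group",
    "interaction":      "y ~ x * group + age",
}

estimates = {name: smf.ols(f, data=df).fit().params["x"]
             for name, f in pipelines.items()}
for name, b in estimates.items():
    print(f"{name:>18}: b_x = {b:.3f}")
spread = max(estimates.values()) - min(estimates.values())
print(f"spread across pipelines: {spread:.3f}")
```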
And one thing I think we didn't touch on about multi-analyst projects, which I really felt in our study, is that once the community, the field, is really engaged and many people are involved, it helps the conclusions actually reach the field, because many people are there, many people are part of it; you know the problem, you feel part of it. And I think it's really important to motivate an entire field to actually change its practices, because doing a multiverse analysis as a standard practice is not so easy, but if you're motivated to do it, it becomes easier. So I really think this was one of the most amazing things I felt in our study: people engaged, put a lot of effort into it, and really cared about it. And I think that really helped it make an impact on the field.

Yeah, thanks. And I'll ask Wilson to comment on that question, but first let me say something before I forget. If any of you out there is thinking about doing a many-analyst study, please talk to somebody who has already done one, or ideally everybody possible who has already done one, because, especially in our study, for example, there were so many mistakes, or just steps where in hindsight we were like, oh my gosh, we really should have done that differently. For example, we had the analysts pre-register their models with us, but the pre-registration was basically: here's a blank sheet, write down what your model is going to be. So there was no standardization, and we were unable to use these, because some did say which estimator they would use and others did not. You see what I mean? There are so many steps in the process that you might not have thought of, and because it's so resource intensive to do these, you really want to have a strong research design. So that's my plug. Wilson, take it away.

Yeah, and to add on to what Nate just said, I think Balazs has this great resource where everyone is a co-contributor on how to design these studies and the problems we faced in the other projects, so it would be a good stepping stone if you want to start your own many-analyst project, or even just to get involved with some of us. I think one of the things Rotem also spoke about, in terms of future directions, is the idea of incentivization. Sometimes not all of us want to run a crowd project; sometimes we want to have small author strings, or it might not be the right time for us. Of course, crowd science is an option, with many authors, to get other people involved, but it might not always be the most viable choice. So you can also think of doing something like a hackathon: instead of having people as co-authors to incentivize them, you can create incentives like on Kaggle, where you have a rich group of people with the expertise, conducting a hackathon and controlling various variables like team composition, time spent, and expertise. These are ways you can make everything more robust for yourself and put additional checks on your science and your research as well.

OK, I think in a way we may have responded to Chet's Q&A question, and thanks, by the way, for putting that really interesting link in the chat box.
And Rami, if I'm saying the name correctly, also posted something which, if I understand the question, is about individuals' own perceptual interpretations or experiences of things, or of reality. At least for me and our study, this was a really challenging point. We had given the analysts the hypothesis that immigration reduces support for social policy, and we gave them the data; there were about six questions from the International Social Survey Programme. And what happened was that some people's perception was that there were actually two different attitudes at play, and those teams of course created two latent attitude constructs and pulled them out of the data. And the criticism we just got in our most recent rejection was that because the researchers themselves had perceived different attitudes at play, they were actually testing different hypotheses. So even though they had the same very general hypothesis and the same starting data, because of their perceptions or their own concepts, they were actually testing different hypotheses, and this sort of undermined the whole thing. And this brings up an interesting question about the link between hypotheses and models: are these analytical freedoms, these garden-of-forking-paths decisions, really researcher degrees of freedom, or are we wandering into different hypotheses, and is that part of what is mucking up the results, so to speak? I don't know if that answers your question directly, Rami; I'm not familiar with many of the terms in it. I don't know if any of the other panelists are. OK, Chet has said his question was answered, yes.

OK, but if there's no direct comment on that, we could come back to the multiverse analysis, because this is a really important point: can we simulate what people would really do? Again, a multiverse is a single workflow, right? It's a single statistical software, it's led by a single researcher or a team of researchers, and it's very contained, very cut and dried, whereas the reality of research is that people may or may not use that type of software, may or may not make these different types of decisions, and have these different aspects. For example, one thing in the first phase of our study was that people using Stata were more likely to reproduce a result from an original study that used Stata, even though they didn't know the original study used Stata. So let's come back to that topic. I could call on one of you, but I bet you all have something to say on it, if anyone wants to jump in. Eric?

Sure, sure. I think it's a great point, and it actually harkens back to some of the comments our other panelists made. As Wilson pointed out, it's not realistic for every paper to be a crowd science project. As organizers of these projects, we couldn't possibly agree more: not every study should be crowdsourced. I think that kind of effort is most justified when there are really high theoretical stakes or really high practical stakes; for something with public policy implications, for example, you might be justified in using a crowd approach.
But I think most investigations will remain small-team efforts, and they should, because that's a more efficient way to introduce initial, sometimes pretty good, evidence for an idea. Only the really high-stakes, unresolved questions are worthy of the crowd approach, I think. But still, some of the same ideas, some of that same spirit, of many analyses if not many analysts, you could see being introduced more and more into small teams, which I think is already happening. It could be a multiverse, where one analyst lays out all the different specifications that she can imagine and crosses those choice points. It could also be a lot of pre-registered robustness checks, not going so far as a multiverse, but pre-registering a bunch of different proper tests of the analysis. I think it's also worth noting that the crowd analysis is both more and less than the multiverse. The multiverse captures all the flexible specifications from the viewpoint of somebody usually high in expertise, whereas the crowd analysis really gets at a different question, which is: what would the analysis look like, and what would the results look like, if another researcher analyzed it? These are different questions, meaning that there might be some specifications that a top expert wouldn't have picked for the multiverse, but that a crowd analyst might pick, and that might pass peer review and get into the literature, because not everything in the literature is the most solid, most optimized analysis possible. You have a greater diversity of opinions and a priori beliefs. So there's value to both, I think, but different value, and one of them is certainly much easier to do if you're relatively more resource constrained. In addition, you can also think of blending the two. In the paper that Wilson was on, led by our colleague Martin Schweinsberg, what we did was a multiverse based on the choices of the crowd analysts. First we did the crowd project using a complex data set on scientific debates, where we looked at effects of gender and status on who participates in scientific conversations. Then we followed that with a multiverse based not on all the defensible specifications, but on the choice points that were actually used by the crowd analysts. And that was our approach to the parsing problem: why is it that the crowd analysts are making different choices and getting different results? Well, let's cross all of those choices and try to decompose it as best we can. So I think they're both hugely valuable in different kinds of situations. One's more scalable, but I think they both have value.

Yeah, scalability is, I think, a really important topic now. There's a hand raised, but I'm not sure if the attendees can actually ask questions. OK, we've got you live, Adam, if you want to ask your question. I don't know if the hand raising was intentional. Adam, are you with us?
Sorry, that was intentional, but it was a signal to the host of the session rather than a question, so I will now lower my hand again. Sorry for the interruption.

Oh, no problem. If you come up with a question, feel free to ask it at any point, and anyone else in the audience as well. OK, so we've covered a lot of ground. One other concern that I personally have, which I'll introduce, is that in our study we were able to identify over a hundred specifications that different teams made, and that included things like what software they used, if you can call that a specification. Given that, we would have needed, even without any interactions, probably a thousand teams to reliably recover which specifications led to which outcomes, to not just have empty cells everywhere. So essentially this is highly problematic, right? We don't actually have enough teams. This is also where the multiverse could come in to try to fill some of the ground, but then again, we have the issues which have just been discussed. And I know that fMRI data, if you want to get nitty-gritty enough, can be just as complex; again, I'm not the expert, but you could have so many features in the data that the specifications just blow up exponentially. So then the question is again about scalability. I don't know if any of the panelists want to come in on that topic specifically.

Yes, if I may. I think what is scary is the complexity of reality and the complexity of these questions that we're dealing with: they were complex enough before we looked at them from a multi-analyst approach. But if you look at them from that perspective and you see how wide the analytical space can be, then it gets quite scary, because, as you mentioned, you would need so many teams to understand one single question. So I think the real question is then what to do. Should we just say, all right, that's too much effort, let's forget about it, have one answer to that question and try one analytical path? Or should we stop and think that maybe we are too ambitious with our designs and research questions, and that we are trying to squeeze a very big question about a very complicated and complex data set into one research study? Nowadays there is a general feeling or attitude that you need a certain sample size to answer a question in one study. Maybe in the future it will be not just the sample size, but also the match between the complexity of the analytical space and the effort that you put in. So if it's a very complex and wide analytical space, then maybe you should go back to the drawing board and redesign the experiment, simplify, or work on the theory and have a more precise description of your concepts, or put more effort into exploring the robustness of the analysis. But I think the worst option is just to say that it would be too much effort, so let's pick one answer and then make ourselves believe that that's the answer to the research question.

I think one thing that what Balazs said brings up is the ground truth issue, because, for example, in many of our many-analyst studies we don't know the ground truth. We just know that people got different results; we don't know which pipelines got the correct ones, because we just don't know what the correct one is. So maybe the multiverse could also be used to try to find the right pipeline.
I don't think there's going to be a single right pipeline, but probably there are better ones for a given situation. Maybe we could use this to try to identify those, and then use them across the field in multiverse analyses for single studies. It's hard, because it depends on the hypotheses and on the data and so many other things, but maybe we could try to limit the space that way, and try to also have a ground truth, because the fact that many people, even experts, did something doesn't necessarily mean it's the right thing to do, or that it answered the right question. So I think this is another aspect we should think about. And it's harder to have a ground truth in multi-analyst projects, because if there is a ground truth, the other researchers could also know about it or realize something about it; if you simulate data, they could figure something out. So maybe this would be easier with a multiverse analysis on simulated data, where we know the ground truth across many different simulated data sets, and that could point us in the right direction as to which pipelines are best or optimal for a specific thing. So I think this is another thing we could try to focus on in the future.

And that is probably a bit outside of our panel, but it really is like writing scientific realism on a baseball bat and smacking us all across the face with it, because, as those using qualitative methods often point out, you can't cross the same river twice, so to speak. I was really surprised to see how, for example, in our study, the conclusions also went in different directions. You would have researchers with roughly the same empirical evidence, but one would say the hypothesis is not testable and the other would say, oh, this is support for the hypothesis. And it really calls into question this ground truth, whether it exists or not not being the point; the point is that it calls into question the subjective side of us in the process of research, and how much this can really blur or mix up the results. Somewhere out there, I think there are certain constructivists cheering for our work, and I don't think that's a bad thing. Yeah, Eric, sorry.

Sure, sure. I had a couple of thoughts there. I'll contradict myself a little bit, but I'll make two points. The first: on the one hand, you could say these many-analyst results are really quite disturbing in that, as Rotem and others have pointed out, even in the absence of p-hacking or any incentive to get a particular result, and even if you were to pre-register, different analysts in good faith will choose different approaches and get different results. That said, though, while we have that wonderful compilation in the link in the chat, we also have a relatively small sample of studies. So we have to be careful of this, and also careful of selection biases: folks might tend to run many-analyst studies on data sets they anticipate being complex enough for it to be worth it.
If so, then we might have over-sampled relatively more complex data sets, although it's also worth noting that there are data sets way more complex than the ones we've used. I'm also super excited to see how the Multi100 project turns out: when you sample more systematically, how big is the dispersion of estimates across different analysts testing the same hypothesis on the same data? It is worth noting that the subsequent papers have found more dispersion of estimates than the original referee red cards paper. So that's the point about saying that maybe there's a lot of dispersion and it could be very problematic. On the other hand, we definitely need to be conservative and do more projects before drawing any strong conclusions. I also hold out hope that the parsing problem can be addressed and we can maybe find meaningful moderators that explain it. And then the key might be, as Auspurg and Brüderl point out, and we're actually currently talking about maybe doing an adversarial collaboration together, that part of the answer is going further up in the tree of decisions. If you go up to the point of how you are interpreting or thinking about the research question, how you are formulating your hypothesis, that might actually be really important. And maybe we pin that down, and then we also pin down meaningful operationalizations of variables. How you define status is not an arbitrary thing: if you use somebody's institutional ranking versus their personal citations, those are meaningfully different; that's not a random choice. Maybe if we get really specific at these different points in the tree, and go all the way up to the theorizing, we can potentially say, yes, there is an answer to a question when it is asked in a very specific way. And if that specificity is drawn from theory, then we potentially do have an answer, even across multiple analysts. So on the one hand, I think it's scary how different the results can be, even without perverse incentives and even with pre-registration. On the other hand, I do hold out hope that, in the perspectivist spirit, we can identify meaningful moderators and find these pockets of coherence that are answers to specific, theoretically meaningful questions.

So maybe to bring Wilson back in, if you like: the idea, if I could summarize it, is almost to crowdsource theory as a first step, right? Would that be a realistic way forward for you? What do you think about that, Wilson?

Well, I think this field is still relatively new. We are still approaching it with different methods, trying to see what the best way is to do things. And when we think about many-analyst projects, we always focus on the statistics. We need to remember that our statistical approaches, whichever they are, are models for our thinking, and they can only tell us so much about the data we have. We should evaluate them with the limitations of our approaches in mind, rather than fully believing that our single way of thinking, or even our single way of doing it, is correct.
And we should also remember that, as we see in a lot of our papers, a lot of the time when we ask analysts why they chose a certain method, they chose it because it's the method they are most comfortable with. It might not be the method that is most correct, it might not be the method that is most applicable, but it's something we feel comfortable doing and that we have done for many years in our respective fields, be it psychology, sociology, or whatever. Besides that, I think approaching it with a theoretical approach, collecting clean data, and good research designs are other ways we can start thinking about it. And of course, get involved in more metascientific and many-analyst projects to pave the way for how you want to design your study and how to approach other studies as well.

Great. Yeah, I think that points us in a nice direction where maybe we could have some final thoughts from each of the panelists, although that was already a nice summary of what we're doing and what we're talking about. If there are any final thoughts from the panelists, you're not required to say anything, but let's start maybe with Balazs.

Yeah, I did want to reflect on what Rotem said about whether there is a right statistical path for one data set and one question. I think that makes us feel very uncomfortable or uneasy, because then the question comes up: who will decide which is the right statistical analysis for a research question? One answer can be that we should just embrace diversity and accept that there is not one right answer to a question; there can be different approaches, and from different perspectives there can be different interpretations. But also, I think most of us would accept that there are certain approaches or analytical procedures which are just mistaken, and not rarely the researchers themselves would admit it if they looked into it, because analyses are complex and one can make mistakes. By the way, the many-analyst approach can also bring such things to the surface: when there is a big discrepancy, maybe one can find the mistake that was made in the code. But at some point people will just never agree on certain approaches. So I think this is something we have to accept in science: there are certain ways that we agree are completely wrong, but there is also a huge gray area where there won't be any authority deciding which approach is the appropriate one. Right now, since there is usually only one analyst who publishes one path, we take it as the right approach, but we have to live with the fact, I think, that there can be different approaches to the same question and the same data set. If they converge, then they can give us confidence up to a level. But this is just not a kind of science, I think, where we will ever be really confident about our conclusions.

Okay, and with five minutes left, Eric, I don't know if you have one or two minutes of comments, and then I'll pass it over to Rotem for a final word.

I think these are all great points. One thought I had was that, rather than asking which analysis is better,
it might be worth thinking about which ways of thinking about the problem are meaningfully different, and whether that means that if folks are essentially asking different questions, or thinking about the question in a different kind of way, it almost is, or kind of is, a different question. I think that could be a really key choice point, as Auspurg and Brüderl point out. So rather than saying one way of thinking about it is better than another, we might just try to capture, again in that perspectivist spirit, what the major ways of thinking are with which people approach a problem, and then what answers follow from that. Is there coherence within those more specific questions, or is there not?

Great. And Rotem, some final thoughts, maybe?

Yeah, I agree with what Eric just said. I think this was a fascinating discussion, and there are many perspectives on this, as we can see. But I think we also converge on some things: we think we need more studies to see how big the problem is and whether there was a bias in the current studies or not, and we need more ways to parse, like we said, to parse the process of analyzing the data and see where the variability comes from and which steps are meaningful, like Eric just said. And there is the tension between what would theoretically be best, because science is so complicated and there are so many things that affect the results, what would be the ideal way to do it, and what would be a practical way to do it. We talked about it a bit, and we really need to find a solution to this, to find something that can actually be done at the single-study level, and see how we can deal with that. And yeah, I'm really excited about the new projects and looking forward to seeing where we go with them.

Great. Well, thanks again, everyone, for coming together, including the audience. It's been a great panel, and I think we're basically right on time. I assume the chat will be accessible along with this recording, because there are a lot of useful links in there, but please correct me if I'm wrong. And thanks to the Metascience conference organizers; it's been a great conference so far. So I think we can stop there. Yeah. Thank you.

Thanks, Nate, for moderating. Thanks a lot. Thank you, everyone. Thanks so much, everyone: analysts, attendees, organizers. Thanks, everybody, whatever time zone you're in. Thanks to all the many collaborators on all these projects, here in spirit. Thank you.