Hi, okay. Welcome everyone. Welcome to this session, which is indeed about p-hacking and a bit more. We have put together a symposium in which we wanted to revisit questionable research practices a bit. By that I mean that a lot of research has already been done on questionable research practices and on p-hacking, and we wanted to give it a bit of a twist. We will be talking about multiverse analysis. We will present SAM, which is a very exciting program, and I don't want to spoil too much. We are going to talk about a theoretical model of p-hacking, and finally I'm going to present some results of a simulation study conducted in SAM. We're going to need all the time we have, so I'm going to start right away. Our first speaker would have been Anton Olsson-Collentine, but unfortunately he could not be here today. Luckily our other speaker, Jelte Wicherts, will take over and present in his place. Without further ado, I present to you Professor Jelte Wicherts. Have fun. Enjoy.

Yes, and I'm live. Thank you, Esther, for introducing us. Unfortunately Anton fell ill, but having been involved as a supervisor in this project, which I like a lot, I think I'm in quite a good position to present it. It's about a multiverse analysis that makes use of pretty cool data from the registered replication reports in psychology, and the whole aim of the project is to get a sense of what p-hacking might look like. The preprint is on PsyArXiv at the link given below, and we'll also tweet about it, so if you ever want to find it, it's not too hard to find.

As you might know, in many studies, if not all, I would contend, the data lend themselves to a great many analyses. Following Simmons and colleagues, we call these researcher degrees of freedom in the analysis. The analysis, one would say, is not carved in stone; actually you could carve quite a few holes in any stone with the analysis. This is called a multiverse analysis, and what we do here is see what it looks like if you analyze the data in many, many ways, and what effect that has on a meta-analytic outcome. So this is the garden of forking paths, Gelman and Loken's idea. It is of course well known by this audience that the many, many different ways in which you can analyze data can result in many, many different outcomes, and that the call for preregistration of the analytic plan is one way to specify one single path down the garden. These registered replication reports in psychology, and in other fields as well I guess, really aim to pick a seminal study. They are registered reports, so they're all assessed prior to the study itself, a very strict protocol is implemented, and the study is run in multiple labs. And this is nice, because now we have an awful lot of data, thanks to the hard efforts of all the contributing labs and the leading groups that set these registered reports up, and the idea is to see what happens.

Now, what Anton did after he got hold of all the data is look at all sorts of additional information in the data files and set up a pretty broad protocol that allows you to make different choices in the analysis. The slide here is not meant as an eye test, but I can tell you that these things relate to a paper we wrote a couple of years ago on the many types of degrees of freedom that you can actually have in these types of studies.
And these types of studies are typically lab studies, relatively short, where the outcome is typically the between-group difference in means, or the standardized mean difference. And the degrees of freedom could be anything. Take something like the ages reported. If you read the literature, which is also how we set up the many different degrees of freedom you could think of, you often see "we discarded participants aged 25 and above." You also sometimes read, and there's some work on this dating back a while, that they excluded outliers on some particular basis, that they excluded those who have a different native language or are non-native speakers, or that they looked at the outcome variable, which is often measured by scales, kicked out some items from the scale, and reran the analysis. You can do that in many, many different ways. And it's a protocol that Anton applied, automated, and used for all these labs within the different replication reports. What you can then do is build a grid; that's basically the idea from the Belgian group, and it's called the multiverse because you combine all the different dimensions. So this is one example: you exclude based on age. Most of the studies are actually student samples, so the cutoff at 25 is not too far-fetched, and if you use that as a threshold and kick out everyone over 25, you already get two options: all cases, or only the younger ones. But you could also use other, completely arbitrary cutoffs. Now, we have a list of very common degrees of freedom, which we contend can be applied pretty successfully across all these lab studies, and a list of more idiosyncratic degrees of freedom that are, one would say, more subjective and apply to the specific study at hand. So sometimes particular tasks were used, and it's also very common in the literature that participants were dropped from the analysis because they didn't do well on particular tasks — again lending itself, you know, to different thresholds or criteria.

So this is one of the labs from RRR4 — it's in the paper which one it is, it's not of interest now, but it's quite representative. These are the different outcomes, the standardized mean differences, that you could obtain when you go through all the possible options to analyze the data, of course assuming that these are all arbitrary choices. You could criticize us, I guess, and say that this or that is not a wise choice, but yes, that's a good scientific debate, I guess. In this case, about 20,000 possible outcomes are depicted here, and they're depicted in a way you would recognize from meta-analysis. You see that some go up and down, reflecting the discarding of cases, while others move horizontally, which might be related to cases where the outcome measure changed. And over the nine different projects — these are different topics of interest — we found between 4,000 and 1.5 million multiverse outcomes for the different studies. So that's quite a lot of options, a lot of ways to find a significant result, if you will. Now, this has been done before in other contexts as well: Patel and colleagues, for instance, call it vibration of effects, and of course the group in Leuven did this too; they came up with the idea. We've now seen about four years of these multiverse analyses in the literature.
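To make the grid idea concrete, here is a minimal sketch — not Anton's actual protocol; the variable names, cutoffs, and data are purely illustrative — of how a handful of arbitrary exclusion choices multiply into a multiverse of standardized mean differences:

```python
# Illustrative multiverse grid: every combination of a few arbitrary analysis
# choices is applied to the same data set, each yielding its own standardized
# mean difference. Column names (age, native, group, score) are hypothetical.
from itertools import product
import numpy as np
import pandas as pd

def cohens_d(a, b):
    """Standardized mean difference with a pooled SD."""
    na, nb = len(a), len(b)
    pooled = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    return (a.mean() - b.mean()) / pooled

def multiverse(df):
    age_cutoffs = [None, 25, 30]      # keep everyone, or drop "older" participants
    native_only = [False, True]       # drop non-native speakers or not
    outlier_z = [None, 3.0, 2.5]      # drop score outliers at |z| > k, or not
    rows = []
    for cutoff, native, z in product(age_cutoffs, native_only, outlier_z):
        d = df.copy()
        if cutoff is not None:
            d = d[d["age"] < cutoff]
        if native:
            d = d[d["native"] == 1]
        if z is not None:
            zs = (d["score"] - d["score"].mean()) / d["score"].std(ddof=1)
            d = d[zs.abs() <= z]
        treat, ctrl = d[d["group"] == 1]["score"], d[d["group"] == 0]["score"]
        rows.append({"age": cutoff, "native": native, "z": z, "d": cohens_d(treat, ctrl)})
    return pd.DataFrame(rows)         # 3 * 2 * 3 = 18 universes from just three dimensions

# Tiny synthetic example: 200 participants, no true effect
rng = np.random.default_rng(0)
df = pd.DataFrame({"age": rng.integers(18, 40, 200),
                   "native": rng.integers(0, 2, 200),
                   "group": rng.integers(0, 2, 200),
                   "score": rng.normal(0, 1, 200)})
print(multiverse(df))
```

With only three dimensions this already yields 18 universes; combining the much longer lists of common and idiosyncratic degrees of freedom in the actual protocol is what pushes the counts into the thousands or millions.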
But that's not yet what is nice about this data. The nice thing about this data is that we actually have multiple labs running the same experiment. So it's also nice to see which of the different degrees of freedom — which of the different choices you could implement — make a difference in how much variation you observe. Now, what we call p-hacking is, I would say, basically just a selection mechanism. What you do as a p-hacker is select one of those outcomes — typically a green one in the paper, maybe two, with a footnote as a sensitivity analysis to shine the results up, or whatever — but no way are you going to report all the results. And in this case, if you look at the whole bunch of blobs here, you wouldn't say it's a big effect. You could then, of course, select among all the significant ones, leading to clear bias. Now, we know that these are registered reports, so we know that one of those dots was chosen in the protocol, so that in the experiment there would be no bias of this sort. But had these studies not been preregistered, we can actually see how much bias they might have caused over the different labs running the same study, had the researchers had all the options to choose from.

So p-hacking would be something like: you pick something, and it could be, you know, completely random. Who knows — you could imagine all sorts of psychological biases playing a role. Hindsight bias is a nice one, the I-knew-it-all-along effect; it's a very potent thing, even for us as a group who preregister our studies. One of the PhD students had a nice project that we had preregistered, and then we found some really cool results, and I was really enthusiastic about it and absolutely certain that this was what we had preregistered. No way. So anyway, hindsight bias is a very strong bias, as are biases like confirmation bias. Intentional p-hacking would be some selection based on the data, which causes bias.

So this is one RRR, and you see the 16 labs here — the same study run in different labs. You can imagine some differences at the end, which is obvious, because the scatters sit higher or lower in the funnel, reflecting smaller or larger sample sizes. And you could imagine that some pick an effect to their liking on the right-hand side, others on the left, giving rise to all sorts of confusion in the literature, had the study not been preregistered. Now, the nice thing about this, as I said, is that it uses a protocol that structures the degrees of freedom — the grid search, if you will. So this is an overview of the different degrees of freedom, the different choices one could make — that's the marginal dimension — and how much variation in the outcomes they actually cause. For instance, in lab seven, deselecting people based on some error percentage on one of the tasks yields a lot of variation in effect size; it makes a big difference. In other labs, like lab nine on the left-hand side, first column, age had a major influence. So age moderated the effect — or at least you would say so naively; it's probably just random sampling anyway, given the inconsistency of all these patterns over all these different labs using the same protocol and running basically the same study.
So there is bias — that's one message — but the bias is also very unstructured and looks like it's going in all sorts of directions, which is quite discomforting for those of us who would one day like to correct for these biases, either in individual studies or in meta-analysis. So it's a mess. Another important thing that we, well, not really uncovered, but that you can recognize here — you could also show it formally — is that if you look at one multiverse, the outcomes have essentially the same sample size. Given the confidence interval width in this funnel, if these were independent draws you would expect the blobs to spread out over most of the 95% confidence interval, and they don't: they're much narrower. This makes perfect sense, because you're using the same data, so these are highly correlated outcomes. So under no true moderation — the effect is not actually moderated, it's just random resampling — and given that there is positive sampling covariance between the outcomes, because the data largely overlap and the outcomes are typically positively correlated, the variance, or the standard deviation, over multiverses will always be smaller than the nominal standard error. That's an important insight. It basically means that p-hacking has less of an effect than you would expect under extreme levels of publication bias. And actually, I hope that also comes back in the slides of the later talks.

Anyway, I don't have much time, I think — let me check. Okay, I'm going to be quick on this. What happens in the meta-analysis? It all depends on how people select. So this is the typical meta-analysis combining all these different effects. The blue one here is the preregistered one; that's what would enter the meta-analysis across the different studies. And the one on the right is the most significant one, so that's the most extreme scenario you can imagine: selection of the most significant outcome among all the multiverses in all those studies. And this slide — I think I'm going to stick with this one; it's in the paper — is for log odds ratios in two of the registered replication reports. So this is a meta-analytic outcome that depends on what people select from the multiverse. You see the blue one here, which is the preregistered outcome: what was done here is that we ran a meta-analysis on only the preregistered outcomes. The other ones, shifted to the right and reflecting bias, correspond to different selection mechanisms. You can pick the most significant one. You can pick randomly among all the significant ones, which might reflect what happens if you're not running the whole multiverse but rather taking some haphazard, random walk through the forest, if you will — just the first one that turns out significant, or maybe the least significant one. There's also this blob in the middle, which is the average of all multiverses. And the nice thing — this is what we found on this and the next slide — is that yes, there will be bias if there's selection, and it might be more extreme if you pick the most significant one, obviously, because the selection is based on significance. But the preregistered outcomes are actually quite close to the mean over all multiverses.
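As an illustration of those two points — all numbers here are made up, this is not the paper's data — estimates computed on heavily overlapping subsets of the same sample spread out much less than their nominal standard errors suggest, and what ends up "published" depends entirely on the selection rule applied to that multiverse. A minimal sketch:

```python
# Illustrative sketch: 200 "universes" built by arbitrary exclusion rules on the
# same sample vary far less than the nominal SE implies, and different selection
# rules (mean, most significant, random significant) report different effects.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, true_d = 50, 0.0
treat = rng.normal(true_d, 1, n)
ctrl = rng.normal(0.0, 1, n)

estimates, ses, pvals = [], [], []
for _ in range(200):                      # 200 arbitrary universes
    keep_t = rng.random(n) > 0.1          # each drops ~10% of cases at random,
    keep_c = rng.random(n) > 0.1          # mimicking arbitrary exclusion rules
    a, b = treat[keep_t], ctrl[keep_c]
    na, nb = len(a), len(b)
    sp = np.sqrt(((na - 1) * a.var(ddof=1) + (nb - 1) * b.var(ddof=1)) / (na + nb - 2))
    d = (a.mean() - b.mean()) / sp
    estimates.append(d)
    ses.append(np.sqrt((na + nb) / (na * nb) + d**2 / (2 * (na + nb))))
    pvals.append(stats.ttest_ind(a, b).pvalue)

estimates, ses, pvals = map(np.array, (estimates, ses, pvals))
print("SD over universes:", estimates.std(ddof=1))   # typically well below...
print("median nominal SE:", np.median(ses))          # ...the nominal standard error

sig = pvals < .05                                     # different selection rules
print("mean of all universes:", estimates.mean())
print("most significant universe:", estimates[np.argmin(pvals)])
if sig.any():
    print("random significant universe:", rng.choice(estimates[sig]))
```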
This is also depicted in a smaller panel here for the standardized mean differences. There's only one, I think — RRR3, one of the outcome measures at the top — where the mean of the multiverses is not very close to the preregistered outcome. So in a way, in that case they picked something a bit odd, at least quite far out of the range of the other multiverse outcomes. But for all the others, the mean over all multiverses is quite close to the effect that was preregistered. So: degrees of freedom matter. They create a lot of uncertainty, absolutely, with quite complex statistical properties, as you can imagine, which also deviate a lot between different studies, even when the studies are very similar. Without preregistration, there's no way of knowing what's going on, so the potential for bias is massive, and the selection mechanism is what matters most. And, as said, here at least — but it's a selected sample, right? Most of these studies didn't show a big effect anyway; they were failed replications, if you will, at least according to some, though I would say they were very successful in estimating the real effect. But here, at least, the average of the multiverses looks very close to the preregistered outcomes. So thanks, also on behalf of Anton, and we wish him a quick recovery.

Thank you so much, Jelte. We have a bit of time. I saw there were two questions already in the Q&A chat; maybe someone else wants to ask a question as well. Otherwise, Jelte, could you open the Q&A and answer the questions?

Yeah, yeah. So, the first one: the heat map shown around slide five is a count of point estimates, correct? Wouldn't reporting an interval estimate from one analysis represent the effect? Yes — if the points were independent, which they definitely are not. That's related to the insight that if you p-hack, you're selecting among outcomes based on positively overlapping data, so the nominal standard error will in that case always be bigger than the standard deviation over the multiverses. Brian also asked a question — excellent point, thank you. Have you tried emulating the original finding through the replication process by, for example, selecting an analysis strategy that "worked" in one lab and using that same analysis strategy to directly replicate the finding in the other nine labs? That would be a cool effort as well, right? You do indeed see in actual research in psychology that people just mimic what others do. So it might be that the seminal study used some haphazard choice for the outcome measure and some haphazard exclusion — say, dropping everyone over 25 (I can say that, having turned 45 two days ago) — and the replicators just mimic the same, so they would call it a direct replication in that sense. Now, the variation over labs that we observed in all the slides I showed you — the degree to which the different degrees of freedom, the different choices, make a difference across labs — is quite substantial, so I don't think it would matter much. It's a potpourri of nicely smelling, weird results that we don't really understand. So thank you for the questions; I hope that answers them.

Yes, thanks, Jelte. Thank you very much. I'm quickly going to move on to our next speaker. Our next speaker is Amir Abdol. He designed a program, a wonderful program of which I don't want to spoil anything. I saw his presentation earlier; it's going to be very clear and very exciting. So Amir, please take it away.

Hello everyone. I hope everyone can hear me. As Esther said, I'm Amir. I'm a postdoc at Tilburg University and also a JASP programmer at the University of Amsterdam. So now I'm going to introduce SAM to you. SAM is a piece of software we developed to simulate an abstract model of producing and publishing science.
So this is abstract: I'm going to take everything Jelte showed in practice and make it abstract — how science is done and how we could do it in a computer. And while we're not trying to simulate everything one by one, we try to get as close as we can. I'm going to start with the why, then go through what SAM is, give you a bit more introduction and detail about the different pieces that SAM has and that allow you to build your own simulation, and then I'll show you how you use it and what kinds of things you can do with it.

I'm coming from a complex-systems, computational-science background. So when I look at a system that I want to study, I want to know all its components. I want to understand all the pieces to be able to really understand what's going on, I want to know how these pieces interact with each other and how they coexist, and if I remove one or modify one, what will happen. And when I first looked at this whole problem of QRPs, degrees of freedom, and publication bias, my impression was that there is just too much stuff here and so little data. So what do we do? Since I'm coming from computation: let's write a simulator. So that's what we did. I started to build SAM, and as you can see, I define it as a modular, flexible, and extensible simulator to systematically study questionable research practices. There are two keywords here that I would like to point out: one is modular, the other is flexible and extensible. Modularity allows us to understand the underlying components and processes. What that means — I'm going to use a car analogy here — is: think about a car. If you want to change the wheels on your car, the tires, and you know your car has 16-inch tires, you can put on another 16-inch tire and see what effect it has on the handling and the acceleration and whatnot. That's modularity: you can swap pieces around without any confusion, and they still work. Flexibility means that you have precise control over the customization of each component. Again, think about the wheel: you can precisely measure and change the air pressure and see what that does. And extensibility means that there's always room to add new methods and processes.

So this is a general overview of SAM. What SAM tries to do is simulate the process of conducting research and publishing research. If you look at an overview of how research is done — and this is simplified; there's a lot more in our preprint and on the website, but I'm trying to simplify and give you an overview — there are two main players: one is the researcher, the other is the journal. And in our setup we have three processes: preparing, performing, and publishing. How it works is that the researcher thinks about what they want to do: okay, I have this experiment, these are the parameters of the experiment, the population is this, I'm going to have two groups of people, I'm going to give them this treatment and see what happens. Based on that information, the researcher prepares the experiment and performs the research, collecting data. Then they analyze it, and if they're happy with the analysis, they prepare a manuscript, and then we go to the publishing phase, where they pass the manuscript to the publisher, the journal. The journal looks at the incoming manuscript and tries to evaluate it against certain criteria it has.
In the most basic case, for example, it could look at whether the result is significant or not. And if it's happy with the manuscript, it publishes it and it turns into a publication. That's a very abstract overview of how research is conducted. Now I'm going to go through all these pieces one by one, and into their sub-pieces, to see what they are composed of. As we add more and more pieces, I hope the modularity I described starts to show itself and you can see how it all comes together.

If you zoom in a bit more, you see that the experiment contains the experiment setup. These are the terms I'm using — don't try to translate them one by one into things you already know; they will make sense as we go. The experiment setup contains a data strategy, a test strategy, and an effect strategy; I'll go through them later. Think of the experiment setup as the preregistration: you say, I'm going to have two groups, this kind of treatment, the population is going to be that, I'm going to run this test, and I'm going to calculate the effect like that. Then you collect some data; that will be your experiment. The researcher has access to the experiment and runs it. The researcher has some sort of strategy for how they want to analyze the data and how they want to prepare the manuscript. And sometimes the researcher is mischievous and always goes for the significant result — we can tweak that in the researcher strategy. Then we have a set of hacking strategies, methods that the researcher might or might not apply; we can define whether the researcher does this or not. When the research and analysis are done, the researcher prepares the manuscript, chooses a journal, and passes the manuscript to the journal, and the journal reviews it. I've also put a meta-analysis piece here; I'll come back to that a bit later. And after the manuscript is accepted, we have a set of publications.

Now I'm going to dive even deeper into each of these and show you more pieces — think about the modularity; I'm trying to break down the process of conducting research as far as we can. So let's look at the experiment. The experiment has design parameters: there are two groups, a certain number of DVs, a certain number of conditions, and a certain amount of data. Then we have data strategies, which define where the data come from: is it a fixed-effect model, a random-effect model, a graded response model or other IRT models, or whatnot? These are all pieces I can put into SAM. Then we have test strategies: how is this experiment going to be tested, how do I decide whether the result is significant? There are a few tests we have implemented, but remember, it's extensible, so we can always add more; we can fine-tune them because it's flexible, and we can interchange them. Then we have the effect strategy, with a few options there as well.

And then I'm going to move to the researcher. We know the hard part is usually the researcher, because the researcher is the one that thinks — and when people think, things get hard. So we have the experiment, which the researcher takes control of, and then we have the researcher strategy: he or she always has some sort of plan for how to run the test and how to run the analysis.
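To make that modular decomposition concrete, here is a minimal Python sketch of the kind of structure being described. SAM itself is written in C++, and these class names, fields, and the toy hack are my own illustration, not SAM's actual API:

```python
# Minimal sketch of the modular decomposition: a swappable data strategy,
# test strategy, researcher (with optional hacking strategies), and journal.
from dataclasses import dataclass, field
import numpy as np
from scipy import stats

@dataclass
class DataStrategy:                         # where the data come from
    mean_diff: float = 0.2
    n: int = 20
    def generate(self, rng):
        return rng.normal(self.mean_diff, 1, self.n), rng.normal(0, 1, self.n)

class TestStrategy:                         # how the outcome is tested
    def run(self, treat, ctrl):
        _, p = stats.ttest_ind(treat, ctrl)
        d = (treat.mean() - ctrl.mean()) / np.sqrt((treat.var(ddof=1) + ctrl.var(ddof=1)) / 2)
        return {"d": d, "p": p}

@dataclass
class Researcher:                           # decides whether the result is "good enough"
    data_strategy: DataStrategy
    test_strategy: TestStrategy
    hacking_strategies: list = field(default_factory=list)
    def conduct(self, rng):
        treat, ctrl = self.data_strategy.generate(rng)
        result = self.test_strategy.run(treat, ctrl)
        for hack in self.hacking_strategies:            # applied only while unhappy
            if result["p"] < .05:
                break
            treat, ctrl = hack(treat, ctrl, rng)
            result = self.test_strategy.run(treat, ctrl)
        return result

@dataclass
class Journal:                              # review strategy: a plain significance filter
    alpha: float = .05
    publications: list = field(default_factory=list)
    def review(self, manuscript):
        if manuscript["p"] < self.alpha:
            self.publications.append(manuscript)

# Example run with one crude hack (just add 10 more observations per group)
def add_observations(treat, ctrl, rng, extra=10):
    return (np.concatenate([treat, rng.normal(0.2, 1, extra)]),
            np.concatenate([ctrl, rng.normal(0.0, 1, extra)]))

rng = np.random.default_rng(0)
journal = Journal()
researcher = Researcher(DataStrategy(), TestStrategy(), [add_observations])
for _ in range(50):
    journal.review(researcher.conduct(rng))
print(len(journal.publications), "of 50 manuscripts accepted")
```

The point of the modularity is that any one of these pieces — the data strategy, the test, the hacking list, or the journal's review rule — can be swapped out without touching the rest.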
And then, along the way, the researcher makes certain decisions. Some are listed here; there are more. There are logical queries you can give the researcher: okay, when you run your study and collect the initial batch of data, look at your data, choose, for instance, the preregistered result, and decide whether you're happy with it or not. If you're not, maybe you want to run some of the hacking strategies. So think of it as simulating a specific path: we have certain decisions to make at certain positions during the research. We could even have a single decision at the very end, before submission: the researcher does everything, and when everything is prepared, looks at the manuscript and asks, am I really happy with this or not? You can define specific criteria, and the researcher decides: no, I'm not really satisfied by this. And there are several hacking strategies we have implemented — some common ones: selective reporting, optional stopping, outlier removal, optional dropping, group pooling, questionable rounding. You can have anything here, or implement more, and you can fine-tune these to your preferences. We understand that there is no consensus on what these practices are exactly, but there are a lot of parameters to get to the behavior you would like. After the analysis is done, we have the manuscript, which is just a piece of a document that the researcher passes to the journal.

So that was the researcher; we go one level deeper, and at the end we have the journal. The journal is equipped with a review strategy: it has to decide how it is going to process an incoming manuscript. We know that in reality it goes to reviewers and there are several criteria each reviewer has. We could have a journal that simply accepts everything. We could have a journal that selects randomly, because we want to see what effect that has. We could define a journal that selects based on significance, so we induce some publication bias. We could define a journal with a custom policy: say, look at the effect; if the effect is medium-sized and significant, accept it; if not, just reject everything. Or we could have something like an adaptive review strategy, where the journal continuously updates a certain metric over the track record of research it has — say, it always calculates Egger's test and tries to keep the funnel plot somewhat symmetric, so if most of the results are on the right side, it gives a bit more weight to those on the left side to make the symmetry work a bit better, keeping the pool of publications somewhat stable. So we have control over how to define the review strategy, we have control over what the journal calculates during this process and can export when you want to analyze the data, and we also have the pool of publications that the journal collected based on these strategies and on the behavior of the researcher during the simulation. At the end, everything goes into a pool of publications that we can take and do more analysis with.

So those were all the pieces. There's a lot more going on, but I want to give you a very high-level overview of how these processes are implemented in SAM. There are processes I'm leaving out here, but this is just a very high-level overview.
We initialize the experiment with the parameters that we set, and we generate an experiment. Then we pass it to the researcher, and we also define all the selections and decisions that we want the researcher to make. The experiment is clean; the researcher looks at it, makes a specific selection — say, picks the preregistered outcome — and makes a decision: am I happy with this, or do I want to hack? I use "hack"; it's a bit of a strong word, but you get what I mean — it's basically some sort of peeking. If the researcher is not happy and wants to hack, it goes to the next stage and asks itself again: do I really want to do this? You can define those parameters. If the researcher actually decides to do the hacking, it goes through a loop of hacking strategies that we define: they can be stacked on top of each other, pick up a fresh copy of the experiment, edit it, and collect all those results. And at the end of all the hacking strategies and adventures, the researcher looks at that whole pool, selects the one it thinks is best — ruining the science a bit — prepares the manuscript, and sends it to the journal. The journal looks at it and decides whether it wants it or not, and if it's significant, it probably wants it. So this is a very high-level overview of what's happening; there are many more processes you can adjust. And now I'm going to show you how you could do it.

SAM is a C++ simulator, which means that, in our case, you can use it in two ways. You can use it by feeding it a specific input file, the configuration file, or you can use it as a C++ library and integrate it into any other program — say, an evolutionary algorithm you have — and use it as an agent-based simulation framework. Here is how the configuration file looks. We have the experiment parameters — I don't want you to read everything, I just want you to see how you put together the pieces I mentioned. You define the conditions, you define the dependent variables, you define the number of observations. You have your data strategy — say, a linear model: you give it the multivariate normal, you define the means, you define the sigma. Then you define your effect strategy, then you define your test strategy. As you can see, there are a lot of parameters to tune. Then you define your selection strategy, or researcher strategy: how to navigate this path of selections through the different stages of a simulation. Then you define the probability of being a hacker — you can have a random variable there — and you can define a series of hacking strategies: how you want them to be applied, in what order, what parameters they should have, when to stop, when to go to the next one, and, after you're done, what to select and what to report. So you have control over all these details. And at the end, you define the journal — say, run the simulation, collect a maximum of 14 publications based on significance, and induce 95% publication bias. What you get at the end is an output file, say a CSV file, generated based on the settings you chose, and you can analyze it further to see what kind of bias has been induced in your publication pool.
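To give a rough feel for what such a configuration covers, here is a hypothetical sketch written as a Python dict. The key names and values are illustrative only — they are not SAM's actual JSON schema; see the SAM documentation and paper for the real format:

```python
# Hypothetical configuration sketch mirroring the pieces described above
# (experiment, researcher, journal). Key names are illustrative, not SAM's schema.
config = {
    "experiment": {
        "n_conditions": 2,
        "n_dependent_vars": 1,
        "n_observations": 20,
        "data_strategy": {"name": "LinearModel",
                          "means": [0.0, 0.2],                 # control vs treatment
                          "sigma": [[1.0, 0.0], [0.0, 1.0]]},
        "test_strategy": {"name": "TTest", "alpha": 0.05},
        "effect_strategy": {"name": "CohensD"},
    },
    "researcher": {
        "probability_of_being_a_hacker": 0.5,
        "hacking_strategies": [
            {"name": "OptionalStopping", "max_attempts": 3, "n_added_per_attempt": 5},
            {"name": "OutlierRemoval", "threshold": 2.5},
        ],
        "selection": "first_significant",
    },
    "journal": {
        "max_publications": 14,
        "review_strategy": {"name": "SignificantSelection", "publication_bias": 0.95},
    },
}
```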
If you choose to use the configuration file, there is a tiny tool called Frodo that helps you generate these files, configure your projects, manage your SAM projects, and even run your project in parallel on a supercomputer. So that's Frodo, and that's how you set it up.

Now I'm going to recap and then give you some pointers on what you could do with SAM. As I said, SAM is modular, flexible, and extensible. Modular, because we can swap pieces very easily and the whole structure of the program stays the same — we don't need to worry about whether they work together; they always work. It's flexible because we have very precise control over the parameters of each module, and it's extensible because we can extend it later. If someone later wants to apply some Bayesian data analysis, or a Bayesian publication bias measure on the journal side, they can do it. If a physicist wants to add some fuzzy logic, that's also possible. So it's a simulation framework for systematically studying researchers' questionable research practices, as well as journals' questionable reviewing practices, because we are able to design specific selection mechanisms for the journal: we can define different policies for journals to review the incoming manuscripts passed on by researchers. So we can actually understand, at some level of abstraction, what the journal does, and maybe try to find different ways for journals to select submissions that don't introduce so much bias.

And with that — since there's too much here, I'm going to give you a few examples of what you could do with SAM. You could study publication bias and publication bias metrics. You could study questionable research practices, their individual and collective effects. You could do multiverse simulation and analysis; you could also include QRPs and publication bias in those, because they're just different modules. You could do meta-analysis simulation, again with both of those included. You could design experiments on journal reviewing policies, as I mentioned. You could design and evaluate different researcher strategies — if we know how researchers think, then maybe we can simulate a different pathway and tell them: okay, if you had thought about it this way, it might not have been as detrimental in the end. And you could do combinations of all of the above. As I mentioned, it's modular and extensible: we can add some new method we've just come up with and see how it affects the whole system. And with that, I would like to thank you. There's a lot, and I probably talked too much and too fast, but I hope you at least got the gist of what SAM is and what it can do. Thank you very much.

Thank you, Amir. We just received a question in the Q&A; I hope you can open it. Yes, I see one. "Interesting to hear more on what SAM's output file would look like." So, you get several CSV files, based on the parameters that you define. You can ask SAM to report all the publications that have been accepted, and all the rejected ones. That includes several statistics, like the effect sizes, p-values, and numbers of observations; there is an indication of whether the study has been hacked, and which dependent variable has been reported, and a lot more. If you define a meta-analysis, you also get the output of that meta-analysis calculation. "Can you evaluate the biases in published results with this?" I'm not sure I understand the question correctly — what do you mean by that? In the interest of time, I'm going to ask the person who asked the question to rephrase it and post it in the Q&A again, because we have to move on.
Thank you, Amir, so much for your talk. Thank you. And I'm going to have a quick look at whether Jelte is ready — and he's back, by popular demand. I'm going to give the floor to you again.

Thank you. Can I share? I think Amir stopped sharing. Yeah, thank you. So here I am again. Now I'm going to show you a typical example of what you could do with SAM — and, well, it's not even close to what it's capable of, which is cool; so it's a bit of a teaser, if you will. The focus of this analysis is actually a replication of another study, by Marjan Bakker and me, dating back almost nine years, where we looked at these different biases in one particular context, again focusing a lot on the kinds of scenarios that are of interest in psychology — but I imagine these would be relevant to other fields as well. And the focus of the question in running this in SAM was: okay, can p-hackers also cheat their way through a lower alpha level?

As said, there are many degrees of freedom. This is the paper I alluded to earlier, for psychology studies: we in the meta-research group in Tilburg sat down at one point and just had a brainstorm session to make a list — which is published in this article — of how many degrees of freedom we could actually observe, the many choices one can make. This was actually the basis of many of the degrees of freedom in the multiverse project by Anton, and it is also used in checking preregistrations, by Marjan again, in our group; we'll also talk more about that next week. Now, many ways to analyze data mean many ways to reach the stars. But what happens if you lower that alpha to 0.005?

This is the paper we tried to replicate, which was doable — which is itself an interesting exercise. It actually started with Marjan's recognition that there is a power paradox in psychology, as there is in many other fields: many researchers continue to use very small, underpowered samples, and notwithstanding that, they report an awful lot of significant results — arguably over 90%, or even higher according to some very old estimates. And that cannot be true. Marjan, in her paper, estimated the average or median power, based on earlier meta-analyses of psychology studies, to be around 0.35. So a one-third chance of getting a significant result, yet over 90% significant results reported: even if every tested effect were real, you would expect roughly 35% of results to come out significant, not more than 90%. No way this can be true. It's nice to note that this 0.35 has been quite well corroborated by other major meta-meta-analyses since. So anyway, it cannot be true — that's the main essence. And how come researchers continue to use small samples? The reason — and that's the focus of Marjan's study — is that running multiple small samples instead of one large one is a very effective strategy for getting a significant result, particularly when the effect sizes are small or non-existent, and of course when you also use some p-hacking tricks, or questionable research practices as we called them earlier. So this is what it looks like: planned analysis — significant effect? Write the paper. If not, well, maybe you can redo the analysis with some alternative outcome variable. This is called selective outcome reporting, or outcome switching, and it is well established in the medical sciences for randomized controlled trials, where we know about it because they were preregistered.
And it's very easy to do: actually, 63% or so of researchers in psychology, according to anonymous surveys, admitted having used this trick at some point in their career. You can also add some cases — optional stopping. Sequential testing would of course require changing your statistical procedure to accommodate that, and there are models for this, but those are normally not used — sometimes, but rarely. You can also remove some outliers based on some particular criterion. If the result is then significant, you're again a happy camper; if not, you say it's a failed study and resort to the file drawer. Many, many different ways, many different paths. And that's actually an important point, because this is one sequence that was simulated in Marjan's paper, but you can also imagine many different sequences, and they make a difference — they might make a difference at times. That is also what we encountered when Amir replicated that study in SAM.

So the first thing is: this is your road to success in a world where significance matters, there is scarcity of resources, and, let's say, we all love significance and not so much preregistration. But you can also do another thing. As mentioned, you could do one large study — you have resources for 100 participants, so you could do one study — or you could do the same trick five times, with 20 participants in each study. And that's actually a very efficient strategy: if your only goal is to get significance, you just run multiple small studies, and the chance of finding something is much bigger, particularly when you also p-hack. And this is what we found in the replication. To be clear, the replication here is a meta-replication: it's a replication of a study on replications, in a special issue on replications dating from 2012, by the same — well, mostly the same — research team, at least as far as I'm concerned. There are slight deviations — that's the difference between the lines that are very close, which is interesting; it cost Amir quite a lot of time to uncover why that was the case, and it's actually some path dependency. If you come to think of it, using a platform like SAM, which is modular, you notice such things immediately. So there are dependencies and very minor differences depending on how you implement it, and that also casts some doubt on general conclusions based on simulations like this, because setting things up a bit differently might create completely different outcomes.

So this is the true effect size, the standardized mean difference, going from zero to one. On the left-hand side, the dashed lines represent the strategy of small studies: instead of the one study with a hundred participants, which is the solid line, you run five studies of 20. The mere chance of a significant result — you don't need to simulate anything to conclude that under the null — is much higher than for one single study. So what you see on the left-hand side is that the dotted lines do much better up until an effect size of 0.50 or so. That is the dominant strategy, particularly when you also p-hack using the tricks as simulated here — this is the winning strategy, which might explain why people do it. This obviously comes at a major cost in terms of estimation and bias, and that's on the right-hand side: the effect size biases are massive in this scenario, which entails 20 participants in total, 10 per cell — that's what the n means.
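A rough, illustrative re-creation of this strategy comparison — not the paper's exact simulation; the p-hack here is a crude optional-stopping step, and all parameter values are mine — looks something like the sketch below. Note that alpha is a parameter, so the lowered-threshold scenario discussed next is just another setting:

```python
# Illustrative: with roughly the same participant budget, compare one larger
# study (50 per group) against five small studies (10 per group), with and
# without a crude optional-stopping hack, at alpha = .05 and alpha = .005.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)

def one_study(n_per_group, d, alpha, rng, hack=False):
    """True if the study ends up significant and positive, optionally after
    adding 5 per group up to two more times (a simple optional-stopping hack)."""
    a = rng.normal(d, 1, n_per_group)
    b = rng.normal(0, 1, n_per_group)
    for _ in range(3):
        if stats.ttest_ind(a, b).pvalue < alpha and a.mean() > b.mean():
            return True
        if not hack:
            return False
        a = np.concatenate([a, rng.normal(d, 1, 5)])
        b = np.concatenate([b, rng.normal(0, 1, 5)])
    return False

def p_success(strategy, d, alpha, hack, reps=1000, rng=rng):
    wins = 0
    for _ in range(reps):
        if strategy == "one_large":                      # 1 study, 50 per group
            wins += one_study(50, d, alpha, rng, hack)
        else:                                            # 5 studies, 10 per group
            wins += any(one_study(10, d, alpha, rng, hack) for _ in range(5))
    return wins / reps

for alpha in (0.05, 0.005):
    for d in (0.0, 0.2, 0.5):
        print(f"alpha={alpha} d={d}",
              "large:", p_success("one_large", d, alpha, hack=False),
              "small:", p_success("five_small", d, alpha, hack=False),
              "small+hack:", p_success("five_small", d, alpha, hack=True))
```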
So that's the problem: massive bias. Even if the effect is true, you have a problem, and not a small one either. That's what Marjan found, and this holds for different levels of n — but that's in the paper. We replicated it; it was a nice effort, to make sure it works well. And given that you've implemented it in vastly powerful machinery like SAM, you can easily change anything, and that is of course what we wanted to do. So: massive bias, and the winning strategy in these scenarios, at least with an alpha of .05, is running small studies, which might explain why they are so common in psychology compared to bigger studies — although they are diminishing a bit, given the slow increase in sample sizes that we see in work by Marjan and others.

So, let's redefine statistical significance, at least for the first study in a field, for novel discoveries, said Benjamin and his big group of co-authors — well-known, familiar names. And the question here is: how does that change the rules of the game, as we called it in the title of the paper? Does it still pay off to run small studies? Probably not. This is the same result that you just saw, in three columns. The right-hand side is for an alpha of .05; in the middle is the .005 that Benjamin and his colleagues recommended; and why not add yet another, lower level of alpha on the left-hand side. And obviously you can rerun this with all sorts of other inferential criteria and all sorts of other types of analysis — child's play in a sense, apart from some debugging maybe, but that's what you can do.

So, lowering the alpha: what does it do? Well, that's statistics 101 — it lowers the chance of false positives. That's the left-hand side of the chart here, for effect sizes that are zero. And then, as you go up, the power is of course lower. So the chance of finding anything, even if you do the good thing — which is the solid blue line here, one big study; it's still not that big, but at least you're not p-hacking — is quite low, and it gets lower, obviously, as you lower the alpha. But it's also very important to note that on the right-hand side, up to an effect size of 0.5 or so, the small-sample strategy was still effective; it was actually the winning strategy at the higher alpha, particularly when the researcher is willing to p-hack — that's the green dotted line at the top on the right. But that changes a lot when you lower the alpha. Now it's actually the large-study strategy — picking the one study of 100 instead of the five times 20 — that gives you more of an overall chance of finding an effect. So if that's the scenario, again as simulated in these cases, then people would more often choose the big sample over the small one when the alpha is lower, which is actually good news, because the bias is still massive. In a way, it doesn't matter: it's a function of the lack of power across the board, and the power will be lower for the same sample size, obviously, when the alpha becomes lower. So the biases are basically the same, but they come from the very nasty biased strategies in the upper right-hand side — the dashed lines, which cause massive biases even for n = 40, that's 40 times 2, 80 participants, two cells of 40 — which for experimental psychology is big, much bigger than the average.
The massive bias that you see for the small-sample studies would become less common under the lower thresholds. So large-sample studies would maybe become more common, which is a good side effect. The bias still remains — it's actually even a bit higher there — although, if you p-hack, it might become a bit less; p-hacking acts as a power booster, so that's interesting. But anyway, this is what you can do, and you can also expand on it. So this is a more aggressive form of p-hacking. There are so many different ways to p-hack, and we have no idea what people actually do. It would be good to keep track, for a random sample of studies, of all the analyses that have been done, and then do some cool analyses of that to see what people actually do — but we don't have enough of that type of study. So this is just another way of doing the optional stopping; I don't expect that to do much, but that's the topic of Esther's talk after this. And removing outliers, where you vary the k, the threshold for what you call an outlier — we called that subjective in a paper on outlier deletion and bias. That's a really aggressive approach: you just lower the threshold for what counts as an outlier until you find something significant. If you do that, you get bias, because it pays off, but it enters the sphere of more than a bit questionable, I think, in terms of integrity — it's really bad. And if you reported it truthfully, most researchers would say: what the hell, why do you exclude an outlier with a z-value of 1.6? It doesn't make any sense.

Okay, so, to wrap up. The first point I'm not going to repeat — we didn't have to do any simulation for that: lowering alpha lowers the false positive rate. According to Benjamin and colleagues, it obviously also has some nice positive effects on the positive predictive value, among others, and hence probably also on replicability, although that is a major point of discussion, as we've seen. But the nice thing — that's the good news in green — is that running small studies against such a higher level of evidence, with an alpha lower than .05, pushes even the p-hackers to use bigger samples. So that's the good news. The biases will remain, and for the larger studies they will actually increase here and there, but you're forced towards a bigger sample, unless you do very aggressive p-hacking, as shown in this slide — but that's just weird; the pattern even going up as the true effect size increases is weirder than life. So the biases are not going to go away, but there might be something to it, I would say.

Now, this is all contingent on this simulation, which mimicked what Marjan did — we added a couple of things to it — and that's also the main point of discussion. So do I dare to make general statements and recommendations, as Benjamin and colleagues did? Well, not at this point, I would say; more work is needed, because there are so many different things that play a role. This is one particular scenario, but you also have to take into account resources, and the incentive structure, where finding a novel effect typically pays off in grants and in recognition and rewards. And in such a competitive environment there might also be all sorts of dependencies that I don't immediately see, which might also play a role.
At the same time — and that's what I like about the sentiment — you put the burden of proof on the original authors, and they also have the most to win by running big samples, and I think big samples are always better. So hooray for that. But much more modeling is needed, and that's what you can do with SAM; that's the future. Hopefully next year, or the year after that, we'll present some more work with SAM on this. So this is the meta-research group, and the funder that gave us the money, and the link to the SAM paper. Thank you.

Thank you, Jelte. We have time to answer, I think, one quick question. There is a question in the Q&A from Gilbert. Recent publications suggest that we indeed lower the p-value threshold from 0.05 to 0.005; they modeled their calculations with 0.005 to understand how it might impact the scientific process. False positives will be reduced drastically, but the total resource consumption will rise. And then, yeah, they ask: what do you suggest, basically?

Yeah, so that's a good point. Resources are not typically modeled in these types of scenarios, and we all recognize as meta-researchers that there are elements to this related to getting resources. You can model it, as some have done, in a particular framework, and it is important to do that, to see how these behaviors are affected by resources and by collaboration networks, even going beyond individual researchers, which makes it even more complex. So I'm pretty sure you can create very cool scenarios and research questions in that regard — you'd have to build modules for that as well, so it's not ready yet, but you can do it there. So that's the beauty of it. I'll also address this issue, I think, in the discussion section. Thank you so much. Thanks. Thanks, Jelte.

Okay, I'm going to set up my presentation real quick. You should be able to see it right now — I'm having some technical difficulties, so if it's not visible, I would gladly hear it. Looks okay? Okay, then, great; then I'm just going to start.

Thanks to the previous speakers. Hello, everyone. I'm happy to be part of this symposium as a speaker as well. I'm going to give a talk that presents the results of a simulation study performed in SAM, in which we investigated the effect of multiple questionable research practices on meta-analytic pooled estimates and heterogeneity estimates. I'm going to keep my introduction very short. Our study is heavily inspired by the, I can say, famous articles investigating the self-admission rates of engaging in questionable research practices among American, Italian, and German psychologists, and our first aim with SAM was to investigate most of these QRPs separately. As Jelte already mentioned, we know from previous research that the effects of publication bias can be quite detrimental, and bias through the use of questionable research practices should be smaller than the bias from publication bias. We are interested in finding out which QRPs are more detrimental than others. So in this study, which is preregistered at the link you see below, we were interested in the effect of five different QRPs on the meta-analytic effect size estimate and heterogeneity estimate. We also checked the p-value distributions of the individual effect sizes to see whether there might be a bump below 0.05, but those results are not included in this presentation due to time constraints.
We also varied some factors across all conditions. We varied the true effect size over five levels; we chose these mean differences because they are the thresholds and standards used by Cohen, and the 0.35 value is commonly found as a median effect size in psychological meta-analyses. Based on these effect sizes, we calculated the required sample size for a power of 0.80. We derived the true heterogeneity values from a previous study of psychological meta-analyses. And finally, we varied the number of studies in the meta-analysis: 5, 25, and 125.

Before I show you results, I'm going to give you a quick overview of how these QRPs work, what they are, and what the process behind them is. Naturally, we have a baseline condition with no QRPs, in which we have two groups and one dependent variable, we conduct a t-test between these two groups, we get a mean difference estimate and a p-value, and we include this effect in the meta-analysis regardless of its outcome, regardless of significance.

Our first QRP is optional stopping, for which we have two levels. This is the moderate level: we again have two groups and one dependent variable, we perform a t-test, and we include the effect in the meta-analysis if the effect is significant and positive — positive meaning in the direction we expected. If this is not the case, in the moderate variant of optional stopping we add five people per group, so ten in total, and we test again. If we do not find a significant effect and a positive result, we add some more, and then again, if we fail, another round. We do this for three rounds at most, or until we have a positive, significant effect. At the end of the three rounds, the effect that we have with the larger sample size is added to the meta-analysis. Then we also have an extreme optional stopping condition. Here we again first test on our original sample, and if we do not find a significant and positive effect, we add 15 people per group — quite a bit more than in the previous one, so 30 in total — and we do that for three rounds, or we stop once a significant and positive effect is found. At the end of the three rounds, we include the effect we have in the then-current sample.

Our second QRP is selective outcome reporting, also known as outcome switching, which has a large self-admission rate — between 40 and 70%, I believe, in the studies that I just mentioned. Here again we have two groups, but now we have four dependent variables instead of one. Please note that the true effect size is the same across the outcomes, and the outcomes are correlated at 0.40. We want to emulate a researcher who tests multiple outcomes, takes the first significant outcome they find, and discards the rest. But because SAM simulates all results at the same time, what we do is check all of them, check which of the effects are significant and positive, and choose one of those at random — so it's not necessarily always the one with the lowest p-value. That is how we try to approximate the "first significant outcome". Then there is the more extreme version of selective outcome reporting, where we have two groups and four DVs as before: we test all four, pick the most significant one, the one with the lowest p-value, and discard the rest; that one gets included in the meta-analysis.
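For concreteness, here is a small sketch of the two selective-outcome-reporting variants just described. The values for n and the effect size are illustrative, and the fallback to the first DV when nothing is significant is my assumption, since the talk does not spell that branch out:

```python
# Selective outcome reporting with 4 correlated DVs (r = .40), same true effect:
# "first significant" = a random significant-and-positive DV, "max" = lowest p.
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)
n, d, k, r = 25, 0.2, 4, 0.40
sigma = np.full((k, k), r) + (1 - r) * np.eye(k)      # compound-symmetric correlation

def one_trial(rng):
    treat = rng.multivariate_normal(np.full(k, d), sigma, n)
    ctrl = rng.multivariate_normal(np.zeros(k), sigma, n)
    p = np.array([stats.ttest_ind(treat[:, j], ctrl[:, j]).pvalue for j in range(k)])
    diff = treat.mean(axis=0) - ctrl.mean(axis=0)
    sig = (p < .05) & (diff > 0)
    # moderate: a random significant, positive DV (else fall back to DV 1 -- an assumption)
    moderate = rng.choice(np.flatnonzero(sig)) if sig.any() else 0
    # extreme: the DV with the lowest p-value (ignoring direction, for simplicity)
    extreme = int(np.argmin(p))
    return diff[moderate], diff[extreme]

results = np.array([one_trial(rng) for _ in range(2000)])
print("mean reported difference, 'first significant' variant:", results[:, 0].mean())
print("mean reported difference, 'most significant' variant: ", results[:, 1].mean())
print("true mean difference:", d)
```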
Outlier removal also has two levels, moderate and extreme. We have two groups and one dependent variable again, and we select the effect if it is significant and positive. If we do not find such an effect, we identify and delete outliers from the sample. In the moderate level of outlier removal, we delete observations with a z-score smaller than minus 2.5 or larger than 2.5, which corresponds roughly to a 99% interval around the mean. These people drop out, we test again, and we include the effect in the meta-analysis. In the extreme outlier removal condition, we identify outliers with a z-score smaller than minus 2.5 in the treatment group and a z-score larger than 2.5 in the control group, meaning that we're more selective in who we throw out and we create a bigger distance between the groups. We drop these people, perform a t-test, and include the effect in the meta-analysis.

Then our multiple-conditions QRP has three groups, or three conditions, which I will call control, treatment one, and treatment two — again with one dependent variable and the same true effect size across these groups. We first test control versus treatment one, and basically pretend treatment two does not exist. If this does not give us a significant, positive effect, we do the same thing but test control versus treatment two. Finally, if that also does not work, we pool the two treatment groups and test that combined treatment group against the control group. If that does not give us a significant and positive effect either, we select, out of these three effects, the one with the smallest p-value; that is the one that gets included in the meta-analysis.

Then our final QRP is optional dropping, also known as subgroup analysis. Here we wanted to simulate a practice where participants are dropped from the study based on their score on another variable — for example, gender. If we find a non-significant effect for the whole group, as you see here, we do subgroup analyses to find a significant effect. We simulated this such that 50% of the sample has a score of zero and 50% a score of one on each of the dummies. We first perform a t-test for the entire group. If that one is not significant, we test the effect in the first subgroup, dropping the other half, and if that doesn't work, in the other subgroup, of course. If this doesn't work, we take the second dummy and drop one half or the other, and finally we do the same for dummy number three. If at the end we have not found a significant effect, we select the one with the smallest p-value from these seven tests.
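Here is a small sketch of the two outlier-removal variants described above (illustrative n, effect size, and number of simulation runs); it also gives a feel for why a threshold as strict as 2.5 trims very few observations:

```python
# Moderate outlier removal: drop |z| > 2.5 in both groups if the first test fails.
# Extreme outlier removal: drop only the low tail in the treatment group and the
# high tail in the control group, pushing the groups apart.
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)
n, d, z_crit = 25, 0.2, 2.5

def hacked_estimate(rng, extreme=False):
    treat, ctrl = rng.normal(d, 1, n), rng.normal(0, 1, n)
    if stats.ttest_ind(treat, ctrl).pvalue < .05 and treat.mean() > ctrl.mean():
        return treat.mean() - ctrl.mean()                     # happy without hacking
    zt = (treat - treat.mean()) / treat.std(ddof=1)
    zc = (ctrl - ctrl.mean()) / ctrl.std(ddof=1)
    if extreme:
        treat, ctrl = treat[zt > -z_crit], ctrl[zc < z_crit]  # one-sided trimming
    else:
        treat, ctrl = treat[np.abs(zt) <= z_crit], ctrl[np.abs(zc) <= z_crit]
    return treat.mean() - ctrl.mean()                         # reported (hacked) effect

for extreme in (False, True):
    est = [hacked_estimate(rng, extreme) for _ in range(5000)]
    print("extreme:" if extreme else "moderate:", "mean estimate", np.mean(est), "| true", d)
```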
Okay, those were the QRPs; now the results. The results are quite similar across the different numbers of studies in the meta-analysis, so I am only showing you the k = 25 condition, which means each meta-analysis consists of 25 studies. Because you can do so much with SAM, I have collected many results, and there are many different things that would be interesting to discuss, but I will only show a few. In a minute I will show you a larger graph with many panels, but to get accustomed to the layout, here is a smaller example first. In the columns we see two variants of one QRP: selective outcome reporting (first significant effect) and selective outcome reporting (max, the most significant effect). In the rows we see the results for two levels of between-study heterogeneity, tau-squared of zero and 0.01.

Within each plot, the x-axis shows the true effect size and the y-axis always shows the bias; in this case it is the bias in the individual effect sizes that we are interested in. The lines represent the different sample sizes per group. For selective outcome reporting (first), this column, we see that the bias decreases as the true effect size increases. This makes sense, I guess, because power increases with effect size and with sample size. We also see across the rows that the bias increases as heterogeneity increases. In contrast, selective outcome reporting (max) has a larger amount of bias overall, and the bias stays almost constant across levels of the true effect size. Of course, the bias is largest for the smaller sample sizes, and similarly the bias increases with tau-squared. As you can see, the bias is quite substantial, especially with the small sample sizes.

So that was an example for two QRPs and two levels of tau. These are our complete results. Again, the columns show the QRPs, the rows show the tau-squared values, and within each plot the x-axis is still the true effect size and the y-axis is the individual-level bias. Here we see the same pattern: for most QRPs the bias decreases as the sample size increases, which you can see in the difference between the lines, and it decreases in d, so as the true effect size rises the bias goes down for most of them. The bias also increases with the amount of heterogeneity. In that sense the results are pretty consistent. Specifically, the two versions of selective outcome reporting and optional dropping have large biasing effects. A side note I should make here is that with optional dropping, if we have not found a significant effect across all comparisons, we choose the smallest p-value, which means the optional-dropping procedure actually contains the extreme form of selective outcome reporting at the end; that may explain the large biasing effect we see here. The same holds for the multiple-conditions QRP, where the bias is also quite large and where, at the end, we also select the effect with the smallest p-value. Optional stopping has a small effect overall for both versions, and as Jelte said, this result makes sense: because we keep sampling only until we have just found significance, the effect size is inflated, but not by much. What is surprising to me, specifically in the context of the previous talk, is that the outlier-removal conditions show such small amounts of bias. It could be that our z threshold was too strict and that changing it would give different results; that is probably the case. Maybe we should choose a lower threshold to investigate that, because there have been studies, as you have seen, that found a substantial impact of outlier removal. So we have to investigate this further.

Okay, next we are going to see the same results, but now at the meta-analysis level. The bias we are going to see is the bias in the meta-analytic estimate, so these are the results of the individual effect sizes being put into the meta-analysis. The results are very similar to the previous ones; I will go back to the previous slide and then back to this one, and as you can see they are quite similar. The conditions where selective reporting in any form takes place have the largest bias in the effect size.
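To show what "bias in the meta-analytic estimate" means mechanically, here is a generic sketch: take k QRP-affected standardized mean differences, pool them with a DerSimonian-Laird random-effects model, and compare the pooled estimate with the true effect. These are the textbook DL formulas, not SAM's code, and the input effects below are placeholders.

```python
# A generic DerSimonian-Laird random-effects meta-analysis, used to illustrate how bias
# in the pooled estimate could be computed. The 25 input effects are placeholders
# standing in for QRP-affected study effects; the formulas are the standard DL ones.
import numpy as np

def dersimonian_laird(d, v):
    """Random-effects pooled estimate and tau^2 for effects d with sampling variances v."""
    w = 1.0 / v
    d_fixed = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fixed) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    tau2 = max(0.0, (q - (len(d) - 1)) / c)
    w_star = 1.0 / (v + tau2)
    return np.sum(w_star * d) / np.sum(w_star), tau2

def smd_variance(d, n1, n2):
    """Approximate sampling variance of a standardized mean difference."""
    return (n1 + n2) / (n1 * n2) + d ** 2 / (2 * (n1 + n2))

rng = np.random.default_rng(5)
true_d, n = 0.35, 64
d_qrp = rng.normal(true_d + 0.15, 0.25, size=25)     # placeholder for 25 p-hacked study effects
v = smd_variance(d_qrp, n, n)
d_pooled, tau2_hat = dersimonian_laird(d_qrp, v)
print("bias in the meta-analytic estimate:", d_pooled - true_d)
```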
What is also clear from these results is that the peaks of the bias seem to occur around common effect sizes and common sample sizes in psychology, which I think is quite problematic.

Let me see. In the interest of time, I am going to show you the final result: the effect of QRPs on the tau-squared estimate in the meta-analysis. We are looking at the same kind of graph, constructed the same way, but now the y-axis shows the bias in tau, and as you can see, zero bias sits at the top of these panels. We are again looking only at the results for 25 studies per meta-analysis. Here I should note that the results for 5 studies are a bit more pronounced than this, and the results for 125 studies a bit less pronounced. That is quite problematic, right, that the results for 5 studies are more pronounced, because the median number of studies in psychological meta-analyses has been found to be somewhere between, I think, six and 19. So yes, that is again quite problematic. The main takeaway message I have here is that QRPs in general lead to underestimation of the amount of between-study heterogeneity. This effect is quite negligible for small amounts of heterogeneity, as you can see in these two rows. However, once there is quite some heterogeneity, the underestimation becomes quite severe for all QRPs except, again, outlier removal, which we still have to investigate. Interesting to see here is that optional stopping, which did not have much effect at the individual effect-size level or at the meta-analysis level, has quite some effect on the heterogeneity estimate once heterogeneity is large enough, in these quadrants. And this is problematic because not only do QRPs lead to underestimated heterogeneity, publication bias does as well. My colleague Hilde Augusteijn has done work on this: especially for small effect sizes and when there is a lot of heterogeneity, the between-study heterogeneity can be severely underestimated. That is problematic because I think it is reasonable to assume that individual studies can be p-hacked and that there is publication bias on top of that. These studies then end up in a meta-analysis, and it becomes very hard to estimate the heterogeneity in the population from these data.
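The mechanism behind the underestimated heterogeneity can be illustrated with a toy example, deliberately simpler than the SAM conditions: when studies that fail to reach significance report their most favorable look at the data instead, the weak studies get pulled upward, the reported effects bunch together, and the estimated tau-squared shrinks.

```python
# A toy illustration (not the SAM setup) of why QRPs shrink the estimated between-study
# heterogeneity: non-significant studies report the best of several extra "looks" at the
# data, which inflates weak studies more than strong ones. All numbers are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(6)

def dl_tau2(d, v):
    """DerSimonian-Laird estimate of tau^2."""
    w = 1.0 / v
    d_fe = np.sum(w * d) / np.sum(w)
    q = np.sum(w * (d - d_fe) ** 2)
    c = np.sum(w) - np.sum(w ** 2) / np.sum(w)
    return max(0.0, (q - (len(d) - 1)) / c)

def one_study(delta, n=50, hack=False, extra_looks=3, alpha=0.05):
    """Reported SMD for one study; optionally with a 'report the best look' QRP."""
    def draw():
        x = rng.normal(delta, 1.0, n)
        y = rng.normal(0.0, 1.0, n)
        t, p = stats.ttest_ind(x, y)
        d = (x.mean() - y.mean()) / np.sqrt((x.var(ddof=1) + y.var(ddof=1)) / 2)
        return d, p, t
    d, p, t = draw()
    if hack and not (p < alpha and t > 0):
        d = max(draw()[0] for _ in range(extra_looks))   # e.g., other outcomes or specifications
    return d

k, n, mu, tau2_true = 200, 50, 0.35, 0.10
deltas = rng.normal(mu, np.sqrt(tau2_true), k)           # heterogeneous true study effects
honest = np.array([one_study(d_true) for d_true in deltas])
hacked = np.array([one_study(d_true, hack=True) for d_true in deltas])
v = np.full(k, 2 / n + mu ** 2 / (4 * n))                # rough common sampling variance of an SMD
print("estimated tau^2, honest studies:", round(dl_tau2(honest, v), 3))   # close to the true 0.10
print("estimated tau^2, hacked studies:", round(dl_tau2(hacked, v), 3))   # noticeably smaller
```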
So, a short summary of the important findings. It is clear that selective outcome reporting and its variants are the most detrimental; those variants are the multiple-conditions and optional-dropping QRPs. Optional stopping has a minimal effect, and this result is in line with previous studies, which again makes sense because you stop as soon as the p-value drops just below the significance threshold. However, this is not an argument in favor of implementing optional stopping in your research, of course, because we have seen that it can be quite detrimental to the heterogeneity estimates. It was surprising to see that outlier removal did not do much, which is not in line with previous research, so we have to check that. And finally, QRPs generally lead to overestimation of the individual effect sizes and of the meta-analytic effect size, but to underestimation of heterogeneity. This is in line with what publication bias does, so that combination is pretty worrying, I would say.

Then, what is next? I am not finished with interpreting all the results that SAM produced, so I still want to look at p-value distributions. I want to investigate the QRPs that we used, for example changing the z threshold in the outlier-removal case, or choosing an alternative way to select the final effect size in the multiple-conditions and optional-dropping scenarios. Of course, this study only looked at the effects of individual QRPs separately, and I think it is reasonable to assume that QRPs can be combined in one study. It is also reasonable to assume that publication bias exists on top of QRPs. I also saw in one of the earlier presentations that there are other QRPs I could investigate, such as rounding a p-value to 0.04 instead of 0.05, and we could investigate the effect of publication bias tests. Like I said, the possibilities are endless for now, and I would love to hear your thoughts and suggestions. Thank you so much for your attention. Let me check the chat real quick. Oh, I see the time. If there are questions, I am going to answer them in the Q&A by text; also, feel free to contact me or any of the other presenters if we run out of time and you still have questions. I am going to give the floor to Professor Eric-Jan Wagenmakers, who will be the discussant for our session. Please take it away, Eric-Jan.

Thank you very much, and thank you to all the presenters for thought-provoking and, I think, depressingly realistic perspectives on the scientific process. I would like to start with some very general observations. We actually do not have that much time, right? The session has another 10 minutes? Okay, that's great. I don't intend to just have a monologue of 10 minutes; I don't think that's my role here. But I do want to say that recently there has been some debate about the value of pre-registration in particular, and I think critics of pre-registration should really watch this session. If you watch this session and then still argue that pre-registration has no use, I don't really understand how that could happen. But the thing is, and Jelte mentioned this, confirmation bias is very strong, and it means that you are all the time on the lookout for information that confirms your prior beliefs. So I doubt that many people who argue against pre-registration would even attend a session like this one, and maybe that is also part of the problem: we base our arguments on different sources of information. I also think it is interesting to ask how this can happen. Obviously some people, especially people who just fake their data, know what they are doing, but I think a large number of people who engage in these practices don't do it intentionally; a lot of it happens unwittingly. As Jelte already said, there is the example where you think you have pre-registered something because it seems so obvious, and in hindsight a lot of things seem obvious. Michael Shermer actually addressed a similar point; he asked the question: why do smart people believe weird things? Oh, one second, I have a four-year-old son who just walked in, only one year younger than Jelte's son, but just as loud, I'm sure. Anyway, Shermer argued that smart people believe weird things because they are so good at finding arguments for why their beliefs should not be challenged. And I think it works exactly the same way in the scientific process: if you come up with a particular analysis strategy and it yields the result that you think should be the right result, you feel validated, and you feel that it is obvious that your choices were not ad hoc, were not chosen specifically to achieve that result, but were entirely reasonable. So I think that is a big problem, with people sometimes being a bit too smart.
In general, I would also say that, particularly before the whole replication crisis happened, you had certain labs that really applied what was called at the time, when nobody was listening, the shotgun approach to science: you have your experiment, you think about what conditions to include, you basically overload the experiment with conditions, then you fire the shotgun, look at the wall, and see what sticks. And obviously, if you combine that with hindsight bias and confirmation bias, it is a recipe for disaster.

I do want to say something about those p-values, because there was actually an interesting question, and a response to it with a link to the paper by Gilbert Schoenfelder, and he also referred to this paper. So I thought, this is interesting, I am going to look it up, and it turns out I was the editor for that paper, so that was funny. But first I will say one more thing, and I think it relates to something Esther was saying about meta-analyses. One additional QRP that often comes into play with meta-analyses is adding covariates or doing subgroup analyses. I had to think of a meta-analysis I was involved in; it was pre-registered, and nothing was statistically significant, but out of the 17 studies, nine showed a positive effect and eight showed a negative effect. One of the reviewers, who was a proponent of this particular finding, argued that what was really important here was to investigate why those nine studies showed a positive effect: what separates the successes from the failures? Of course, there was just nothing going on in those 17 studies, but it does show you the desire that some people have to see the effect in their data. So I really thought part of it was depressing; I always have to fight my way through these talks a little bit, because all these questionable research practices are just so terrible. But I do think it is really valuable to study this systematically, to rub people's noses in it and really show how detrimental these practices are.

So, one more thing about those p-values, and then I think we should hear from the presenters again. I was actually a co-author on that paper proposing to lower the significance threshold to 0.005, and when you read that paper, we were actually extremely careful: it should only be done for new discoveries. I actually disagree that it should only be done for new discoveries, so it was a little bit too meek for my taste. But I also think it is important to realize what motivated that proposal. What motivated it is that if you look at p-values close to the 0.05 boundary, evidentially they mean almost nothing, because the data are just about as likely under the null hypothesis as under the alternative hypothesis. The problem arises because what you are trying to do when you use the usual p-value significance test is make a discrete yes-no decision, and when you have very ambiguous data, I think people should not be forced to make a yes-no decision; they should just say that the data are ambiguous. Everything lower than 0.005, for usual sample sizes, would be good evidence that there is something going on. The problem, of course, is how you assess evidence in favor of absence; obviously the Bayesian approach would be the way to go there. And those p-values lower than 0.05 but higher than 0.005 would usually just warrant the conclusion that we do not have enough information to make a confident claim. So really, I think there is a lot to be said for moving to a situation where you include the category "we do not know enough to make a decision." I do not think any scientist should be forced to make a yes-no decision when the evidence is ambiguous. Okay, so that is my opinion on that.
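A minimal numerical sketch of that evidential point, assuming a default two-sample Bayes factor with a Cauchy(0, sqrt(2)/2) prior on the standardized effect size under the alternative: a result that is just significant at .05 with 50 participants per group yields a Bayes factor near 1, meaning the data are roughly as likely under the null as under the alternative.

```python
# A rough illustration (an assumption-laden sketch, not something from the talk): a default
# Bayes factor for a two-sample t-test, with a Cauchy(0, sqrt(2)/2) prior on the effect size
# under H1, computed by integrating the noncentral-t likelihood over that prior.
import numpy as np
from scipy import stats, integrate

def bf10_two_sample(t, n1, n2, r=np.sqrt(2) / 2):
    df = n1 + n2 - 2
    ncp_scale = np.sqrt(n1 * n2 / (n1 + n2))        # noncentrality = delta * ncp_scale
    like_h0 = stats.t.pdf(t, df)                    # likelihood of t under H0 (delta = 0)
    def integrand(delta):
        return stats.nct.pdf(t, df, delta * ncp_scale) * stats.cauchy.pdf(delta, 0, r)
    # Effect sizes beyond |delta| = 10 contribute essentially nothing to this integral.
    like_h1, _ = integrate.quad(integrand, -10, 10)
    return like_h1 / like_h0

t_obs, n = 1.99, 50                                  # just past the .05 threshold for df = 98
p_two_sided = 2 * stats.t.sf(abs(t_obs), 2 * n - 2)
print(f"p    = {p_two_sided:.3f}")                   # about .049
print(f"BF10 = {bf10_two_sample(t_obs, n, n):.2f}")  # near 1: H0 and H1 predict the data about equally well
```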
So overall, I think it was really wonderful. I think this is excellent material to convince people that we should resort to methods such as pre-registration and multiverse analysis. What I was actually missing a little bit was a multi-analyst approach, because I have the feeling that if you only look at the multiverse, you are still massively underestimating the uncertainty. A lot of research has now shown that if you ask 30 scientists to do an analysis, you will get 30 different analysis pipelines, and in work like fMRI it is even more variable, and you get a lot of different outcomes there as well. So I think we should move to a situation where we pay attention not just to the multiverse but also use a multi-analyst approach. It is all going to take a lot more time and effort and resources, but I think it is ultimately the only way forward. So yes, that was still a long monologue, but I am going to stop here and see if anybody has anything to add, looking at the other panelists.

It is indeed the case; actually, in moral decision-making research, they have these paradigms where they let participants lie in circumstances, all white lies, in which they cannot be checked. And there is some research there suggesting that more intelligent, more creative people lie more, because they can more easily justify it to themselves. So I am actually completely in agreement with you that this is a problem: smart people, creative people, can convince themselves very easily, and indeed we need to make people aware of the biases involved. It all goes back to the belief in the law of small numbers and, you know, the complete failure, even among statisticians, to have a good intuitive assessment of it. The intuition is so far off for most people that you really have to compute these things.

I actually really want to support what you are saying here, because I think that, in particular when you look at statisticians and even methodologists, if you look at what is generally published in, say, Psychological Methods, to me it is rearranging the deck chairs on the Titanic: another little tweak on the ANOVA, another way to compute degrees of freedom in an unbalanced this-and-that, and it has no consequence whatsoever in the presence of this huge bias and this huge lack of transparency. So I really think there is an elephant in the room here, and a lot of people, I am not sure they are even aware of it, because as statisticians and methodologists you are really in an ivory tower where you can pretend that balls are drawn from urns and coins are tossed an infinite number of times, and that is just not what happens in the real world.
I'm sorry, but I'm going to wrap it up; we are out of time. Thanks, of course, to everyone, to the panelists here; thank you for your presentations, and Professor Wagenmakers, thank you for your discussion. Thanks also to the participants, of course. If you want to discuss anything with us further, there was a link in the chat just now, in the Reno chat, and we are of course also easy to find on the internet if you want to contact us. Thank you so much.

Yes, thanks very much to all the panel. We're going to wrap up this session now and end the webinar. I've put the link in the chat to the Reno networking platform and the link to the Metascience 2021 website, so you can find more information about all the sessions and the speakers, and also the Slack channel if you want to carry on these interactions and engagements in the future. The next session will be starting at 12 p.m. Eastern Daylight Time, 5 p.m. UK time, and it is on the uptake of pre-registration, which I think is really topical given what all the panelists have been talking about today. This may be my own confirmation bias speaking, but I think it is going to be an excellent session. So thanks to everyone for listening, thanks to all the panelists, and thanks for attending Metascience 2021.