Okay, thanks everybody. Thanks to BIDS for having me, and thanks everyone for coming. I'm going to talk about work I've been doing with Richard McElreath over at UC Davis. We got to talking about fundamental issues in doing science, broad stroke, and through a series of conversations embarked on the path that led to the paper I'm going to talk about.

So I'm going to open with something that seems a little tangential, which is medical testing for chromosomal disorders. The MaterniT21 blood test is a pretty new procedure; my wife had it when she was pregnant with our baby. It's a blood test that separates fetal DNA from the mother's blood, and it's used as a pre-screening for trisomies like Down syndrome and Edwards syndrome, the extra-chromosome disorders. I read this article, actually in the Boston Globe, but it was the work of the New England Center for Investigative Reporting. They were concerned because they found that about 6% of women who had tested positive for Edwards syndrome were getting abortions without any further testing. Edwards syndrome, which is trisomy 18, is a really horrible genetic disorder. Most fetuses don't make it to full term, and most of those who do don't make it past their first year of life. So if abortion is on the table at all, it's definitely on the table if you're sure that Edwards syndrome is present.

The article was taking the biomedical company that does the test to task, because the company claims a 99% detection rate for this disorder. But the journalist pointed out that when studies followed up with the invasive amniocentesis screening and so on, they found that only about 64% of the women who had an initial positive test actually turned out positive. So clearly there's a discrepancy: they claim 99% accuracy, but really there's only 64% accuracy. The problem is that this equates two things that are not equivalent. The probability of getting a positive result given that the condition is present is not the same thing as the probability that the condition is present given a positive result. I don't think I need to get that deep into it. I went and looked it up: the estimated base rate is that Edwards syndrome is present in about one in every 2,500 pregnancies. So we can throw these numbers into Bayes' theorem and infer a false positive rate of about two in 10,000. And two in 10,000 is great; it means that in 5,000 tests, only one will be a false positive. But that's still enough to give you only 64% certainty when you get a positive test. And there's actually not a ton of data on Edwards syndrome. There's a lot more data on Down syndrome, for which the best false positive rates are more like two in 1,000. And if the true rate is more like that, then the probability that the syndrome is actually present given a positive result is only about 16%.

All right, so why have I bummed everybody out about Edwards syndrome? Because I think there are a lot of parallels to be drawn with the scientific process. As scientists, we use experiments to try to figure out the true state of the world. We might get a negative result and be unhappy, or there might be good news: we get a positive result and our hypothesis is supported.
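As a quick check on those numbers, here is a minimal sketch of that Bayes' theorem calculation. The detection rate, base rate, and false positive rates are the rough values quoted above, not the test's official specifications:

```python
def posterior(base_rate, detection_rate, false_positive_rate):
    """P(condition present | positive test), via Bayes' theorem."""
    true_positives = detection_rate * base_rate
    false_positives = false_positive_rate * (1 - base_rate)
    return true_positives / (true_positives + false_positives)

b = 1 / 2500  # estimated rate of trisomy 18 among pregnancies
print(posterior(b, 0.99, 2 / 10_000))  # ~0.66, close to the ~64% figure above
print(posterior(b, 0.99, 2 / 1_000))   # ~0.17, close to the ~16% figure above
```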
But as scientists we're faced with the exact same problem that clinicians are when they're running these tests. We have a result, and it's an indicator of truth, but all we have is the result, and we're trying to correlate it with the true state of the world. And the clinicians have a further advantage: these tests come out of clinical trials, in which thousands, or at least hundreds, of replications are done, aggregating data from the same test over and over and over again. So the relationship between a result and the presence of the condition can be estimated pretty well. As scientists, we're investigating novel hypotheses, and our tests aren't necessarily as well calibrated. We can estimate things like power and false positive rate, but the base rate, the background a priori probability that a hypothesis is true, is much harder to estimate.

When people get educated about how to do research in grad school, I mean, I was taught that the things to look for are power and false positive rate. We think in terms of type I and type II errors, and we want to minimize both kinds; of course, if we allow no type I errors, then we're very prone to type II errors, and vice versa. This is getting less universal now, but we set alpha to 0.05, we hope for a power of about 0.8, and that does all right for us. The thing we generally ignore is the base rate: the probability, before you've done any studies, that what we're going to test might be true.

So, a toy example, with a power of 0.8 and an alpha of 0.05. Most scientists, I think, or at least most social scientists, would be very willing to accept these kinds of numbers. Imagine we have a hundred hypotheses that we're testing and only 10 of them are true: a base rate of 0.1, which is pretty good. If one in 10 of the relationships we think might be present is actually there, that's not bad. So we do all our studies; 80% of the true hypotheses come back positive and 5% of the false ones come back positive, but there are still way more false hypotheses than true hypotheses. In this case, only about 62% of our positive results are actually true, which means almost 40% of the things we have positive results for are actually false things we now believe because we got a positive result. This is not terrible, but it is a discrepancy, and these are conditions that are pretty optimistic about the state of science; still, only around 60% of our initial positive results should we expect to reflect true states of the world.

Everything I've said is basically in the paper John Ioannidis published 10 years ago, "Why Most Published Research Findings Are False." He was specifically addressing the medical community and things like whole-genome association tests, where you're testing 100,000 SNPs for a relationship between genes and some medical condition, and you might expect that only five or 10 genes are actually associated with it. In that case the base rate is one in 10,000, which is a lot less than one in 10.
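The arithmetic of that toy example, plus the genomics case, fits in a few lines. A sketch; the ~64% it prints is in the same ballpark as the ~62% quoted above, which was presumably computed with slightly different rounding:

```python
def positive_predictive_value(base_rate, power=0.8, alpha=0.05):
    """Fraction of positive results that reflect true hypotheses."""
    return power * base_rate / (power * base_rate + alpha * (1 - base_rate))

# Toy example: 100 hypotheses, 10 true -> 8 true positives, 4.5 false positives.
print(positive_predictive_value(0.1))   # ~0.64
# Whole-genome association numbers: a base rate of one in 10,000.
print(positive_predictive_value(1e-4))  # ~0.0016: almost all hits are false
```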
And this is a graph which isn't in that paper but should be. The x-axis, on a log scale, is the base rate, our assumed background rate of true hypotheses in the field we're testing, and the y-axis is the post-study probability that the hypothesis is true given a positive result, assuming a false positive rate of 5%. Throughout this whole talk I'm going to be using terms like "positive result" and "true" in a very binary way, which is clearly an oversimplification, but as a first-pass model of science it's what I'm going for. What's clear is that there's a thin gray line through the middle of that graph, which marks 50% probability. And what you can see is that even when the power is 99%, or even 100%, where we always get a positive result for true hypotheses, if our base rate is below about 5%, then most positive findings will be wrong; they'll be attached to false hypotheses.

And there's good evidence that the base rate is actually pretty low in a lot of fields. There are false positives that crop up again and again, and many of you are probably familiar with some of these. On the upper left is the dead salmon experiment, where they put a dead salmon in an fMRI scanner, showed it neutral and emotionally valenced images, and got a statistically significant BOLD signal right in the middle of the salmon's brain. That salmon had been dead for quite some time. Then there's Daryl Bem's 2011 paper in JPSP, where he presented evidence for precognition: an amazing paper in the sense that it is, in every sense of the word, literally unbelievable. And then there was this paper in Nature, a kind of perspective piece, where Begley and Ellis reported that Amgen, the biotech company, tried to replicate 53 of what they considered landmark studies in oncology and hematology, and 47 out of 53 failed to replicate. The initial studies were published in places like Nature, Science, and Cell. So we know that there are a lot of false positives out there.

Some of you may have seen this just recently: the Many Labs group took a whole bunch of well-known, well-cited psychology experiments. What's great about this paper is that they were looking for what some psychologists call the semester effect, the idea that results should differ depending on whether subjects are tested at the beginning or the end of the semester, for whatever reason you can come up with. They found very little evidence for a semester effect, but they also found very little evidence for most of the main effects these studies reported. At the upper right, at the top, is the Stroop effect; Stroop is rock solid, Stroop totally replicates. For most other things, well: the green triangles here are the reported effect sizes in the initial published studies, and they got 20 psychology labs to independently replicate each of these studies. The X's are each of those labs' results, the blue dot is the mean and standard error of those results, and the majority of the studies failed to replicate.

All right, so there are a lot of false positives, and clearly I'm heading toward replication; this is not an impossible situation, right? So what about replication? A lot of people are talking about this, including this group here.
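The headline numbers in that graph are easy to reproduce. Here is a tiny sketch that sweeps the base rate on a log scale, as on the x-axis just described, and finds where each power's curve crosses the 50% line:

```python
import numpy as np

def post_study_probability(base_rate, power, alpha=0.05):
    return power * base_rate / (power * base_rate + alpha * (1 - base_rate))

base_rates = np.logspace(-4, 0, 1000)  # the graph's x-axis, on a log scale
for power in (0.5, 0.8, 1.0):
    curve = post_study_probability(base_rates, power)
    crossing = base_rates[np.argmax(curve >= 0.5)]  # first base rate past 50%
    print(f"power {power}: positives are mostly true only above b ~ {crossing:.3f}")
# Even at power 1.0 the crossing sits at b = alpha / (alpha + power) ~ 0.048,
# which is the "below about 5%, most positive findings are wrong" line.
```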
We just have to replicate things, right? It's really important to replicate; this is how science advances. And it's been nice to see that replications are starting to get published, most of them in places like PLOS ONE, but also in some classic Elsevier journals, psychology journals, and other mainstream journals. I really like the one on the left, Brown-Schmidt and Horton, because they published a failure to replicate their own study. So we just need to support replication and things will be better.

But it's not that easy, because people are very concerned about exactly how we're going to do replication. How are we going to publish replications? How are we going to assess them? Basically there are two issues here, which are kind of two sides of the same point. One is that replications don't get published, and negative results don't get published either: people aren't publishing things when they fail to find a positive result. The papers at the upper left and lower right came out pretty recently, saying that people don't publish negative results, and increasingly so; they used to publish more negative results, and they publish fewer and fewer now. And the Franco et al. study that came out last year indicates that this is not happening primarily at the level of the journals; it's at the level of the labs. They're not even submitting negative results.

Then you have other people basically saying, well, if a study fails to replicate, that's not a big deal and we shouldn't worry much about it, because replications tend to be lower powered than the initial study, they don't have all the lab expertise of the initial study, and they might be doing things a little bit differently. This is certainly true in, let's say, biochemistry, where you have to mix very careful solutions or maintain particular lab conditions. There's a social psychologist, Jason Mitchell, who got some press for putting an article on his website where he claimed that failures to replicate are absolutely worthless, serve no point in science, and should never be published. You should read his article; it's still on his website. Actually, I can back him up a little even though I disagree with him: I think he makes a very careful argument that is based on some wrong assumptions. He's clearly not a stupid guy. And on the lower left is Mina Bissell, who's here at Berkeley, and I think she makes a very reasonable case for why some replications would have lower power than the initial studies.

So with all of this background in mind, this brings us to the questions we tackled in our study. One, what is the evidential value of replication? Two, what is the impact of communication and publication bias under different strategies: should we publish everything, or only positive replications? Three, if science is an invisible hand that self-corrects over time, how much replication do we actually need? And four, how bad is it if replications are underpowered relative to initial studies; are they still worth something? So I'm going to take you through a schematic of the model we're working with. This is a mathematical model that Richard and I built, and it starts with a cycle of scientists.
We have a population of scientists, and step one is that a scientist chooses a hypothesis to study. The scientist either takes a totally new, novel hypothesis, or replicates a previous result with probability r, the rate of replication. We get into different kinds of replication later, but we start with this. Of the novel hypotheses, some will be true and some will be false, and the base rate is the probability that they're true. And of the tested hypotheses, some are actually true and some are actually false, but we don't know their true epistemic state. All we know is the results of our studies, which accrue. We assume that scientists can keep track of the difference between the number of positive and negative results: they keep a tally, which gets incremented by one every time a positive result happens and decremented by one every time a negative result happens. That's the strong assumption about how our fictional scientific world works.

After choosing a hypothesis, there's an investigation, and the real truth of the hypothesis interacts with the probability of a positive or negative result. [In response to a question:] No, the size of the circle there, it's harder to see here, but there are rings around each circle that just indicate the most recent result: a red ring for a negative result, a green ring for a positive one. The novel hypotheses have a gray shell, which means we have no idea about their epistemic state, and for the tested hypotheses we have the results that have been published. In the model itself, the researchers don't know the absolute number of studies that have been done; they only know the number of positive results minus the number of negative results. And it turns out mathematically that this doesn't matter, because what we're after is the relative probability of truth versus falseness at each tally, not a probability distribution over all possible tallies. I can talk more about that later.

So there's an investigation step: with power one minus beta, a true hypothesis gets a positive result; with false positive rate alpha, a false hypothesis gets a positive result; and the inverses of those for negative results. After the result, there's a communication stage. As a baseline, we assume all novel positive results are communicated, that is, published. Everything else is published with some probability related to its state: whether it's a novel result or a replication, and whether it's a positive result or a negative result. You can see here that sometimes novel negative results go into the file drawer, where people aren't publishing them. Only communicated results transfer new information to the set of tested hypotheses, and there's a set of c parameters for those communication probabilities. So these are the basic dynamics of the model, and it just repeats over time. [In response to a question:] No, negative novel results get published with probability C sub N minus. On the very left are the positive novel results, and the dashed lines are not communicated while the solid lines are communicated.
But all the other results, the ones that aren't novel, the replications, attach to some previous knowledge we have about the hypothesis, whereas for the negative novel results that don't get communicated, we learn nothing about that hypothesis at all.

We developed and solved this model fully analytically, using standard techniques from population dynamics. Just to throw some equations up here, which are in the supplement of our paper that you can look through at your leisure: we get a recursion equation for, for each kind of hypothesis, the number of hypotheses that have tally S, where the tally is the number of positive results minus the number of negative results. And we can use this to get steady-state equations for the long-run relative frequencies of each kind of hypothesis at each tally. The binomial coefficient in there is basically counting all the different ways of getting to a particular tally in however many time steps.

So the first question we can ask is: how many results with tally S are actually true? A tally of one might mean one positive result, or it might mean two positive results and one negative. As a simplified model of science, how good is this? There are a lot of numbers here, seven graphs that I'm not going to go through in detail; again, this is in the paper. We looked at two scenarios, an optimistic one and a pessimistic one. Our optimistic scenario is what I presented before: base rate of 0.1, power 0.8, alpha 0.05, and high communication rates, where most findings are published. Our pessimistic scenario is probably more like a lot of research: a base rate of one in a thousand, power 0.6, alpha 0.1, and most things not published; only about 20% of non-novel positive findings are published. I want to say, before this, that once the paper is published, the Mathematica notebooks will be online as a supplement, so any particular parameter values you can imagine can be plugged in and these graphs regenerated.

So let's start with the optimistic scenario. In the optimistic scenario most of these lines are flat; the things that matter most are the base rate and the false positive rate. What immediately jumps out is that the base rate matters a lot, and under a lot of conditions you need more than a tally of one to have any sort of certainty that your hypothesis is actually true. Each line here is a tally: tally zero, tally one, two, three, four. The optimistic scenario is, well, pretty optimistic, and it's a good scenario. As the false positive rate climbs, though, we can of course be less certain. And there is very good reason to believe that in a lot of research the false positive rate is quite a bit higher than 0.05; at least in psychology, people like Simmons and Simonsohn have demonstrated that there's probably quite a high false positive rate in that field.
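For anyone who would rather simulate than work through the steady-state equations, here is a rough Monte Carlo sketch of the cycle described above. This is an illustrative Python sketch, not the actual notebooks from the paper; the parameters (b, r, power, alpha, and the c communication probabilities) follow the talk, while the function name and the detail that uncommunicated novel negatives simply vanish are simplifications:

```python
import random

def simulate(steps, b, r, power, alpha,
             c_novel_neg, c_rep_pos, c_rep_neg,
             rep_power=None, seed=1):
    """One long run of the cycle: choose, investigate, communicate.

    Each tested hypothesis is a [is_true, tally] pair, where the tally is
    published positives minus published negatives. Replications may use a
    different power than initial studies (defaults to the same).
    """
    rep_power = power if rep_power is None else rep_power
    rng = random.Random(seed)
    tested = []
    for _ in range(steps):
        if tested and rng.random() < r:
            # Replication: pick a published hypothesis at random.
            hyp = rng.choice(tested)
            positive = rng.random() < (rep_power if hyp[0] else alpha)
            if rng.random() < (c_rep_pos if positive else c_rep_neg):
                hyp[1] += 1 if positive else -1
        else:
            # Novel hypothesis, true with probability b.
            is_true = rng.random() < b
            positive = rng.random() < (power if is_true else alpha)
            if positive:                      # novel positives: always published
                tested.append([is_true, 1])
            elif rng.random() < c_novel_neg:  # novel negatives: mostly file-drawered
                tested.append([is_true, -1])
    return tested
```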
In our pessimistic scenario, the base rate and the false positive rate are again the things that matter most, and here we often need three or four or five positive results before we have a better-than-chance probability that the hypothesis is actually true. We need a lot of positive results to accrue before we can really be certain about a finding. Communication also matters in this scenario; the lines are less steep, but they do matter, and I'm going to talk about that a few slides from now. [In response to a question:] Yeah, so a tally of zero, there are a few ways to get it, but you need at least one positive result and at least one negative result; nothing in this set has zero communications. And on the previous slide, you'll actually see that tally zero falls as the power increases, because if the power is very high, then most things that get a negative result will actually be false.

Okay, so let's talk about communication in a little more detail. Richard and I came up with an analogy for the scientific process, not just doing experiments but also communicating them, as a sort of epistemological chromatography. For those of you familiar with chromatography, it's a process by which the different components of a mixture are gradually separated out, through various chemical or physical means. What science is basically trying to do is this: we have a whole bunch of hypotheses, we don't know which are true and which are false, and we're trying to separate the true ones from the false ones, gradually. So this graph has three different quantities on it. The blue is what we're calling precision, which is the same as before: the probability that something with a given tally is actually true. But it's not really that helpful if I tell you that all the hypotheses with a tally of five are true when no hypotheses actually have a tally of five. So the orange line is what we're calling the sensitivity, which is basically where the true hypotheses are: the probability of a given tally, given that the hypothesis is true. And the specificity is the reverse of that; it's where the false hypotheses are.

So we can look at a first-pass publication strategy. I'll remind you that in all of these cases, novel positive results are always communicated. In this case, we only publish positive results beyond that: positive novel findings and positive replications. This is Jason Mitchell's dream world. And it's actually a fairly terrible scenario. We get decent precision only at very high tallies, six or seven. The sensitivity is really bad: most of the true hypotheses are sitting at a tally of one. And the specificity is concentrated around one too: most of the false hypotheses are at one, because we're not publishing failures to replicate. Things that fail to replicate just don't get published, so we get a lot of hypotheses with one positive result that are actually false.

Just for completeness, we did the reverse case, publishing only negative replications: only failures to replicate, plus positive novel results. And this is a terrible situation. The precision here is basically nothing; we don't learn anything about what's true, and most publications at any tally level will be false. Don't do this.
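Those three quantities are easy to read off a simulated population. Here is a sketch, using the output format of the hypothetical simulate() from earlier:

```python
from collections import Counter

def tally_statistics(tested):
    """Precision, sensitivity, and specificity at each tally.

    tested: list of [is_true, tally] pairs, as produced by simulate().
    Assumes the run produced at least one true and one false hypothesis.
    """
    true_tallies = Counter(t for is_true, t in tested if is_true)
    false_tallies = Counter(t for is_true, t in tested if not is_true)
    n_true = sum(true_tallies.values())
    n_false = sum(false_tallies.values())
    stats = {}
    for tally in sorted(set(true_tallies) | set(false_tallies)):
        nt, nf = true_tallies[tally], false_tallies[tally]
        stats[tally] = {
            "precision": nt / (nt + nf),  # P(true | tally)
            "sensitivity": nt / n_true,   # P(tally | true): where the trues sit
            "specificity": nf / n_false,  # P(tally | false): where the falses sit
        }
    return stats
```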
So the best scenario we came up with is what we call screen and check: don't publish novel negative findings, but publish all replications, whether they're confirming or disconfirming. What you see here is that the precision is quite a bit better, higher than 50% at a tally of three, and the sensitivity is not bad. There's a kind of neat finding here: if you suppress positive replications, you actually increase the precision, but at the cost of sensitivity, and that's just because you're keeping true hypotheses at low tallies. So of these three strategies, this is basically the best, because we get the highest precision and the highest sensitivity at lower tallies.

And finally there's total communication, the publish-everything case, and this is not as good. The precision is about the same as screen and check, but the sensitivity is way worse, basically because our journals get flooded with negative results. Nobody would really suggest doing this, but as a complete exploration of the model: what if we just published every single hypothesis that was ever tested? Most hypotheses are false. [In response to a question:] Right, you're absolutely right, some people have suggested we should publish every negative finding. The problem is that most hypotheses are actually false, and I think there's good evidence that that's the case, so you're just going to have journals cluttered with true negatives. Maybe if there were a separate journal for only those kinds of results, we could put them aside and concentrate our search on the journals that aren't that. And if what I said earlier came across as suggesting otherwise, then I apologize, because that's not what I meant. This is definitely within the same formal structure.

What this is basically saying is that when we look for where the true hypotheses are, they're mostly at a tally of one, because occasionally we'll pick a true hypothesis, test it, and get a positive result. But most things we choose to test as novel hypotheses will be false, and now we've cluttered the whole landscape of communication. This assumes that the whole corpus of scientific results is one big pool that we're drawing from. [In response to a question:] The gray line is where the false hypotheses are. Yeah, in the gray line, most of the false hypotheses are at negative one, where they should be, at least. But the thing is, if the false positive rate is even somewhat significant, like 0.05, and our base rate here is assumed to be one in a thousand, then most positives will still be false positives, which is the case in a lot of fields, I think. Which is why we need replication, and why we need, well, I'll talk about that at the end. There are ways to get the base rate up and the false positive rate down to make this situation better. Well, even some physicists. And again, the code for all these graphs will be online, so different values can be plugged in and the graphs regenerated.
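In that spirit, here is roughly how the strategy comparison might look in code, reusing the hypothetical simulate() and tally_statistics() sketches from above. The parameter values are illustrative, drawn from the pessimistic-flavored scenarios discussed in the talk:

```python
params = dict(steps=200_000, b=0.001, r=0.2, power=0.8, alpha=0.05)
strategies = {
    "positive results only": dict(c_novel_neg=0.0, c_rep_pos=1.0, c_rep_neg=0.0),
    "screen and check":      dict(c_novel_neg=0.0, c_rep_pos=1.0, c_rep_neg=1.0),
    "publish everything":    dict(c_novel_neg=1.0, c_rep_pos=1.0, c_rep_neg=1.0),
}
for name, comms in strategies.items():
    stats = tally_statistics(simulate(**params, **comms))
    precisions = {tally: round(s["precision"], 2) for tally, s in stats.items()}
    print(name, precisions)  # precision by tally, per strategy
```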
All right, so the last big result addresses the objection that replication is all fine and dandy as long as the replications have the same power as the initial studies. But what if they don't? What if replications have smaller sample sizes, or less lab expertise, or, as some people seem to suspect, the people doing replications are out to get the people who did the initial studies? What then?

So again, this is from Jason Mitchell, who assumes that false positives are generally the result of p-hacking, fraud, or some other form of scientific misconduct, which totally ignores the fact that a low base rate will generate lots of false positives even when everyone has the best of intentions and uses the best methods available. But he poses this question: if a replication effort were to be capable of identifying empirically questionable results, it would have to employ flawless experimenters; otherwise, how do we identify replications that fail simply because of undetected experimenter error? This is an important question, and any given replication needs to be subjected to at least as much scrutiny as the initial study, if not more. But at the population level, how do underpowered replications affect the corpus of scientific knowledge?

So here's an example with the same parameters as before: a base rate of one in a thousand, an alpha of 0.05, and a 20% replication rate, which is probably pretty high. Positive replications are always communicated and novel negatives never, so this is a screen-and-check kind of scenario. We assume high power for the initial studies, 0.8, but that the replication studies have a much lower power of 0.5. How does this affect things? The top graph is the case where we publish all the failures to replicate; the bottom graph is where we publish very few of them, only 10%. You can see that in the top graph, the precision is quite a bit higher, with only a marginal decrease in sensitivity. At a tally of three in the full-communication case, you have around an 80% chance that you're looking at a true hypothesis, whereas you need about a tally of five in the low-communication case to get the same precision. So at least in terms of the model, the results indicate that suppressing negative replications is not a good thing for science at the population level. We should in fact be publishing failures to replicate, because even when they're underpowered, they still convey non-trivial information about the truth of the hypothesis.

Okay, so some take-homes from this work. It is clearly not controversial that replication is important, but we need to remember that replications are vulnerable to the same things initial studies are: low power, high false positive rates, low base rates. Many replications might be needed, especially if negative replications are suppressed one way or another. I think this also speaks to the concern a lot of people have raised that the replication movement is dangerous because it ruins people's careers: people make their careers on a big result in a big journal, and then a failed replication really hurts them.
Part of the response to that is: too bad. We're scientists, and we're after the truth, not prestige for prestige's sake. The other part is that we should probably stop rewarding initial novel results quite so highly, and should treat most novel results skeptically, with an appropriate degree of uncertainty and caution. Again, low base rates and high false positive rates are the most important threats to the effectiveness of research across the board, and this suggests we should be focusing our efforts there, which to a large degree is being done, though depending on your field, more or less effort probably should be put into it.

So a lot of people have been talking about preregistered data analysis: putting our hypotheses on record before we do the study. I don't think this should be required for every study, because exploratory analyses are really valuable, but it really helps in lowering the false positive rate, because we know what we were actually looking for. In the long run it will also help us estimate base rates in a number of fields, which is something we're really interested in doing, but that data is really hard to get at, because most people don't tell you what their hypotheses were before they did the research. Also, quality theorizing. In fields like physics this is pretty well appreciated; in fields like social psychology, maybe less so. As a theorist, I'm biased, because I think what I do is important, but when we base our experiments and data analyses on well-supported, well-validated, and logically sound theories, we're much more likely to start out with a good prior probability that our hypothesis is true.

[In response to a question:] Well, not necessarily. One flavor of preregistration is the idea that there's a guarantee of publication regardless of your results. But you could publish, or let's say put in a database, what the hypothesis was without guaranteeing a journal publication of the results. I was just looking at this database, TESS, Time-sharing Experiments for the Social Sciences. It's an NSF-funded program where researchers have to propose the hypotheses they want tested in order to get shared experiment time; they propose to the funding agency, then they test the hypotheses, and there's no guarantee it'll be published. I think stuff like that is really valuable. And honestly, there are pros and cons to preregistration: it's neither a perfect solution nor the only solution to getting the false positive rate down, but it's one that's been suggested, and I think it has some merit. And the other point, yeah, I just went on about how important theory is. Also: suppression of initial negative findings might be a good thing if most hypotheses are in fact false, but suppression of negative replications is probably a bad thing.
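That underpowered-replication comparison is easy to rerun with the sketches from earlier: the same screen-and-check setup, a separate replication power of 0.5, and either all failed replications published or only 10% of them (illustrative values taken from the scenario described above):

```python
for c_rep_neg, label in ((1.0, "all failed replications published"),
                         (0.1, "10% of failed replications published")):
    tested = simulate(steps=200_000, b=0.001, r=0.2, power=0.8, alpha=0.05,
                      c_novel_neg=0.0, c_rep_pos=1.0, c_rep_neg=c_rep_neg,
                      rep_power=0.5)
    stats = tally_statistics(tested)
    print(label, {t: round(s["precision"], 2) for t, s in sorted(stats.items())})
# The precision should climb faster with the tally when failed replications
# are published, as in the top graph described above.
```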
Yeah, I mean, I think what that speaks to is a more nuanced view of the scope of results: if our database of results is different from our database of hypotheses, then that might not be a conflict. But within the scope of the model, there's no such thing as a pre-registered hypothesis as its own kind of result; that's an outside-the-model suggestion rather than an inside-the-model suggestion. The last point is definitely an inside-the-model suggestion.

Okay. So that's what I wanted to say about this model. With the last couple of slides, I just want to talk about some other stuff related to this that we're working on and thinking about. One is trying to estimate the base rate, which is really hard to do. We sent this paper out for review, and the first round of reviews came back with a reviewer saying, well, you guys really need to estimate the base rate. And we were like, that's literally impossible. And the editor said, well, if you can't estimate the base rate, then we're not going to publish it. I don't know what that counts as; it's theory, so it's not really a negative or a positive result, but problems in publication are another issue I won't get into. The base rate is probably lower than a lot of people think, and I don't want to single anybody out, but hopefully, especially after this talk, you're thinking about the possibility that your field might have a low base rate, at least sometimes. The base rate, I want to say, is a very broad simplification over a field: different labs study different things at different times, and a given lab might have its own overall base rate, a given researcher might have their own, and clearly some hypotheses are more well-motivated than others. But as a broad-stroke, model-level simplification, I think it's pretty reasonable to ask: what is the a priori probability that a hypothesis in field X is true? Which is exactly what medical researchers do when they ask what the base rate of condition X is in a population, even though we know that individuals and different demographic groups have different probabilities once we take that information into account.

There are other issues too. Not all hypotheses are actually reported: people write their papers based on the data they have, and they don't always tell you what their hypotheses were before they did the analysis. There are deliberate things like fishing and p-hacking, as well as what Andrew Gelman has been calling researcher degrees of freedom, which is basically the idea that if you choose what analysis to do after you've seen the data, that's going to skew the kind of analysis you do. And generating good hypotheses is difficult, and not just in the social sciences, even though I like to pick on them, because that's who I work with. Things like phlogiston, the substance that causes fire; the luminiferous ether; spontaneous generation; the miasma theory of disease, the idea that bad air is what caused cholera: these hypotheses persisted for quite a while before they were overturned. So base rate estimation is an important challenge for data science, and something I'm just starting to work on now.
We also have the idea of taking this modeling framework to help us think about the incentive structures in science, because a lot of the problems, things like false positives and uncommunicated negative results, come from the fact that there's very little incentive to publish negative replications or to do extremely rigorous work that takes years, when there are more PhDs than there are jobs, and when hiring and tenure decisions are sometimes made on the number of publications rather than the quality of the work, at least for young researchers. What I'm going to show you was done, like, this week, so it's at a very, very early stage. But the idea is to take the framework of the original model, scientists running this process, and make a whole population of labs, where each lab has its own methods. We have this idea that power and false positive rate are related: the higher your power, the higher your false positive rate, and how close that tradeoff curve gets to the lower right corner is measured by an effort parameter. It's a simplification; it's where we're starting. Labs get payoffs from publications, especially novel findings, but they take a penalty when some other lab fails to replicate one of their results. And this is an evolutionary simulation: successful labs create progeny labs, and unsuccessful labs fizzle and die out, leaving no new researchers or successful grad students to go start labs of their own. Basically the only result I have from this so far is that with no replication, the false positive rate climbs and climbs, as labs are incentivized simply to put out more output; once there's replication, and those outputs can be overturned, that creates a check on the false positive rate. So that's the broad direction we're going in next. Don't read too much into these results; they're very preliminary.

So yeah, in conclusion, I love this; it's from The Onion, just in case anybody thought it was a real article. But it's funny because it's true: science is really, really hard. The joke in the article is that things like quantum mechanics and molecular crystallography are really difficult subjects, and that's totally true. But in any field, coming up with really good novel hypotheses is hard, and testing those hypotheses rigorously, so that our results generate the kind of feedback that helps us confirm or disconfirm them in a meaningful way, is hard. And I think a lot of training glosses over just how hard this is. That's the last point I wanted to make. And that's it, thank you. Yeah, that's fine. Are there any questions?

[Audience question:] Okay, so thinking about how your model maps onto reality, my biggest concern is probably around the selection of hypotheses. It's almost certainly not some sort of uniform draw from the space of all possible hypotheses; in fact, there's a lot of selection that goes on, in terms of families of hypotheses. As you get into philosophy of science, you'd call it a paradigm or something like that. I'm wondering if you've thought about that, and whether that kind of issue would complicate the use of the model.
Yeah, so the question is about the fact that in the model, hypotheses exist independently of one another, whereas in reality, hypotheses are often linked through a broad theoretical framework, or parts of them are supported by parts of others. It's definitely something we thought about, and the idea was to make this model as simple as possible, because this kind of model is pretty new; this is the first model to look at these questions in particular. But we have definitely thought about a model where hypotheses exist in some n-dimensional space, where they can be nearer or farther from each other, cluster in certain ways, or even be connected in some network structure. There was actually a neat paper by Jacob Foster and Carl Bergstrom on cultural holes in scientific fields that gets at this, in a broad-stroke way: connections and disconnections between different research areas. In terms of the robustness of our results to that, I think what you get out of it is at least two things. One is linkage of base rates, powers, and false positive rates within clustered fields. The other is information transfer between findings, so that your results from one study influence the probability that a different hypothesis, one you weren't testing, is true. The exact relationship is something I can't answer from our results.

There's another thing you brought up, which is the non-random selection of hypotheses, especially with regard to replication. And that, at least in a simple fashion, is something we did look at. In the baseline model, we assumed that replications happen by looking at the body of published literature and picking something at random to replicate. And we thought, well, people shouldn't replicate things they're already certain about. If something has five positive replications, it's not really worth my time to go replicate it, or I might think it isn't, anyway. Whereas something with one positive and one negative result is definitely worth replicating. Or something with just one positive result: I don't know what that is, so I should replicate it. So we did look at that, and this is my one extra slide: targeted replication, where with some probability a replication is drawn from this targeted region, the things with low certainty. That does increase the sensitivity quite a bit. It doesn't actually increase the precision, because it doesn't change the probability of something being true given its tally; it changes the probability of something having that tally. And it pushes the true hypotheses toward higher tallies, which is a good thing. This is using screen and check, and, let me see if I can remember, yeah, the solid lines are targeting and the dashed lines are not targeting. So anyway, we did that.

[Audience comment:] That makes sense. One solution some fields have come up with, and actually benefited from, is to deliberately inject false signals that nobody knows about, let everyone work on them up to the point of publication, and then reveal whether or not each one was actually real. [Response:] That's a great idea.
Yeah, I didn't touch on that, because I actually wasn't aware of it; that's a great example. It does remind me of something I've been looking at a little bit. Cosma Shalizi, the statistician and complexity theorist, has a blog post on what you might call a neutral model of inquiry. It's a play on the neutral model of evolution: how would we expect gene frequencies to drift when there's absolutely no selection? The idea here is: what if you have a field where nothing is true, where all hypotheses are wrong? Like parapsychology. I'm going to assert that all hypotheses in parapsychology are probably false; we can argue about this later. I know, stop it! Get out of my mind! So the idea would be, okay, take a classical p < 0.05 threshold. In that case, 5% of initial studies will get a positive result. As we do replications, some of those will continue to get false positives, but most will be correctly shown to be false, and as we do meta-analyses, these should drop out of the literature as viable hypotheses. So what is the distribution of life spans of these hypotheses? Then we can compare that to a particular field and ask: how far off is it? We've been looking into maybe trying to do this with parapsychology. But the idea of asking what the effect of false signals is, yeah, that's a great kind of tool for checking your methods. It would be tough in a lot of fields, or let's say tougher in some fields. You could certainly inject noise into a data set, or give people random data sets, or whatever. I'd have to think about it some more; you should probably all think about it some more. That's definitely really good food for thought. So thank you.

There was a question in the back. [Question about true hypotheses that go undiscovered.] So, unless our power is 100%, there are always going to be true things that we think are false, or that we don't discover. Even if our power were 100%, our hypothesis selection is not perfect, so there will be things we haven't discovered yet. That's basically covered by the sensitivity and, I can't remember exactly what we call it, the specificity terms in the model, because we're just saying: here is where the true hypotheses that have been tested are, these are the tallies they have. We, as the modelers, know whether or not a hypothesis is actually true; the researchers in our model don't, they just know the tallies. So we're saying, under these conditions, this is the probability that a false hypothesis has a tally of such-and-such, and this is the probability that a true hypothesis has a tally of such-and-such. And there's a non-zero probability that true hypotheses have negative tallies, in most of the conditions in our model.
So this is almost like the difference between evolution, descent with modification, and abiogenesis, where life came from: they're separate questions. Hypothesis selection is a really important thing; we want to discover all the truths about the world. But once we have a hypothesis that some people take seriously, then we want to evaluate it. They're both important questions, and our model really only looks at the second one.

So the question was, basically, and correct me if I'm wrong: what we've looked at here is the steady-state, ideal case of the system evolving to a point where this is the level of true hypotheses versus false ones, and where they tend to sit on average, and you're saying that in the real, dynamical system of science, the number of true or false hypotheses viewed with a given degree of certainty will fluctuate. Is that basically right? Okay. Well, in this case the researcher isn't doing any subjective assessment of probability; the probabilities here are true probabilities given a random draw from all hypotheses with a given tally. I was just reading an article by John Ioannidis where he talks about the ebb and flow of the rigor of science: how good various fields are, the rate at which we separate true hypotheses from false ones or come up with better or worse theories, is not necessarily a linear progression. There are stops and starts, boosts and fallbacks. And this model, at least, doesn't cover anything like that, but yeah, it's definitely important.

[Audience question:] But hypotheses also usually differ in their impact on people, insert any number of things here. So far you're just saying science is trying to find the truth, but sometimes that's not all it cares about. In practice, that's just how funding works, and there are going to be more people looking at hypotheses that are going to be useful, that are obvious next steps forward from what's currently available. So do you see, along the lines of what you said before, a way to build a research program, things like that, into the hypotheses, within a single area but also between areas?

Yeah. So the question is about, well, let me just say that everything you said is stuff we have talked about at length and thought about. The question is related to variation between hypotheses, not just in truth and falseness, but in utility for society, and also in surprise level. Science and Nature are interested in publishing useful findings, yes, but also surprising findings. Something like, well, we always thought that was true and we just didn't have a good proof: that often doesn't get into Nature and Science, unless it's something that really changes the game. If it turned out that Tyrannosaurus could fly, that would probably be wrong, but it would probably still get into Nature.
So yeah, both of those things do change the dynamics of what gets published where and what gets funded. And to bring it back to the kinds of things we looked at in this model, I think that kind of thing can change the base rates of truth and falseness, when we direct our searches toward, let's say, where the funding dollars are, or what we think is going to be the big sell. I think it can change a lot of things, but there isn't an easy parameter switch in this model that captures the kind of dynamic you're describing. For the incentives model, one thing we did talk about, and again, we're still exploring possible models for analysis there, is the utility question: do we try to publish something that's really surprising? Do we submit it somewhere really high impact, given a high probability of rejection but a possible high payoff? Do we just do a new study rather than do all the revisions the reviewers requested? These things are going to affect the dynamics of what gets published, where, and when, and that's something we've thought about but haven't looked at yet.

Well, on the surface, it does seem to imply that, and that might be true. Certainly the sample is skewed: Retraction Watch, the website, tends to feature things from Nature and Science and PNAS and Cell quite a lot. But those are also the things that tend to be targeted for replication, the high-impact results, so there's a skew there. Are these findings more likely to be false because they're actually more likely to be false, or is it just that that's where we're looking, where we're concentrating our replication effort? Probably they are more likely to be false than those in, let's say, high-quality but field-centric journals, because the big journals place a higher value on surprise.

[Partly inaudible audience question, about financial incentives, for instance an industry analysis of a cancer product.] Yeah, the answer to that question is clearly yes, right? If we're talking about situations where there are financial incentives in private industry toward something being true, I mean, look at climate change. Oh yeah, absolutely. Anecdotally, we hear all the time about people reviewing papers and trying to get them rejected when they disagree with the reviewer's own hypothesis. So yeah, it's a problem. What you're basically describing is someone having a vested interest in something not being shown to be the case. With a novel hypothesis, in the idealized model version, there's zero information about it yet. So the chance that somebody is basing their livelihood on the assumption that such-and-such is the case, when there's no evidence yet, seems like it would be a weaker pull than someone having a vested interest in something that has been shown to be the case, or might be the case; they might base their livelihood on that one finding, or series of findings.
And yes, of course, if you can point to evidence that your business is based on a reasonable model, everybody wants to be able to do that, or many people want to be able to do that. So yeah, I think the answer is yes: there probably is, in some cases, more of a problem with incentive structures like this influencing replications than influencing novel findings. But there are probably counterexamples too, so I can only speak generally. Okay, well, thanks so much. This was fun.