Hello. Thank you for the opportunity to talk today. As I understand it, my role here today is to talk about what the field of metascience is and how best to grow it.

To start with, here are a couple of definitions of metascience that are in current circulation: "the scientific study of science itself" and "the science of doing science". At the risk of starting the talk on a sour note, I'm going to say I don't like these definitions very much. I don't like them because of the way they hide behind an appeal to method. I think the appeal here is meant to be to a quantitative and objective method, and my suspicion is that that is in itself an attempt to differentiate metascience from other disciplines that also study science, like the philosophy of science or the sociology of science. And I understand the motivation. It's not the attempt to demarcate our epistemic community from others that bothers me.

Here are some more definitions: "research on the scientific process", or "the study of scientific norms, practices and cultures". These definitions do indeed describe metascience, but they equally describe philosophy of science and sociology of science too. And if metascience is to be distinguished from those fields, and I think it should be, then we need to add something to these definitions. But what we need to add is not an appeal to method. We've already heard from Simine Vazire at the start of this forum about some of the ways in which appeals to method fail to do the job. Besides, it's just intellectually empty: we all know there's no monolithic scientific method that we're all now going to follow in some sort of meta way. It's a promise we simply can't deliver on. Worse still, definitions that rely on method put limits on how our metascience community can grow, because they only invite scientists into the program. And that simply won't do. We need ethicists, philosophers, librarians, sociologists, historians and more in this metascience enterprise with us.

So, to finish off these definitions, so that they capture the work of the metascience community but without appealing to method, I suggest adding the following phrases: the study of scientific norms, practices and cultures, in the service of science, for the purpose of intervening in and improving science. What differentiates this community from others is not our methods but our goals. We seek to correct errors, fix structural problems, improve credibility and public trust, and so on. In contrast, sociology and philosophy of science describe and deconstruct; they are rarely prescriptive or interventionist.

Now, it follows from this new definition that there's a greater burden on us metascientists than there is on other fields to understand our end game. If we're doing more than just describing and cataloguing problems, if instead we are intervening in the service of science, then we had better have thought deeply and extensively about what better science looks like and how to measure progress towards it. We had better have an evaluation and monitoring research program. We also carry a responsibility to distinguish very carefully between means and ends: means like pre-registration and replication, and ends like accountability and trustworthiness. If we end up mistaking the means for ends in themselves, then we'll lose sight of how to make appropriate interventions in different contexts.
We'll limit our ability to find the relevant pathways to our ends in different disciplines. Another nice thing about this new definition is that we can quickly make it more inclusive by substituting "research" for "science".

So the rest of this talk is really about the purpose and the scope of metascience. Metascience is a field whose primary responsibility, in my opinion at least, is evaluation and monitoring. We evaluate and monitor open science initiatives and other interventions to improve science. We evaluate and monitor the impact of institutional reward structures and resource allocation methods. And we evaluate and monitor peer review processes, both current operations and proposals for alternative models. This is not a comprehensive list, of course, but I think it covers most of the current activity in the field.

But there ought to be other things on this list. For example, we ought to be running evaluation and monitoring programs for recommendations to improve statistics communication, or for proposals to improve methods education. If we claim that Bayesian statistics, or estimation, or whatever it may be, will improve our scientific judgment or speed up the progress of science, then those things need to be on our evaluation and monitoring list too. And all of this work is underpinned by the exploration of essentially contested concepts and by the labour of self-correction and error detection. Later in the talk I'll explain those things, and I'll talk about what happens when we don't make space for this under-labour, and what happens is bad.

Running through all of this is a recurring question: where should the work be done? There's metascience work that perhaps should be done inside primary disciplines, in psychology, in medicine, in economics itself. There's other work that's perhaps best set up outside those disciplinary homes, in the way that sociology and philosophy of science sit outside the disciplines they study. And then there's work that requires interdisciplinary collaborations. And everywhere there are constant tensions: tensions between disciplinary relevance and independence, and tensions between wanting to learn from other fields so that we don't reinvent the wheel and the risk of overextending those lessons to places where they don't belong.

Now I'm going to quickly run through some examples of these different categories of work. This is a library of all the articles that have ever been assigned an open science badge. It's colour coded by badge type, so you can see when an article has a pre-registration badge, when it has one of the others, or when it has all of them. This database now provides the infrastructure for several evaluation and monitoring projects. With it we can ask questions like: are articles with open data badges more likely to be computationally reproducible? Are pre-registrations typically followed? Is there noticeably less p-hacking and cherry picking in articles that have these badges? (A minimal sketch of this kind of query appears a little further on.)

Evaluation and monitoring sounds easy, but of course it isn't. First there's the problem that most initiatives are introduced without a time frame for evaluating their success. So if the first round of articles with pre-registration badges aren't discernibly different from unregistered articles, do we give up on that intervention? That would seem wrong. We know it must take some time for people to learn the pre-registration ropes.
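As promised, here is a minimal sketch of the kind of query the badge library supports, written in Python with pandas. The file name, column names and audit flag are all hypothetical placeholders; they stand in for whatever structure the real library uses, and the reproducibility flag assumes a separate manual audit has already been carried out.

```python
# A minimal sketch of querying a badge library like the one described above.
# The file name and column names are hypothetical, for illustration only.
import pandas as pd

# Assume one row per article, with boolean badge columns, a publication year,
# and a boolean "reproducible" flag from a separate computational audit.
articles = pd.read_csv("badge_library.csv")

# Uptake over time: how many articles earned an open data badge each year?
uptake = articles[articles["open_data_badge"]].groupby("year").size()
print(uptake)

# Evaluation question: are open-data-badged articles more often computationally
# reproducible than articles without the badge?
rates = articles.groupby("open_data_badge")["reproducible"].mean()
print(rates)
```

The point is simply that once the library exists as infrastructure, questions like these become routine queries rather than one-off projects.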
So instead of abandoning the intervention, we keep monitoring, and we make decisions about what a reasonable amount of progress is for the time that has elapsed. And if improvement doesn't come in time, then it's the metascientist's responsibility to determine the next course of action. Do we need more education around the intervention? Do we need stronger incentives or enforcement at an institutional level? Or do we abandon this program and start another one?

In one of our projects we are investigating the open data badge specifically, and we look at whether articles awarded this badge do in fact provide a link to data; whether that data is in fact connected to the published article; whether the variable labels in the data match the labels in the text; whether we can reproduce basic calculations, like the mean age of the participants; and so on, until we ask whether we can fully reproduce the whole thing. So, with increasing levels of difficulty, we can evaluate all of those aspects of computational reproducibility, and in doing so we can measure the impact of the open data badge. (I'll show a rough sketch of this graded checklist shortly.)

But all of that is really just one step. There are easier evaluation tasks, like evaluating the uptake of the incentive: how many articles get open data badges over time. Then there are evaluation tasks like the one I just described for computational reproducibility. And then there are much harder ones, like evaluating whether computational reproducibility is an effective means to improving accountability. We have to ask those more difficult questions to get beyond this, to ask what else can achieve this end when computational reproducibility can't, or when it isn't appropriate for the research at hand. And if we skip that most difficult step, then we're at risk of becoming circular or tautological, of ending up with transparency for its own sake, or, to take a different example, of ending up with research that's replicable yet trivial and uninteresting.

Metascience also evaluates the impact of institutional reward structures. A recent survey by the Wellcome Trust found that 78% of researchers think that high levels of competition have created unkind and aggressive conditions in science. It identified creativity as one of the most common features of an ideal research culture, but noted that three quarters of researchers felt creativity was currently being stifled by that culture. And as Katie Corker's talk yesterday outlined, there's a whole suite of metascience research questions related to the design, testing and implementation of new incentives and new models for distributing resources. I expect that examples of the latter, like lottery schemes for funding allocation, are well known in this crowd, so I won't say anything more about them.

The final set of metascience questions on my list focuses on peer review. As we know, peer review carries so much of the burden for establishing what is trustworthy evidence, and yet we really know very little about the practices that constitute it. In this survey of 280 journal editors across different fields, we learned many things about journal policies and editorial practices: things like how often journals run plagiarism checks, or how often they employ blinding procedures. Another thing we learned is that many editors believe it's okay to edit a reviewer's report. What does editing a reviewer's report mean? Well, sometimes it means relatively uncontroversial things, like editing out offensive language. 85% of editors think that's okay, so fair enough perhaps.
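Returning for a moment to the open data badge project described above: those checks can be thought of as an ordered rubric, from easiest to hardest, where an article is credited with the highest level it passes. This is only a rough sketch; the level names and check functions are hypothetical placeholders, not our actual audit instrument, and in practice each check involves manual auditing or re-running analysis code.

```python
# A rough sketch of the graded reproducibility checks as an ordered rubric.
# Level names and check functions are hypothetical placeholders.

def highest_level_reached(article, checks):
    """Run the ordered checks and return the last level the article passes."""
    reached = None
    for level, check in checks:
        if not check(article):
            break
        reached = level
    return reached

# Ordered from easiest to hardest, following the checks described in the talk.
CHECKS = [
    ("data_link_present",      lambda a: a.get("data_url") is not None),
    ("data_matches_article",   lambda a: a.get("data_matches_article", False)),
    ("variable_labels_match",  lambda a: a.get("labels_match", False)),
    ("basic_values_reproduce", lambda a: a.get("mean_age_reproduced", False)),
    ("fully_reproducible",     lambda a: a.get("all_results_reproduced", False)),
]

example = {"data_url": "https://example.org/data", "data_matches_article": True}
print(highest_level_reached(example, CHECKS))  # -> "data_matches_article"
```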
Back to the editor survey. About 40% think it's okay to edit a reviewer's report to take out the reviewer's identity, the reviewer's name. Maybe that's okay: perhaps the reviewer signs their report but the journal has a double blind policy. But it doesn't stop there. 20% of editors think it's okay to edit a reviewer's report if they simply don't agree with what the reviewer has said. I don't mean that they think it's okay to write their own statement saying that they disagree with the reviewer; rather, they think it's okay to actually edit the reviewer's report itself to make it agree with their own judgment.

Surveys like this help us understand what is going on behind the closed doors of peer review. They also help us gauge resistance to new proposals, open peer review for example, and that in turn can help us think about what alternative models of peer review might look like.

So one project that we're working on in this space at the moment is the repliCATS project: Collaborative Assessments for Trustworthy Science. repliCATS is a platform for evaluating scientific articles. As the name suggests, it's a collaborative group activity centered around evaluating the likely replicability of published research claims. repliCATS is underpinned by the IDEA protocol (Investigate, Discuss, Estimate, Aggregate), so what we have here are layers upon layers of acronyms. In this protocol, reviewers first make private individual judgments about the research claim. They judge its comprehensibility, the prior plausibility of the underlying effect, and the claim's likely replicability. They enter these private judgments, and their reasons or justifications for them, into the platform. After entering their own private judgments, group members are shown the judgments and reasoning of the other people in their reviewer group. This provides them with information, obviously, but also with feedback for calibration: they can see how similar or dissimilar their own evaluations and reasoning are to other people's, and adjust accordingly. They then discuss these collective judgments as a group. They share information, they interrogate their differences, they explore counterfactuals, and after the discussion there's a final opportunity to privately update their own estimates.

Importantly, the process is not consensus driven. The quantitative or probability judgments that reviewers provide are mathematically aggregated into a final assessment, and this aggregation can occur in many ways, from taking the simplest average to using sophisticated models that weight reviewers by the number of independent reasons they give for their judgment, or by the extent of their uncertainty as measured by the width of the interval they provide. (A small sketch of this aggregation step follows in a moment.)

At the moment the repliCATS platform exists primarily to predict replicability, but we're now working to extend it into a more comprehensive peer review protocol. We expect the advantages of this process over traditional peer review to include the fact that it is inherently collaborative; that it has inbuilt training and calibration; that the feedback itself is intrinsically rewarding; that it has a defined endpoint, unlike many other interactive peer review processes, which implicitly rely on consensus by fatigue; that it captures quantitative and qualitative judgments in a way that encourages interrogation; and that it is transparent by design. So what's our monitoring and evaluation program here?
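Before answering that, here is a small sketch of the aggregation step just described. The weighting scheme, down-weighting judgments that come with wider uncertainty intervals, is one simple illustration of the family of approaches mentioned above; it is not the actual repliCATS aggregation model, and the numbers are made up.

```python
# A minimal sketch of mathematically aggregating reviewers' replicability
# judgments. Each reviewer supplies a (lower, best, upper) probability triple;
# narrower intervals (less uncertainty) receive more weight. This is an
# illustration only, not the repliCATS model itself.

def aggregate(judgments):
    """judgments: list of (lower, best, upper) probabilities in [0, 1]."""
    weights, estimates = [], []
    for lower, best, upper in judgments:
        width = max(upper - lower, 1e-6)   # guard against zero-width intervals
        weights.append(1.0 / width)        # narrower interval -> larger weight
        estimates.append(best)
    return sum(w * e for w, e in zip(weights, estimates)) / sum(weights)

# Three reviewers' final private judgments after the group discussion phase.
group = [(0.40, 0.55, 0.70), (0.20, 0.30, 0.60), (0.50, 0.60, 0.65)]
print(round(aggregate(group), 3))
```

Taking a simple unweighted mean of the best estimates would be the other end of the spectrum mentioned above.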
The first phase of evaluation will occur through the DARPA SCORE program, where replicability predictions will be evaluated against actual replication studies. How to measure the protocol's success as a more comprehensive peer review process is of course more complicated, and that work will require collaborations with preprint servers and journal editors.

Now, I said I'd return to this under-labour and tell you about the bad things that happen if you don't do this work. These quotes are from James Heathers' blog. James is a well-known error detective and self-proclaimed "data thug", and here he's lamenting the fact that there is no established term for a scientific critic, that there's no disciplinary home for a person who investigates published science for demonstrable inaccuracy. This work has no name, he says. What happens when this self-correction or error detection or replication work isn't named and isn't claimed by any field? What happens is that it ends up getting talked about like this. It's no longer the legitimate underpinning of metascience, no longer the essential labour that maintains science's self-correction mechanisms; instead, it's a "witch hunt". What a ridiculous way to interpret self-correction work. We cannot afford not to give this work a name or a home; we can't afford these kinds of interpretations.

Another part of the under-labour of metascience is exploring essentially contested concepts. This is really important too. Essentially contested concepts are concepts where there's widespread agreement on the value of the concept (we all agree that fairness is good or, in the case of metascience, that we want to improve research quality), but where we do not have agreement on how to recognize, use, measure or evaluate the concept. So, like linguistic ambiguity, vagueness or underspecificity, essentially contested concepts often result in the kind of arguments where people talk past each other, where they're using the same words but mean different things. It's a bit like conceptual slippage, only in the case of essentially contested concepts things are often further complicated by the fact that these are evaluative concepts: they deliver value judgments. And essentially contested concepts are everywhere in metascience. They're even in places that aren't obviously normative, like our concepts of replication or evidence. Sorting through all the ambiguity, vagueness and underspecificity to eventually get to the value structure behind these concepts underpins metascience, or at least it should if we don't want to go off the rails. This is work that must be collaborative, it must be done with philosophers and sociologists of science, and it's why I think an inclusive definition of metascience really matters. We write this work out of the program when we employ definitions of metascience like "the science of doing science". This work isn't science, but we won't come to any sophisticated understanding of our purpose in metascience without it.

I'll quickly finish up now by offering a few ideas about how I think this field should grow. First, as I've said several times now, we need to define metascience as an epistemic community by its purpose, which is service to science, not by the methods it employs. We need to understand how evaluation and monitoring research works.
We need to carefully differentiate our means from our ends, and to find the relevant pathways to those ends for different disciplines. And, importantly, we need to distribute the labour and the under-labour: we need to fund metascience within primary disciplines; we need to fund metascience outside those disciplines, in secondary disciplines like history and philosophy of science, science and technology studies, and research ethics; and we need to fund metascience through interdisciplinary collaborations. Thank you.