Hi. Thanks for joining us at this lightning talk session. I'm Fiona Fiddler, and I'll be moderating. We have six different lightning talks in this session, and I'll introduce them in turn. Each speaker will speak for no more than eight minutes, and if there's any time left over at the end, we'll have a few questions and answers. You can pop questions into the Q&A as we go along if you'd like, and the speakers can try to answer them while they're not speaking. The first talk in this session is called How Not to Measure Replication, by Samuel Fletcher. So Samuel, do you want to share your slides, share your screen now? Thanks. How does that look, Fiona? Perfect. Great, thank you.

The presentation I'm giving today is based on a paper that is now available at the European Journal for Philosophy of Science. In the upper right corner there's a URL that will take you to a read-only copy, which also has a DOI you can copy if you want to download the paper. If you don't have access to it, you can just Google my name and email me, and I'll arrange for you to receive a copy. This presentation is really just an advertisement for some of the arguments I make in the paper. The framing centers on a paper I think many of us are already quite familiar with, the 2015 Open Science Collaboration paper on Estimating the Reproducibility of Psychological Science. In that project, they used five different methods for estimating reproducibility, or rates of replication, for a variety of experiments in social and cognitive psychology. We can see here a very nice diagram plotting the replication effect size against the original effect size for many of these, with density plots showing the contrast between the effect sizes, where the color coding contrasts results that were significant with those that were only significant in the replication; the red ones, of course, are the ones that were not significant. In doing so they used five different methods for estimating the rate of replication, and the question I want to ask is: how conceptually acceptable is each method? My intention is not to pick on or particularly criticize the Open Science Collaboration. My reading of their intention is that they selected methods that are used in different parts of the replication community, so I take this to be more a criticism of the general frameworks that the community has often adopted for trying to assess rates of replication. I'm not going to be critical of all of them; there is one I want to argue is probably the best adapted to our interests.

To start with the first one: the first criterion used is subjective assessment. The replication team answers the question, did your results replicate the original effect? There's a lot to say about this, but I'm going to focus on one conceptual distinction we might want to make about this type of assessment: we can consider it either a proxy for a quantitative method or not. What I mean is that in asking the replication team this question, we are asking them to use their judgment to assess replication, where they understand that judgment as a stand-in for some quantitative criterion. If that's not what is intended, then it raises the question of why we should believe a non-proxy subjective assessment is actually a good measure of replication.
That is to say, perhaps it taps into some deep sense of expert judgment. But the literature on expert judgment indicates that expert judgment is developed through a wealth of direct experience and feedback, and that doesn't seem to describe the situation the replication teams are in when they make this sort of subjective assessment. And if the subjective assessment is a proxy for a more quantitative method, then why not use the quantitative method instead?

Another popular method is determining whether or not the replication results in the same outcome in a null hypothesis significance test. There are many ways to criticize this; here are a few examples. One is that it's very easy to replicate null findings. In this diagram I've given a cartoon of an effect size plot you might see: the central dotted line represents an effect size of zero, and the paired bars indicate the outcomes of experiments. They are confidence intervals, with the dot indicating the best point estimate. In this example we have a null finding, and it becomes very easy to replicate that null finding just by collecting very little data, so that you have a very large confidence interval; the resulting test will probably not be significant. Another issue is that it becomes easy to count as replications data that actually indicate very different effect sizes. We see this here: both of these experimental results lead to rejection of the null hypothesis, but they seem to indicate quite different states of nature. And finally, it's also very easy to get a non-replication with very similar data, owing to small differences in whether the resulting confidence intervals overlap the zero effect size or not. These seem to provide similar evidence about the effect size, but this we would count as a non-replication.

Another criterion that's used is effect size comparison: one runs a hypothesis test on whether the resulting point estimates are significantly different. However, this makes it easy to replicate false positives: if an original experiment yields only a marginally significant result, a replication with a near-null estimate may not differ significantly from it, so the false positive counts as replicated. And depending on whether the original, in the case depicted here for example, is interpreted as a positive result, we can get an asymmetric judgment. What I mean is that if this one is considered to indicate the existence of an effect and this one is not, then depending on which one is the original we can get a different answer as to whether no effect has been replicated or an effect has been replicated. Similarly, asking whether the original effect size lies within the replication's confidence interval has the same asymmetry problem: in situations with asymmetric confidence intervals, it can happen that one study replicates the other but not vice versa. Of course, these criteria also don't account for the uncertainty of the effect size estimates.

The last method they used is, I think, probably the best option we have: meta-analysis. Instead of asking whether this result has been replicated, meta-analytic methods combine the evidence from multiple sources to answer the question: what is the total best evidence for the size of an effect? One common criticism of meta-analysis is that it is susceptible to publication bias. That's true, but this is something every replication measure must contend with; it's not unique to meta-analysis.
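To make the contrast between the criticized "same significance-test outcome" criterion and meta-analytic pooling concrete, here is a minimal sketch, not from the talk: two made-up studies, each reported as an effect estimate and standard error, checked against both approaches. All numbers, names, and thresholds below are illustrative assumptions.

```python
import numpy as np
from scipy import stats

def same_nhst_outcome(est1, se1, est2, se2, alpha=0.05):
    """The criticized criterion: do both studies give the same significance verdict
    against a null of zero effect?"""
    p1 = 2 * (1 - stats.norm.cdf(abs(est1 / se1)))
    p2 = 2 * (1 - stats.norm.cdf(abs(est2 / se2)))
    return (p1 < alpha) == (p2 < alpha)

def fixed_effect_pool(estimates, ses):
    """Inverse-variance (fixed-effect) meta-analytic pooling: asks 'what is the
    total evidence about the effect size?' instead of 'did it replicate?'."""
    w = 1.0 / np.square(ses)
    pooled = np.sum(w * np.array(estimates)) / np.sum(w)
    pooled_se = np.sqrt(1.0 / np.sum(w))
    return pooled, pooled_se

# Made-up numbers: a fairly precise null original, and a tiny, noisy replication.
orig_est, orig_se = 0.02, 0.05   # not significant
rep_est, rep_se = 0.30, 0.60     # almost no data, huge confidence interval, also not significant

# Counted as a "successful replication" even though the replication is nearly uninformative.
print(same_nhst_outcome(orig_est, orig_se, rep_est, rep_se))

# The pooled estimate and its standard error summarize the combined evidence instead.
print(fixed_effect_pool([orig_est, rep_est], [orig_se, rep_se]))
```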
I do think meta-analysis has the best chance of being the best measure of replication, and you can read more about that in the paper I published. Thanks for your kind attention.

Great, thanks. That was really interesting, and excuse my own questions going into chat instead of the Q&A; a technical problem with being the host is that I can't put things in the Q&A. We will move straight on to the next talk, as I said, and if there's time for questions at the end we'll do it then. Please do feel free, if you see questions pop up in the Q&A, and for those of you attending, please do put questions in there, and then speakers can respond to them while other presenters are giving their talks. Our next talk in this session is called Examining Universal Design Principles in Open Science Journals, by Caitlin Stack Whitney. Can you see my screen, just to check? Thank you, just checking, you never know.

Hi, I'm Caitlin, and I do want to mention that although I'm the one speaking with you today, this is work in collaboration with two of my colleagues at Kent State University, Christy Belay and Julia Perone, and an undergraduate researcher who works with me at RIT. Today I'm talking about universal design principles in open science journals, and I was excited that in the Q&A in the last panel earlier this evening they were talking about the role of journals, because I think it's really important for us to think about journals and their role in maintaining, and potentially changing, scientific culture. So I will be talking about journals today. I'm thinking about open science in particular, and open science means a lot of things. This is a graphic showing that under the open science umbrella, one way of thinking about open science is that it has three elements: open data, open access, and citizen science. What I'm talking about today is the open access component. The term open access means a lot of things to many different people. Some people mean cost, as in the software or the hardware is cheaper or potentially free, or that more people can use it, or that the learning curve to using some of these tools is lower, maybe it's easier to learn, or it's not proprietary. Or it might mean the data is accessible, as in it's hosted online or it's not behind a paywall. But access and accessibility also mean disability accessibility: accessibility as in being easily accessible to people with disabilities.
So the question our team wanted to answer is: how many open access journals, and open science broadly, consider and provide disability accessibility as part of open access, under the banner of open science? We aimed to explore this by doing a content analysis of the author submission guidelines for manuscripts and also the open access statements, the vision statements, of journals, and we did this for 300 open journals. We started from the Directory of Open Access Journals, which has over 15,000 journals in its database, and we used a random number generator to pick 300 English-language open journals. We then analyzed the author guidelines and open access statements, and for this analysis we focused specifically on image accessibility. There are many dimensions to accessibility, and access looks different for everyone, but in the title, as I mentioned, we're looking at universal design principles, so there are some known best practices for making things more accessible. In the case of image accessibility, we're talking about things like image descriptions and alternative text, captions, legends, high contrast, colorblind-friendly palettes, things like that. And in the open access statements we're looking for really any mention of accessibility, including disability accessibility, or access and inclusion.

So I'm going to show you what we found for these 300 journals, first on whether or not they included disability accessibility in their author submission guidelines for manuscripts. I'm going to show you a bar chart, and it is high contrast, but I will point out that the materials posted after this have all of the image descriptions; I'll walk through the chart, and we also have it as a table in a blog post linked on the last slide. In this bar chart, the way to read it is that if the journal had this element in its guidelines the bar is black, and if it didn't, the bar is gray. In the column on the left you're seeing that, of the 300 journals, most of them, over two thirds, had some requirements for images; they had some kind of guidelines they required of authors. But what you can see in the last three columns is that, number one, literally no journal of the 300 we surveyed required alternative text for images, none of them, and very few of them required high contrast images to make them more accessible. The only element for which we really saw any evidence, in around 30 of these journals, so about 10 percent, was information about and consideration of color choice, things like colorblind-friendly palettes.

Looking more broadly at how journals thought about open access in their own words, in the statements they provided about their journals and their commitment to open, again with a similar color scheme: journals that had it have a black bar, and if the stacked bar is gray it means they don't. In the column on the left you're seeing that most of the journals, over two thirds, said they had accessibility as part of their open access consideration, but this speaks to the discrepancy between that last slide and this one: the other two columns show they mean access in a really different way. They are not including disability accessibility in their vision of open access currently. There was no mention of disabled or disability, or inclusion or inclusiveness, in any of these 300 journals.
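For authors who want to apply the image-accessibility practices just mentioned without waiting for journals to require them, here is a minimal sketch, not from the talk, of producing a colorblind-friendly figure and keeping alt text alongside it. It assumes a recent matplotlib that ships the tableau-colorblind10 style; the file names and alt-text wording are purely illustrative.

```python
import matplotlib.pyplot as plt

# Colorblind-friendly palette that ships with matplotlib.
plt.style.use("tableau-colorblind10")

fig, ax = plt.subplots()
# Distinct markers and line styles so the plot does not rely on color alone.
ax.plot([1, 2, 3], [2, 4, 8], marker="o", linewidth=2, label="treatment")
ax.plot([1, 2, 3], [2, 3, 4], marker="s", linestyle="--", linewidth=2, label="control")
ax.set_xlabel("Time point")
ax.set_ylabel("Outcome score")
ax.legend()
fig.savefig("figure1.png", dpi=300)

# Alt text travels with the figure rather than inside it: keep a short description
# next to the image file and paste it into the submission system or HTML output.
alt_text = (
    "Line chart of outcome score over three time points. The treatment group "
    "rises from 2 to 8; the control group rises from 2 to 4."
)
with open("figure1_alt.txt", "w") as f:
    f.write(alt_text)
```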
All right, I do want to point out that, with my collaborators, this is something we're doing as part of a larger project focused on reproducible science, mostly in the environmental sciences, and one of the things we are finding is that accessibility is a real barrier to doing more reproducible and open science, and it's not a small one. I also want to point out that, for myself, doing this work, especially this piece of it, was inspired by two colleagues who are deaf and disabled scientists working in open science and who personally experience, and witness in others, a lot of barriers to participating in open science. I've put up here where you can learn more about our broader project and some of the places we're sharing this: we share our grant application, we share all our materials and the project in progress. Dr. Hare has been one of the proponents of pointing out that many open science tools are not accessible to screen reader users, and that there really is no open science if it's not accessible to all users and all creators. The last thing I'll mention is, if this is really new information to you, you can learn lots more about image accessibility from disabled scientists and disability advocates. It's really easy to put in alt text and image descriptions; you don't need to wait for journals to require you to do it, and if you're on a journal editorial board, this is something you can go back and talk to them about right now. So I will stop there for now.

All right, thank you. I'll just remind you one more time to add some questions to that Q&A panel for our speakers so they have something to do. Our next talk is by David Reinstein, and it has a very long title, let me try and read it: Moving Science Beyond Closed, Binary, Static Journals: A Proposed Alternative, and How the Effective Altruist and Global Priorities Non-profit Sector Can Make This Happen. Okay, thanks David. David, you're on mute, can you try again? Okay, now? Again? Okay.

That was such a great introduction, but you all didn't see it. Oh, wrong button. Okay. Speaking of journals, I don't think we need them per se; I think we're fine and we can do things on our own, and I have a discussion of what I think we should do. If you don't get the link, you can look at my thoughts at bit.ly/unjournal. Oh, why did it... can you still see my screen? Okay, the first slide's back, do you want to try that again? Yeah, let's try that again. Okay, I don't know why it's acting up today. So here are some things that I think are problematic with the journal system, at least now; this is my experience in economics, and I think it describes economics and some related fields, policy and social science. It's a tedious process, it's a process that everyone constantly complains about, and it wastes a lot of resources. And here I'm not just talking about journals but the whole idea that you submit to a journal, you wait, and then hopefully for Christmas you get something positive from them, but maybe nothing at all, and then you feel sad. Now I feel like I'm preaching to the choir here a bit, so I won't go on at length, but basically I think there are three major costs to the current system. One, which I think is not that great but it exists, is that journals are charging a bunch of money; it's not a huge magnitude in comparison to other costs, but the access issue is important in some contexts. The second cost, which
I think is more important, and which this audience probably appreciates, is that a PDF by itself is limiting: it's a barrier to transparency and robustness, it's very hard to publish a dynamic document, and it's a limit to innovation. But the cost I think is most important is just that the way the journal process works wastes a lot of time and effort. Maybe economics is an extreme case, maybe it's less so in other fields, but I think there's limited feedback and limited quality control, and it could be done so much better. The main waste is this whole dance of submission: if you go to any lunchtime coffee table or coffee pot, you'll see all people talk about is, oh, which publication should I submit to, how can I get it in, oh, they're having a special issue, oh, this person is the editor. That's not science, that's a waste of time. Go to Twitter if you feel like punishing yourself and look this stuff up.

So I think what we need is to evaluate work. I think peer review is great, we need scrutiny, but I think the whole accept/reject cycle is a waste of time. I think you should submit your paper somewhere, or better your project, thinking of the paper as the tip of the iceberg, and you should get feedback and a rating; then, if you want to improve it, you can continue to work on the paper. So basically there are two elements here. One is where the paper or project is hosted, particularly thinking of dynamic documents. The other is how the review happens: who motivates people to do reviews, how does that system work. I think it needs to be quantifiable and credible, and once we have that, this system can completely replace journals; we don't need all these different journals quote-unquote publishing stuff and printing it and having it stored in the back of the library.

Until now, if you mention this, people will say, yeah, I would love to do this new system, but my department won't let me do that, or I won't get tenure, or I won't get this grant, or the metrics, the REF, don't value this. But we're the ones setting these rules; we have the power. It is a collective action problem, though, and I think one thing that can help get us out of it is that there are some organizations that are not tied to these traditional systems, that really care about the results, that are willing to fund this, and that really care about the science. I work at such an organization, it's called Rethink Priorities, and a lot of our funding comes from something called Open Philanthropy, and I think there are ways we can get out of this collective action problem. One is the presence of these other organizations, which are not traditional academic organizations, and which are willing to put credibility into these other reviewing processes, these other processes of quote-unquote publishing work. I think conditional pledges are extremely important; check out freeourknowledge.org. And I also propose, if you're not willing to submit your work to the unjournal, these escape bridges. On the conditional pledges, you might be aware of some of these, such as the pledge to publish only in open access journals. I also like the pledge to post all of the peer reviews I receive: if I've committed to posting all the peer reviews I receive, well, that's credibility in and of itself for my paper. You can take this pledge here. Another
thing you could do is post all your reviews publicly and commission an outside expert to independently assign a quantitative rating to those reviews, if you have to submit to traditional academic journals. And there are tools I think can help us do this. On the side of where to host things: there are many places to host dynamic documents and projects, and even to get a DOI for dynamic documents, web pages, and projects, not just the tip-of-the-iceberg paper, which often isn't the best way to communicate research and ensure transparency. The other half is how you enable evaluation, rating, and feedback other than through journals. I think PREreview is a very good interface for this, and ResearchHub is also promising, and I'm curious to hear what people think about these different tools; I've given a little rundown of my opinions on their potential in an Airtable that's linked in the slides, which are shared. Just to give you a picture of PREreview, if you're not familiar with it: they don't host the preprints, they just say you can review a preprint, or even a publication as long as it has a DOI, or you can ask for your preprint, publication, or plans for an experiment to be reviewed. I would say the limitations are that there are not that many preprints up yet and there are not many reviews yet; they're scaling up, they're gearing up, and I think we need to put some of our money and resources into providing incentives for people to do reviews on platforms like these. Are these reviews quantitative? Not entirely, although there are some quantitative measures. In my book, we need more of a way of scoring papers, a way of rating and ranking them, particularly in disciplines like social science where there's not a clear gold standard for what counts as a good piece of evidence and what doesn't, but I think this is a good move in that direction.

Okay, I'm nearly out of time, I believe. I also mentioned ResearchHub, and I mentioned a route for organizations like mine, for what we should be doing and what we should be cultivating. For various reasons we don't have as much patience to follow the traditional academic route: submit to this journal, wait six months, as in economics, get great reviews from two referees and a negative one from the other, and end up with nothing. We need other routes towards credibility and feedback, and I think by us, and organizations like Open Philanthropy, putting our resources and our credibility into this, we can start to tip the momentum towards: let's not publish in locked-down traditional journals, let's put our working papers up on archives and get independent, credible reviews of that work. So I think I'm pretty much out of time, but here's my call to action. I want you to participate in these things: to take conditional pledges, if you can, to commit to posting all reviews of your work, there's a pledge for that, and to maintain living projects. I know I'm preaching to the choir here, this is the metascience, very open-sciencey crowd, but there can be enough of a feeling of, oh, I don't want to take a risk to do this because no one else is doing it. There have to be rewards for people who are the first movers, so that you have a fear of missing out for not taking these quote-unquote risks. And finally, I think that public-interest open science funders like Open Philanthropy should support this mode rather than the traditional journal mode, and I think not only will it be better for us, it will
also be appreciated by academics, and academics will love you for it and help you because of it. Okay, so that's my presentation, sorry if I went over time.

Okay, thank you. The next presentation in our session is by Nicholas Otis; this talk is called Forecasting in the Field: Evidence from Policy Experiments. Thanks, Nicholas.

Okay, great. Thank you to the organizers for the opportunity to present today. This project asks whether academics can accurately forecast the results of large-scale policy experiments. This question is important because, if they are accurate, we can use people's predictions to select and test better policies. I collect predictions of results from three large-scale, pre-registered randomized controlled trials in Kenya. These trials evaluate a diverse range of treatments, from cash transfer interventions to a variety of different mental health interventions; you can read more about the details of all of these interventions in the working paper on my website. In total I collect 2,100 predictions from 135 academics of 50 causal effects. To pin down ideas, let's consider one policy experiment, which tested the effect of psychotherapy versus a control condition in a field experiment in rural Kenya. The experiment found that therapy actually increased depression by 0.05 standard deviations, though this effect wasn't significant, while the mean prediction, by both economists and clinical psychologists, was that therapy decreases depression by 0.12 standard deviations. So the absolute error on the mean prediction is 0.17 standard deviations. That's one of the 50 predicted effects. We can calculate a summary measure of accuracy simply by calculating the mean forecast for each effect, calculating the absolute difference between the predicted effect and the observed effect, and then calculating the average absolute error across all 50 effects. This leads to the first result, which is that the mean forecast is actually quite accurate: the average absolute error is only 0.09 standard deviations, as depicted by the red vertical line in the histogram of absolute error, and over 50 percent of the predictions are less than 0.05 standard deviations away from the true experimental effect. So academics are, on average, quite good at predicting the effects of policies.

How does the accuracy of individuals compare to the average, or wisdom-of-the-crowd, accuracy I've presented so far? Consistent with previous research, I find that the mean prediction performs substantially better than individual predictions: the average absolute error for individual forecasts, depicted by the gray bar on the left, is over 60 percent larger than the error from the mean predictions. In other words, there are large accuracy improvements from asking many people and then averaging their predictions together. Now let's try to put these ideas to work. As a stylized example, consider a situation where we're trying to determine which policy to implement or which policy to test. One way to choose which policy to test is to ask people, but how many people should we ask? Is it sufficient to ask just a few people, or should we be asking many people? We're going to consider a choice between sets of policies that were tested within the same trial; for example, the trial with the psychotherapy intervention I mentioned earlier also included a benchmark cash transfer intervention in the same experiment. We assign a value of one if the forecasts rank the interventions correctly and zero if they do not rank them correctly.
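A minimal sketch of this ranking measure, and of the crowd-size comparison described next, might look as follows. The forecast data, effect sizes, and function names here are made up for illustration and are not the study's data or code; the only thing borrowed from the talk is the procedure of bootstrapping crowds of size one to fifteen and scoring whether the crowd's mean forecasts rank two interventions the same way the experiments did.

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up stand-ins for the real data: each forecaster predicts the benefit
# (in SD units, larger = better) of two interventions tested in the same trial.
true_effects = {"psychotherapy": 0.05, "cash_transfer": 0.25}
n_forecasters = 135
forecasts = {
    name: rng.normal(loc=effect, scale=0.20, size=n_forecasters)
    for name, effect in true_effects.items()
}

def crowd_ranks_correctly(idx):
    """1 if the crowd's mean forecasts rank the two interventions the same way
    the experiments did, else 0 (the policy-choice measure in the talk)."""
    means = {name: forecasts[name][idx].mean() for name in forecasts}
    pred_winner = max(means, key=means.get)
    true_winner = max(true_effects, key=true_effects.get)
    return int(pred_winner == true_winner)

# Bootstrap crowds of size 1..15 and estimate how often each size ranks correctly.
for crowd_size in range(1, 16):
    hits = [
        crowd_ranks_correctly(rng.integers(0, n_forecasters, size=crowd_size))
        for _ in range(2000)
    ]
    print(crowd_size, round(float(np.mean(hits)), 2))
```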
So if the forecasts correctly predict which intervention will perform better, we say the forecast yields the correct policy choice. We then compare the policy choices made by crowds of size one to fifteen, where a crowd prediction is just the mean prediction and crowds are generated through bootstrap samples from the total sample of forecasters. Here's the result for crowds of size one: this is just the proportion of interventions that individuals rank correctly, which is 57 percent. So they're doing better than chance, but still not great. But we can see that there are substantial returns to crowd predictions: the benefit from asking additional forecasters is increasing, though concave, and by the time we're asking fifteen people we're getting the rankings right about 74 percent of the time. This is a substantial improvement in policy choice, and I think it's a very encouraging result; it provides the first large-scale evidence that crowd predictions can meaningfully improve policy choice. So thank you for listening, and please visit my website if you're interested in learning more about this project.

Thank you. Sorry, I lost my screen for a minute there. Thank you. Now the next talk is by Wes Boniface: The Clandestine Operations of Complexity in Statistical Modeling.

Okay, thank you very much. I'm going to jump right into it. A key component of any scientific undertaking is the construction of a model that explains the data. But no model is an exact representation, by definition, of the phenomenon we're interested in investigating, and especially in psychology, useful models are often simplistic approximations of immensely complex processes. So it's necessary to evaluate a given model, to judge its characteristics, to investigate its nuances, and to critique its flaws, and centuries of scientific reasoning have led to supposedly different schools of thought about model evaluation. But despite these different philosophies and techniques, each school has the same goal, which is that models should be of use. So what defines a useful model? Among the many available answers to this question, Myung, Pitt, and Kim in 2005 proposed several quantitative criteria, including goodness of fit, generalizability, and model complexity. While goodness of fit addresses the closeness of the model to the observed data, something we're probably all familiar with, generalizability is a measure of how well a model will fit unseen data that has been or will be generated by the same underlying process, but in different samples. And then, following the principle of Occam's razor, a model should not only fit the data well but should do so in the simplest manner possible, and so these authors define complexity as a model's inherent flexibility, the flexibility that enables it to fit a wide range of data patterns. Notice that's a general definition: it doesn't mention the source of the flexibility, and that's something we're going to talk about here. Importantly, these three properties, goodness of fit, generalizability, and complexity, are not independent; all three criteria impact one another. Specifically, as a model becomes more complex, goodness of fit to the observed data will necessarily increase, a phenomenon we're all familiar with. Generalizability to future data will also increase, but only up to a certain point; beyond that point the model becomes too complex, it has explained too much of the noise that's particular to the observed sample data, and therefore that
model will be less generalizable. When generalizability begins to wane, the model is said to be overfitting the data. Consider the example data shown in the panel on the bottom right: we have seven data points. If we fit a simple linear model, the model on the left, to these seven data points, generalizability would be adequate, but it's not stellar, and goodness of fit would be unimpressive: the model is too simplistic, it captures the downward trend in the seven data points but doesn't closely fit any of them. A very flexible model, like the one shown on the right, would achieve excellent fit, but this comes at the cost of excessive parameterization: there are too many parameters, and so this model is unlikely to generalize to other data sets. By overemphasizing the role of the observed data, this model would serve little use in future samples or other research scenarios. But the curvilinear model in the middle is a compromise: it would provide ideal fit and maximum generalizability. Note that ideal fit doesn't mean perfect fit; it means ideal in that you're not overfitting the data. It's right at that sweet spot where you're not over-explaining your sample data but you are capturing the trend that's likely to generalize, that's likely to represent your theory. So the curvilinear model, by de-emphasizing exact fit to the observed data and concentrating instead on the underlying trend, would be deemed the most useful of these three models.

Now, the relationships shown in this figure are not new; in fact the entire figure is borrowed from Myung and Pitt (2002), as cited in the top left. What might be less well known is that there are two factors known to influence model complexity. The first, of course, is the number of freely estimated parameters in the model. Suppose our least complex model, the blue linear model on the left, is represented by a simple linear regression formula: we have one intercept term and one regression predictor, so just two parameters are estimated, b0 and b1. We could increase the complexity of this model by adding parameters: we could get a quadratic model, like the one shown here, that might represent the curvilinear model in the middle, or we could increase it to a highly parameterized polynomial model, and there are proofs showing that with a high enough degree (degree n minus one for n distinct data points) you can achieve perfect fit to any data set. So that's one way to increase the complexity of this simple linear regression model, and if we want to evaluate that complexity, we have things like AIC and BIC that include a penalty for including too many parameters in your model. Now, what might be less familiar, and again this is nothing new, it's just less familiar, is that you can also increase the complexity of a model without changing the number of parameters. You can keep the same number of parameters and do things like this exponential function: you still only have a b0 and a b1 that are estimated, but you get a curvilinear function that might fit the data, might capture the trend you're interested in. Or you can use this sinusoidal function, where you can achieve essentially perfect fit; you still only have two parameters, but you have a very flexible model. Evaluating this form of complexity is very difficult; some people have called it hidden complexity.
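To make the "same parameter count, different flexibility" point concrete, here is a minimal sketch, not from the talk, that fits three two-parameter models, linear, exponential, and sinusoidal, to the same small data set. The data, starting values, and fit summary are all made up for illustration.

```python
import numpy as np
from scipy.optimize import curve_fit

# Made-up data with a gentle downward trend (seven points, as in the talk's figure).
x = np.array([1, 2, 3, 4, 5, 6, 7], dtype=float)
y = np.array([5.1, 4.4, 4.6, 3.5, 3.2, 2.8, 2.1])

# Three models, each with exactly two free parameters (b0, b1).
def linear(x, b0, b1):
    return b0 + b1 * x

def exponential(x, b0, b1):
    return b0 * np.exp(b1 * x)

def sinusoidal(x, b0, b1):
    return b0 * np.sin(b1 * x)

for name, f, p0 in [("linear", linear, (5.0, -0.5)),
                    ("exponential", exponential, (5.0, -0.1)),
                    ("sinusoidal", sinusoidal, (5.0, 1.0))]:
    params, _ = curve_fit(f, x, y, p0=p0, maxfev=10000)
    rss = np.sum((y - f(x, *params)) ** 2)   # goodness of fit (residual sum of squares)
    print(f"{name}: params={np.round(params, 2)}, RSS={rss:.2f}")

# All three models have the same parameter count, so AIC/BIC penalize them identically,
# yet the functional forms differ enormously in how flexible they are.
```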
When I looked at the conference program, I was reminded that I had called this the clandestine operations of complexity, a title I was surprised I had actually submitted. But the idea is that there are hidden forms of complexity, and it's difficult to evaluate them and uncover them. One promising method is known as the minimum description length principle. This is a very technical approach, so I'm not going to get into it in this lightning talk; I'll have some further reading suggestions you can consult to better understand it. But I do want to show a brief example to illustrate how the reasoning of the minimum description length principle can help us investigate this hidden source of model complexity.

In a paper from a few years ago, a colleague and I examined five psychometric models, and I'm just going to zip past the details: we have an exploratory factor analysis model, a confirmatory bifactor model, and a couple of latent class type models, which are used in educational measurement contexts. Importantly, these four models were chosen because they all have exactly 20 parameters, so by the traditional view of complexity they should all fit equally well. We also included a unidimensional item response theory model, but this was specified so that each of the seven items has three parameters, for a total of 21; that model actually has an extra parameter. Under the traditional view, an extra parameter provides more flexibility, so this model should be better at fitting data, or even at overfitting data. We fit these models to a thousand random datasets and looked at the goodness-of-fit criterion across all thousand datasets, using 0.05 as a cutoff for the fit statistic. What this plot shows is that the unidimensional model, for example, only fit well to two out of a thousand random datasets. Remember, our general definition of complexity was a model's inherent tendency to fit well to a diverse range of data patterns; these are a thousand random data patterns, and the unidimensional model wasn't very good at fitting them, because it's too simplistic. The latent class models fit about five percent of these random datasets. But look at this: 64 percent of bifactor models fit well to random data, that is, 635 out of a thousand datasets obtained a goodness-of-fit statistic of 0.05 or lower when fit to random data. And these are totally nonsensical data, factor loadings all over the place, they make no sense, but you still get good fit with that model, almost as well as you would with an exploratory model. The bifactor model is confirmatory, it's usually based on theory, and people use its good fit to confirm a theory and to say this model supports that our theory is correct. But the model performs almost the same way an exploratory model performs; essentially, it is mining the data and giving you a good fit that you are then using to make substantive claims about your findings. These results illustrate the presence and the effects of this hidden complexity, the clandestine operations of this complexity. Chris Preacher was one of the main forces behind putting this work out there, and he called this effect fitting propensity: certain models, like the bifactor model, have an inherent tendency to fit well to any possible data. The unidimensional model in our example had an extra parameter, but it actually was far less likely to fit well to random data.
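The fitting-propensity check just described can be sketched in a few lines: generate random data, fit each model, and count how often it clears a fit threshold. The example below is illustrative rather than the study's actual code, and it uses ordinary polynomial regression and an R-squared threshold in place of the psychometric models and their fit statistic, which require specialized software.

```python
import numpy as np

rng = np.random.default_rng(1)

def r_squared(x, y, degree):
    """Fit a polynomial of the given degree by least squares and return R^2."""
    coeffs = np.polyfit(x, y, degree)
    resid = y - np.polyval(coeffs, x)
    return 1 - resid.var() / y.var()

n_datasets, n_points = 1000, 10
x = np.linspace(0, 1, n_points)

# Fitting propensity: how often does each model "fit well" (here, R^2 > 0.9)
# to data that are pure noise and therefore contain nothing real to explain?
for degree in [1, 3, 6]:
    good_fits = sum(
        r_squared(x, rng.normal(size=n_points), degree) > 0.9
        for _ in range(n_datasets)
    )
    print(f"degree {degree}: fit well to {good_fits}/{n_datasets} random datasets")
```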
So the point there is that you can't just count the number of parameters; that's not going to help you identify whether your model is fitting too well or not. This has very strong implications for using goodness-of-fit statistics to make claims about our models and to evaluate them, and for models that tend to fit well, overfitting is a statistical inevitability. We don't want to overfit our data, because we want to find the generalizable trends, but we're not going to do that if we just rely on goodness of fit. Most importantly for the metascientists in this symposium, successfully replicating the good fit of a model with high fitting propensity is a hollow victory: you think you've made progress, but you haven't really, because you are using a model that will always fit well. In that case it's not helpful to claim that you found something because you got good fit; you're using a tool that always fits well, so you're not able to support those results. I'll conclude with the same conclusion Preacher reached in his 2006 paper: cherished models may have to be abandoned or replaced, he said, if their past successes can be ascribed more to fitting propensity than to any insight they lend into the process that actually generated the data. So if our goal when we talk about replication is to find things that matter and things that are real, and to actually establish that our theories are correct, testable, and successful, we can't just rely on goodness-of-fit statistics. Here's some reading that will be posted for you to check out, and thank you very much.

Okay, thank you. We have one final talk in this session, and I'm afraid it's at a very bad time of day for its presenter, Bianca Trovo, so Bianca has sent us a video instead. The title of the talk is Reimagining Value in Academia Through the Blockchain: Ants-Review, a Decentralized Platform for Incentivizing Scientific Peer Reviews. Okay, so bear with me while I attempt to play this video. Right, let me know, can everyone see that? Someone let me know. Yes? Okay, let's see how the sound goes.

Hello everybody, I am Bianca Trovo, and I will present Ants-Review, an independent project that I conceived for incentivizing peer reviews through blockchain technology. But first let me walk you through the concept of value in the current research economy. When we talk about value we immediately think of money, and what has been valued for the past twenty years is data. We're currently seeing a transition from a centralized internet, or Web 2, the one born with social media, characterized by an economy of value extraction and data exploitation, to a decentralized internet, or Web 3, the one born with blockchain, characterized by a token economy in which users are also owners of their data. So when it comes to science, where does the value lie? What gives value to a scientific paper, affecting its credibility and citation impact, is the peer review process, the gold standard by which the scientific community reaches consensus and science is published. From its beginning, though, peer review has relied on a very centralized model, and this model behind scientific publishing and scientific consensus is considered more and more inefficient: too slow, too time consuming, non-objective, not inclusive enough, and, broadly speaking, situated in a broken reward system in which peer reviews are neither remunerated nor openly recognized by the academic community as a relevant scientific output for researchers. And this can be traced back to the centralization of
science around the profits of big academic publishers, who don't directly do the science but capitalize on it, and to misaligned institutional incentives. You could say we're still stuck in a Web 2 science, where the value flow is extracted by a few aggregators that give no tangible recognition to those who produce it. However, science per se already implies decentralization in the peer review validation process, in which scientific papers, like blocks on the blockchain, are verified in order to be added to the network, and this verification process doesn't need central authorities. A sort of decentralized model of peer review is already available for the spontaneous validation of preprints, in the form of community comments and Twitter endorsements, but this happens through centralized platforms that monetize its value, so the value is dispersed. So how do we capture value in science, in peer review? We saw that peer reviews add value to papers; this value needs to be recognized. Two more premises: recognized value builds reputation, and reputation increases trust. Blockchains are trust machines and incentive machines: they allow us to program value, and money, through verifiable computations, giving sovereignty back to individuals and communities.

So our proposed answer is Ants-Review, a decentralized platform that incentivizes community-driven peer reviews and empowers reviewers. We identified three focal points to improve peer review: transparency, double anonymity, and recognition of peer reviews. Thanks to blockchain, Ants-Review can at the same time guarantee transparency and anonymity, fixing accessibility problems and misconduct, improving review quality, minimizing biases towards identity, helping onboard young researchers, accelerating the review process, and motivating reviewers. Transparency and open peer review are achieved through IPFS, a decentralized file system where data are uploaded through permanent timestamped commits, a notarization process called proof of existence. Double anonymity is achieved through cryptographic methods such as zero-knowledge proofs and homomorphic encryption, which guarantee privacy and accountability. How does the incentivization work? Authors submit a bounty call asking for their preprint to be peer reviewed in exchange for a monetary reward. Anonymous reviewers submit their reviews, and those are voted on by the scientific community based on their quality and added value. Reviewers are then proportionally paid, and the more tokens they gain, the higher their reputation is on the platform, signifying their research impact. Tokens also carry voting power and allow for governance of the protocol. To get a better idea you can try our platform online, if you have a crypto wallet installed in your browser, and I invite you to check out our paper as well, published this year. To sum up: the scientific reward system is not aligned with its value flow, but blockchain technology can fix this by empowering its creators and users. This way we can move from a centralized science to a decentralized science. To be sustainable, open science needs economic incentives, and with blockchain it's possible to combine open peer review and double-blind review. Ants-Review addresses the lack of recognition of peer reviews' value through an incentivization system that builds reputation. Thanks for having me here, and I would like to thank also my co-authors and our sponsors.

Great, thanks. I hope that video was clear. I apologize; it just wouldn't be an online conference in a pandemic if you couldn't hear noisy children in the background.
Right, so I'm happy to have provided that from my house, and I wasn't able to tell them to be quiet because then you wouldn't have been able to hear the video, so apologies, Bianca, when you're watching this; maybe we'll be able to do something about it in post-production. Right, so I think we have maybe three or four minutes for questions now, which is good. I can see there was one question for Nicholas in the Q&A; do you feel that one's been answered, or would you like to say something further about it, Nicholas?

Yeah, sure. So I guess the question is asking to what extent predictions are valuable if there's already consensus around the effectiveness of a policy. One of the things I didn't present is that there's actually tremendous heterogeneity in people's predictions: the means do well, but the spread is pretty wide, and for a lot of the interventions the results actually appear to be quite surprising empirically. So people can be pretty good at ranking interventions, the mean prediction does pretty well, but lots of people are way off on interventions that were presumed to have a large effect, like psychotherapy. Meta-analyses suggest modest but significant effects of psychotherapy, and this is a large trial in Kenya finding there doesn't seem to be a substantial effect, and in the paper they go into lots of reasons why that could be the case. So I think part of the value of predictions is to show how much new information studies are providing: if people predict a large effect and the observed effect is smaller, or has a different sign, that indicates a lot of new information. The other response is that I think it's going to be hard, in general, to get a relative ranking of two policies from a meta-analysis, because oftentimes we're trying to compare results in one context with results in another context, and this seems to be a situation where the aggregated forecasts really perform well in terms of providing a relative ranking over two policies.

Thanks. I can also see in the chat there's an interesting debate brewing between Richard Wien and David, our presenter, about what exactly the extra value of journals might be; Richard and David have both expressed their views about that in the chat. I'm wondering if any of our other panelists would like to weigh in on that question. I'll just jump in and add that one of the reasons I really enjoyed the last presentation is that double anonymous reviews are often uncommon in the sciences but more common in the humanities and social sciences, and many people who are minoritized experience really hateful comments in peer review. So I think that's an important consideration: how opening up the review process might make that better or worse. I'm really interested in making sure that we're troubling these ideas of merit and what floats to the top through any of these sorts of processes. That's an important point, I think. Would anyone like to respond to that?
So beyond double-blind review there's also triple-blind review, where the associate editor handling the manuscript also doesn't know the identity of the authors. The British Journal for the Philosophy of Science, for instance, uses triple-blind review, which adds a bit of back-end complexity, but nothing that some good programming can't help us with.

We still have a minute or two left, not much more; any final comments from our panelists, on a topic of your choosing? Panelists, speakers, presenters, sorry, I'm not sure what the appropriate term is; this is an opportunity to have the last word. Could we possibly just agree, in case anyone does have questions, should we go to Slack, or should we go to floor one, table one of, what was that place called, Remo? If we could just coordinate on floor one, table one of Remo, if anyone has any thoughts or comments, is that okay, to coordinate there? Yeah, that sounds good. So if you're planning on heading over to Remo now, the link is in the chat already; I'll repost it in case it's gone too far up for you, and Slack is there all the time and open all the time, you can check that at your leisure. But for now, those of you who would like to come over to Remo for further discussion, floor one, table one, is that what you just said, David? Yeah. Okay, great, thanks a lot. All right, well, I'll say thank you again to our presenters in this lightning talk session, and we'll see you at another session soon. Thank you. Thank you all. Thank you.