So I'll start with the motivation, because the reason we're doing this research is that we think publicly funded research should in some way benefit society: not necessarily immediately, but somehow it should make society better. We're not going to defend that assumption in any way; it's just what we're assuming, and I want to be clear about that. Once you assume it, I think our question follows quite naturally: is it actually the case that what we are doing in the humanities makes society better? The reason we're investigating this is that quite a few people have raised doubts about it, quite a few prominent people. Kitcher is one of them, and he describes a lot of contemporary philosophy as scholastic self-indulgence for a few. He also has a nice one where he calls many projects in contemporary philosophy artifactual puzzles of little abiding significance. And then Lindsay and Boghossian, if that's how you pronounce his name (I cite them just because the quote is nice), think that philosophy twiddles away, seriously entertaining the idiographic and inconsequential: attend philosophy conferences and marvel at the obscurity and irrelevance of what's become of the discipline.

So those are pretty damning words, a bit strong here and there, but I think they raise a legitimate question. The problem is that it's not easy to measure societal relevance, to check whether research is societally relevant. There's the easy way to go, which is to use altmetrics: there's lots of data available, and it's very easy to gather and analyze. The problem is that we don't really know what the numbers mean, and they're probably not very reliable; the number of Facebook likes or tweets or Google Plus mentions a paper gets probably says very little about its societal relevance. So that's not the best way to go. A better way is to do in-depth analyses of research projects: really look at what they did and how it impacted society. That probably works great but isn't scalable, and we want to look at this at a large scale, which is just not feasible that way. So the way most people go, by necessity, in funding agencies and institutions, is to ask other researchers to read a summary of the project and decide whether it's societally relevant: they just review. And that's what we've investigated in this research project, the use of peer review for measuring societal relevance.

So these are our research questions. To what extent is research in the humanities societally relevant? We just want a very rough answer to that question. Are there differences between subfields in the humanities? And is peer review useful for measuring this? This is the structure of the talk: I'll first briefly say something about the notion of societal relevance, and then I'll talk about the methods and the results.

So first, the notion of societal relevance. I want to start with a disclaimer, because I don't want to make anyone angry: we don't claim that relevance is the same as value. I'm assuming that most if not all research in the humanities is somehow valuable; it's interesting, or thought-provoking, or it makes the researcher who does it happy. Maybe not your thesis, Max. But that's not what we mean by societal relevance. That notion has to be more limited, because some things that are really valuable aren't relevant to society. And a productive way of thinking about it is to go via research funding.
So we'll just be assuming that societal relevance is whatever is worth funding given the funds that we have. You get a relative notion of relevance, where you rank the projects, and whatever is above the funding line is societally relevant. That's still very vague; it doesn't give us much to go by, but it gives us something. We can use Kitcher's ideas about an ideal committee and his notion of well-ordered science. I assume most of you will have heard of it, so just very briefly: it's a thought experiment Kitcher proposes to evaluate the research agenda of science. Is science actually investigating what it should investigate? According to Kitcher, we should investigate whatever would be chosen by an ideal committee. It's just a thought experiment, not a real committee; it's a way of thinking about what we should be doing. This ideal committee consists of well-informed deliberators, so they know what they're evaluating. They should perfectly represent all segments of society, not just one country but all of humanity, now and in the future, so all interests are represented. And they should be charitable: they try to understand each other and take each other's preferences into account. Whatever they would choose is what is societally relevant.

Of course there are lots of problems with this thought experiment. It's probably underdetermined what they would choose, and even if it weren't, it's probably impossible to know what they would choose. But we do think it gives us something to go by, namely the intuition that societal relevance is somehow related to a democratic decision by all of society: whatever we choose to fund should be something that in some sense the majority would choose. So that's what we've used to reformulate our research questions. The first research question becomes: to what extent are scholars in the humanities currently investigating what the ideal committee would choose? The second is still about differences between the fields, in that same sense. And finally: should we expect peer reviewers to select research that an ideal committee would select?

So those are the questions. The way we've investigated this is that we hired expert raters to read a bunch of abstracts and evaluate whether they think they would be chosen by an ideal committee; I'll explain this further later. I'll just do all the clicks now. We randomly sampled 690 abstracts from Web of Science, both journal papers and books. We selected more books than their true proportion in the database, because we think a lot of the societally relevant work might be in books, and we definitely wanted to capture that. The time window is these four years; that's not really relevant for this study, but we have an altmetric study paired with it, and it's relevant there. The fields we divided the humanities into are history, philosophy, religion, linguistics, and literature. This is taken from the ECOOM classification (ECOOM is the Flemish institution that deals with research evaluation), with some small changes: we dropped arts and design, because almost none of their publications are normal journal articles, so it's very difficult to compare them to the other fields, and we dropped architecture as well, because a lot of architecture is just engineering work, which is not typical humanities. Part of it is typical humanities, but it would be very difficult to select only those papers.
Then we hired 16 raters, people who were either doing a PhD or had already received their PhD, at least two from each field, and we briefed them extensively in advance about Kitcher's ideas and the tasks we wanted them to do. We divided the abstracts into two sets, because rating half of all the abstracts already takes about 10 hours, and we thought 20 hours of rating would be a bit too much. So that's the general overview.

For the rating task, we gave them 69 sets of five abstracts, each set containing one randomly selected abstract from each field. They got those five abstracts on their screens simultaneously, and we asked them first to order the abstracts by how likely they think it is that each would be chosen by the ideal committee: they would rank them from one to five, with the most valuable one at the top. Then we also asked them, for each separate abstract, to give a binary rating of whether it's likely that the committee would choose it, so that we also had some kind of absolute measure. It might be that all five abstracts in a set get ones, or all zeros, or some ones and some zeros; that was up to the raters. The only restriction we gave them was that the two ratings had to cohere: if one abstract was ranked above another in the ranking measure, then it couldn't be that the lower-ranked one got a one while the higher-ranked one got a zero. So those were the tasks.

For the analysis of the data I'll say a bit more, because that's where our assumptions come out very clearly, and those are the ones that are important to think about and get feedback on. We wrote a generative model of the data, a model that mimics how we think the data was generated. The main assumption of this model is that when raters do the tasks, the ranking task and the binary task, they put the abstracts on some sort of implicit, continuous scale of relevance. So we assume that the raters think of societal relevance as something continuous, that abstracts are more or less relevant, and that this is a latent variable we can measure. We're assuming that this latent relevance estimate of the raters somehow depends on the content and form of the abstract: the reason they give it a higher or lower estimate, we think, is that the abstract is about certain things or has a certain form. But we also think the estimate depends on characteristics of the raters themselves. They might have certain biases; for example, we think they might be chauvinistic, rating abstracts from their own field differently from abstracts from all the others. From those assumptions it follows that the ordinal outcome, we assume, is simply a ranking of the estimated relevance scores of the abstracts: they get a set of five papers, they implicitly put those five papers on a continuous relevance scale, and whichever has the highest score ends up at the top, and so on. The binary score is a bit more complicated. We think it's clearly influenced by the relevance estimate, so the probability of getting a one is higher if the relevance estimate is higher; but in addition we think it depends on the strictness of the rater. Some raters are very strict: even with the same relevance estimate as another rater, the strict one might give a zero where the other gives a one. So the binary score is determined both by the relevance estimate and by the strictness of the rater. That's a verbal summary of our assumptions.
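To make those assumptions concrete, here is a minimal simulation of that generative story for a single set of five abstracts. This is purely an illustrative sketch: the parameter values and variable names are mine, not from the actual analysis code.

```python
import numpy as np

rng = np.random.default_rng(0)
n_raters, n_papers = 9, 5                       # one set: five abstracts, one per field
fields = np.arange(n_papers)                    # field index of each abstract in the set
rater_field = rng.integers(0, 5, n_raters)      # each rater's home field

true_relevance = rng.normal(0.0, 1.0, n_papers)  # latent relevance of each paper
chauvinism = 0.5                                 # bonus a rater gives their own field
strictness = rng.normal(0.0, 1.0, n_raters)      # rater-specific threshold shift

for r in range(n_raters):
    # The rater's noisy internal estimate, biased upward for their own field.
    estimate = (true_relevance
                + chauvinism * (fields == rater_field[r])
                + rng.normal(0.0, 0.5, n_papers))
    order = np.argsort(-estimate)        # ordinal task: papers from most to least relevant
    p_one = 1.0 / (1.0 + np.exp(-(estimate - strictness[r])))  # logit link
    binary = rng.binomial(1, p_one)      # binary task: "would the committee choose it?"
    print(r, order, binary)
```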
We've also put this into a directed acyclic graph, a representation of the causal relations between our variables. The outcome variables are the binary data and the rank data. As I've mentioned, both are influenced by the estimate the rater makes of the relevance of the abstract, and in addition the binary data is influenced by the strictness. The estimate, in turn, is influenced by two things. In the first place, by some latent true relevance of the paper. We don't want to be too metaphysically committal about this; think of it like IQ in the context of intelligence, a construct we use to model how the estimate comes to be. We're thinking that the raters' estimate will not be the same as this true relevance of the paper, but it will be somehow similar to it, influenced by it. And in addition, the estimate is influenced by the biases.

The true relevance of the paper, in turn, is influenced by a whole bunch of codes. These are all binary codes indicating whether the content of the paper has a certain feature: whether it is about ethics or morality, whether it is about the present, whether it is about representation of and discrimination against minorities, whether the abstract describes empirical research, whether the research described in the abstract has an impact on education or on health and well-being, and whether the abstract deals with the physical sciences or the physical environment. The final two are document type, whether the abstract was for a book or a paper, and the field, which is just a sort of intercept that indicates which field the paper came from and captures everything that's not captured by the other codes. The way we got the data for this is that we made a coding scheme, two of us read all the abstracts and coded them for these features, and then we compared the results and discussed any disagreements until we had a consensus coding. So we are assuming that these things influence the true relevance of the abstract, which in turn influences the estimate of the raters, which then, together with the other factors, generates the data we actually have, the binary and the ordinal data.

We've then turned that into a statistical model, a Bayesian model, which has the same structure as the DAG I just described: the binary rating is caused by the relevance estimate and by the strictness of the rater, and the relevance estimate in turn is caused by the true relevance and by the chauvinism. There are a couple of features I should explain, because they involve additional assumptions. Firstly, the structure of the model is hierarchical. We're assuming that the data has a hierarchical structure, that there are certain groups in the data, and that it's relevant to tell the model these groups exist. More precisely, we have these rater estimates of the abstracts, and we're telling the model that all the estimates of the same paper belong together. What that allows us to do is pool information between the estimates of the same paper, so that they all mutually inform each other.
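Concretely, the core of such a model could be written in a probabilistic programming language roughly like this. This is a hedged sketch in PyMC, with made-up data sizes and priors: it covers the binary outcome, the partial pooling of papers by field, and the content codes as predictors of the latent true relevance, but it omits the ranking likelihood, which the real model also has.

```python
import numpy as np
import pymc as pm

# Tiny fake dataset, purely illustrative (the real study: 690 abstracts, 16 raters).
rng = np.random.default_rng(1)
n_fields, n_papers, n_raters, n_codes, n_obs = 5, 20, 6, 8, 120
field_of_paper = rng.integers(0, n_fields, n_papers)
X = rng.integers(0, 2, (n_papers, n_codes)).astype(float)  # binary content codes
paper_idx = rng.integers(0, n_papers, n_obs)
rater_idx = rng.integers(0, n_raters, n_obs)
own_field = rng.integers(0, 2, n_obs).astype(float)  # 1 if abstract is from rater's field
y = rng.integers(0, 2, n_obs)                        # observed binary ratings

with pm.Model() as model:
    # Field-level hyperpriors: papers from one field share a prior (partial pooling),
    # and content codes shift the prior mean of each paper's latent true relevance.
    beta = pm.Normal("beta", 0.0, 1.0, shape=n_codes)
    field_mu = pm.Normal("field_mu", 0.0, 1.0, shape=n_fields)
    field_sigma = pm.HalfNormal("field_sigma", 1.0)
    relevance = pm.Normal(
        "relevance",
        mu=field_mu[field_of_paper] + pm.math.dot(X, beta),
        sigma=field_sigma,
        shape=n_papers,
    )

    # Rater-level parameters: chauvinism (can be negative) and strictness.
    chauvinism = pm.Normal("chauvinism", 0.0, 0.5, shape=n_raters)
    strictness = pm.Normal("strictness", 0.0, 1.0, shape=n_raters)

    # The rater's noisy latent estimate of each rated abstract.
    estimate = pm.Normal(
        "estimate",
        mu=relevance[paper_idx] + chauvinism[rater_idx] * own_field,
        sigma=0.5,
        shape=n_obs,
    )

    # Binary rating through a logit link, shifted by the rater's strictness.
    pm.Bernoulli("rating", logit_p=estimate - strictness[rater_idx], observed=y)

    idata = pm.sample()  # NUTS; returns an ArviZ InferenceData object
```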
So imagine there are nine ratings of the same abstract in total, because we have nine raters per group. If eight of those ratings are pretty high and the ninth is very low, the model is going to pull that ninth one toward the mean, because we've given them the same prior: there's a shared prior through which they pool information. That's a very nice feature, because we can get everything out of the data that we possibly can by telling the model about that structure, but it's also an assumption we're making, because we're giving them the same prior. And we do the same thing at the level of the true relevance of papers within a field. We're telling the model to pool information between all the papers from the same field: they share a prior, every paper from that field updates that prior, the prior in turn pulls on all the other papers from that field, and in that way all the papers inform each other. That's very nice, because we can pool information, and we also get out of the model an estimate of the mean relevance of each field, and of the standard deviation as well. But again, it's an assumption we have to be very clear about, because we're putting into the model that these groups exist.

The final thing I should clarify, because it's not so typical for Bayesian models, is that this first part is the generative model, how we think the data was generated, and tagged onto it we have a classical linear regression with all the causal factors as predictors, where the outcome variable is the estimated true relevance. We're using the estimated true relevance of the papers to run a linear regression with the causes thrown into it. The nice thing about that is that we keep the uncertainty we have about the true relevance of those papers and carry it into the regression. The alternative would have been to use means or something like that, but this way we keep the uncertainty and can propagate it all the way through the model, for the causes as well. Is that more or less clear?

[Audience] I personally don't understand it, basically. Could you repeat what the aim of the model is?

The aim of the model is to mimic how we think our data was generated. We have our data, and somehow it was produced by these raters; what we're trying to do is get a mathematical formulation of that generating process, to understand the reasons behind the decisions the raters made and to see how important the various factors are. Assuming the data was generated this way, and given that we have these data: how important was strictness? How important was chauvinism? How important was it that a paper was about ethics, and so on? Assuming the data was actually generated like that, the model tells us how important each factor was.

[Audience] This is probably a good question for the end, but it's probably good to ask it now: can chauvinism be negative?

I've actually run two models. One has a single chauvinism parameter for all raters, and I've given it a normal prior with a mean of zero, so it can go either way. And I've also run a model with varying chauvinism parameters, each rater getting their own. There we saw that most were positive, as you'd expect, but there were some strong negative ones as well.
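On that last point, once such a model is fit you can read the sign of each rater's chauvinism straight from the posterior. A minimal sketch, assuming the `idata` object from the PyMC sketch above:

```python
import arviz as az

# Posterior summary of the per-rater chauvinism parameters.
summ = az.summary(idata, var_names=["chauvinism"])
print(summ[["mean", "hdi_3%", "hdi_97%"]])
# A rater whose interval sits above zero favors their own field;
# below zero is the "negative chauvinism" case from the question.
```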
[Audience] In places where I've done something like this before, in circumstances like co-taught classes, we've often noted that people tend to notice the flaws in the stuff that's about their own field. Which feels kind of like negative chauvinism, right? Or it might reflect expertise: they might just be better able to judge it.

So yeah, definitely. Cool. Okay, so then the results. Maybe first something about the scale. I haven't said anything about this, but we're working with basically three levels of latent variables: the rater estimates, the paper relevance, and the field relevance. None of these are on a real scale that we know of; it's the model that determines the scale, on the basis of the ones and zeros it got as binary ratings. You might have seen that I've put a logit link into the model, which effectively fixes our scale to somewhere between, say, minus four and plus four, because that's where the probability of a one is almost zero and almost one (a quick numeric check of this follows below). But just to give you an idea of this scale, I've checked which papers got the lowest, the mean, and the highest scores, so we have an idea of where the value comes from. What you see here is a density plot, a sort of histogram, of the mean estimated true paper relevance: our model gives each paper a value, and as I said, the scale runs from about minus four to plus four.

To anchor this scale, to tell you what minus four means and what plus four means, I've checked which papers sit at those scores. This is the paper that got the lowest estimated score; I think it is from religion, or maybe history. It basically discusses how two rather obscure texts are related. That probably explains the scores: presumably none of the raters knew these texts. It also scored zero on all our causal codes. The mean one was also from religion, but it at least points to a discussion of enduring difficult questions, which might well be something relevant, and it might foster moral reflection, so it scored one on the ethics code. And the one that got the highest score was about language impairment in children, from linguistics. It's easy to understand why it gets a high score; it also had an empirical component, so it scored one on three of the codes.

I've done the same for philosophy, because I thought that would be of interest here. I haven't checked for history, but this is the philosophy one. The lowest was a purely historical paper about Philipp Frank. I don't know if you've ever heard of him.

[Audience] You haven't? No, I have never heard of him. He did the Library of Living Philosophers series, right? I think? Kind of an arch-positivist, as I remember, right? Yeah.

It had no causal codes in any way. The mean one was about Kierkegaard, also with no causal codes, but it does discuss something about dealing with anxiety and what we can do about it, so maybe that prompted raters to give it a higher score. And the one with the highest score, this one, is about education in the responsible conduct of research, which is a rather societally relevant topic. It explains why the topic is important, it says there is a problem that we have to solve, and then it says: we solve it. It very literally says what can be done. So that's the scale, and all the results I'll be presenting will be somewhere on it.
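Here is that quick numeric check of the scale claim: nothing model-specific, just the ordinary logistic function (the inverse of the logit link).

```python
import numpy as np

def logistic(x):
    # inverse of the logit link used in the model
    return 1.0 / (1.0 + np.exp(-x))

for x in (-4, -2, 0, 2, 4):
    print(x, round(float(logistic(x)), 3))
# -4 -> 0.018 | -2 -> 0.119 | 0 -> 0.5 | 2 -> 0.881 | 4 -> 0.982
# Beyond roughly +/-4 the probability of a "1" is already essentially 0 or 1,
# which is what pins the usable latent scale to that range.
```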
So first, the raters' scores. 36% of the ratings we got were a one, and you can see there is some variation between fields. The inter-rater reliability for those scores was very low, probably not very surprisingly. It was better than chance, so it's not as if it was completely random, but it really wasn't high: there was obvious disagreement between raters. It was a bit better for the ranked data, still not very good, and you see the same patterns coming back there. So inter-rater reliability was very low; our raters disagreed a lot (there's a small illustration below of one way to quantify this). One of the things that might have caused this is chauvinism.

[Audience] Wow, the linguistics people really hate literature.

Yeah. That's something we didn't put in our model, because we didn't expect it, but there seem to be special relations between fields: some fields just dislike each other, and some fields are fine with each other. You can also see that PhDs from history, for instance, give their own field rather high scores. So probably there is some chauvinism going on somewhere, and that's also what our model tells us. This is for the model with a single chauvinism parameter: on a scale from minus 4 to plus 4, it's quite substantial. But there was quite a big difference between group 1 and group 2 (we had a group of 9 raters and a group of 7, who got different abstracts), still, very different. So that's why I tried a model with varying chauvinism parameters, and there you can see that the differences are really big: we go from minus 2 to plus 2. In particular, the people from literature are not very chauvinistic at all; they get big negative scores. And the people from philosophy were quite chauvinistic. Most people were a bit chauvinistic in the positive sense. So yeah, that's an interesting finding.

The other thing that can explain the massive disagreements, in our model, is strictness. People varied massively in how many ones and how many zeros they gave, even though they got exactly the same instructions. So there were very big differences in rater strictness, and that's reflected when we simply sum the binary scores per paper: group 1 had 9 raters and group 2 had 7, so each paper can get at most 9 or 7, and you can see that except for the maximum score the whole spectrum is pretty well populated. There was a lot of disagreement and very little complete agreement. That's also what our model tells us: there were very big differences in rater strictness. This is from the model with varying chauvinism parameters, and again it goes from minus 1.5 up to plus 1, so very big differences.

Then, for the differences between fields, these are the density plots, histograms basically, of the mean estimated paper values per field. They show a pattern that's pretty consistent throughout our data: philosophy and linguistics have slightly higher means, literature is clearly lower, and religion and history sit together somewhere in between. I should say that for the model with a fixed chauvinism parameter the differences are bigger, the same pattern but bigger differences; I think this is the more conservative estimate. And it's reflected in the estimates the model gives us of the field means and the field standard deviations.
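On the reliability numbers from a moment ago: the talk doesn't say which statistic was used, so purely as an illustration, here is one common way to quantify binary inter-rater agreement, mean pairwise Cohen's kappa, computed on a hypothetical ratings matrix.

```python
from itertools import combinations
import numpy as np
from sklearn.metrics import cohen_kappa_score

rng = np.random.default_rng(2)
# Hypothetical ratings: 9 raters x 345 abstracts (one group's half of the 690).
ratings = rng.integers(0, 2, (9, 345))

kappas = [cohen_kappa_score(ratings[a], ratings[b])
          for a, b in combinations(range(ratings.shape[0]), 2)]
print(f"mean pairwise kappa: {np.mean(kappas):.3f}")
# ~0 for random data; "better than chance but low" shows up as a small positive value.
```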
[Audience] At the risk of saying something that could be interpreted extremely provocatively: the philosophy curve looks a lot like the sum of two normal distributions. Have you played with that idea? You can fill in your own interpretation of what I might mean by that.

That's just by eye, so it's limited, and it's only English-language work. But either way you can see that the standard deviation for philosophy, as you can see here, is very high. That matches my experience of philosophy: you'll have ethics papers about something very relevant, and then you'll have some extremely obscure history of, I don't know, physics or whatever, for which it's very hard to see how it would be societally relevant. History, on the other hand, has a rather low standard deviation. To me that makes sense: it's all about the past, probably nothing super important, but also nothing wildly varying. There's something in these standard deviations that's maybe more interesting than the differences between the field means.

If we use these hyperparameters, the estimated field mean and standard deviation, to draw posterior predictive samples from the model, what you do is take a random mean (not really random, but in proportion to how likely it is under the posterior distribution) and a random standard deviation, again weighted by the posterior, and then draw, I think it's 10,000 samples I've taken here. You get an estimate of what, given our evidence, we can expect the distribution of value in the various fields to look like (a small numeric sketch of this sampling step follows below). This incorporates both the uncertainty about our parameter estimates (we have a mean and some region around it where we think the value might be) and the uncertainty of just sampling from a field: now and again you'll sample an extremely valuable or an extremely un-valuable paper. These posterior predictive samples take those two kinds of uncertainty together and present them at the same time, so I think it's the fairest representation of what we can believe about the various fields. And it's the same pattern, basically, just a bit more spread out, with very small differences. So the differences between fields clearly exist, but they are much smaller than the rater features.

Now, finally, what influences the rater estimates? These were the causal factors. All our codes get picked up by the model as seeming to increase the probability of getting a one in the binary score, or a high rank, except for document type. We thought that book abstracts might be written in a more commercial way (we had to take some of them from Amazon, I think, because Web of Science didn't always have abstracts for books), so we were a bit afraid these would be written as blurbs to sell books and that we would see this in the data, but that wasn't the case. Other than that, the minorities code has a very strong effect; ethics as well, although there was some difference between the groups; and health and well-being, and being about the present, are also very important. The linear regression also has an intercept for each field, which captures any variation by field that is not captured by the other codes, and there you can see that it's really small. The absolute values don't mean anything here, because it's an intercept and there's no natural zero; what's relevant is the differences between them, and I should actually have just plotted the differences.
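As promised, a minimal numeric sketch of that posterior predictive step, with stand-in arrays where the real posterior draws would go:

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-ins for MCMC draws of one field's mean and sd (in reality these would
# come from something like idata.posterior["field_mu"] / ["field_sigma"]).
post_mu = rng.normal(0.3, 0.15, 4000)
post_sigma = np.abs(rng.normal(0.8, 0.10, 4000))

# Resample (mu, sigma) pairs uniformly from the MCMC draws, which are
# already posterior-weighted, then draw a hypothetical new paper from each.
draws = rng.integers(0, post_mu.size, 10_000)
new_papers = rng.normal(post_mu[draws], post_sigma[draws])
# `new_papers` blends parameter uncertainty (which mu and sigma are right)
# with sampling uncertainty (which paper you happen to draw from the field).
```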
Anyway, the differences are very small; there's probably a lot of overlap. So there are things we're not capturing: literature is clearly lower than the rest, so there must be something about literature that we haven't coded that makes it score lower than the rest, but it's not very strong.

Then, just very briefly, the discussion. I'll come back to the research questions. Firstly, to what extent is research in the humanities societally relevant? Well, 36% of the abstracts got a one; the rest got a zero. Our model suggests that, due to chauvinism, this 36% may overestimate what we should expect to see if chauvinism weren't a factor. I should also say that we ran a pilot of the same study in a few classes of science students; they were just master's students and they got fewer sets, but they gave basically almost no ones. So it might be that there is a humanities bias in our set of raters, which is very tricky: on the one hand we want people who are well informed enough to understand all the abstracts, but on the other hand, as soon as you are that well informed, you're probably going to be a bit biased, because you've committed to a career in the humanities, so you'll probably find it more important than the average person does. The people in the science classes were still more or less from the same group; if you asked people in South Africa, or people 3,000 years from now, it would be even more different. So there's definitely bias in our raters, and it might mean the 36% is inflated even more. This is a very rough measure, but I think it is some indication that a lot of the research we're doing in the humanities is probably not research that would be chosen by the ideal committee.

So what to do? Well, I should say again, so as not to make anyone angry, and I really believe this: I don't think this is a problem at the individual level. I think it's really a problem of the incentive structure. Academia is structured in such a way that we have to write certain kinds of papers in certain kinds of journals; it's often much easier to get papers published if they are of certain kinds, and it's often much harder to do societally relevant research and have a successful career as an academic. So I think the main thing that should be done is to change the incentive structure, so that what is rewarded is not just publications in specialized journals but some kind of wider influence. That being said, there's one thing I always say when I talk about this. When I chose the topic for my PhD, and throughout my PhD, and even after it, I got lots of advice from people on how to be strategic, how to have a successful academic career: write this kind of paper, choose this kind of topic, deal with reviewers like this, and so on. But no one ever told me: look, why don't you choose as your research topic something that is really a societal problem, and try to solve it? I'm not saying the two cannot be combined; maybe a topic can be both strategic and address a societal problem, and if it's something the world really needs solved, maybe the chance of getting funded is even higher. But it's advice that I got very rarely. So I think a small difference could already be made if supervisors encouraged their students to find something that really needs to be done and do it. Even if you don't end up with an academic career, you'll at least have turned this societal funding into something useful.

Then the differences between the fields.
As I said, very small differences in our estimates: literature seems to be a bit lower, linguistics and philosophy a bit higher. But the number of raters was small, and they were all from the same arts faculty, and there are dynamics in the data that we've definitely not captured in the model, like the weird relation between philosophy and literature, or linguistics and literature. So that's something we have to look into. We'll probably want to try to replicate this with more diverse raters and see if the same results come back, but I think we'll always struggle with this trade-off between expertise and bias; I really don't know how to control for that.

And finally, where I think our results are most important is for the question of whether peer review is useful for assessing societal relevance. Our results tell us that even though peer review is used for this very widely (basically any grant application will have a criterion of societal relevance, some question about what your impact will be), it looks like the individual rater features probably outweigh any differences in value between the fields or the papers that are being rated. That's of course not a massive surprise: there's a lot of literature about peer review of grant funding in general, and inter-rater reliability is very often near zero there, at least if you look at the top 80% of grant proposals. So there is a lot of noise, and it's probably the same for societal relevance; I'd just say that here it showed up as a massive problem. So I think this needs more research. Firstly, there's the question of whether chauvinism is some sort of bias in favor of your own field (you've chosen your field, you like it more, you find it more important, so you want to give it more money) or whether it's that you simply understand your own field, so you can see the real value, or the real problems. And there's also the question of how it works precisely. Here we had people from different fields evaluating papers from different fields, which mimics an interdisciplinary panel, or maybe an institutional panel, but similar questions arise within a field: do you rate your own specialization higher or lower? Your own subfield, neighboring subfields, far-away ones? These are all measurable things that are quite important, because such committees are used everywhere and we're assuming these effects don't happen; so they should be investigated.

And similar to the chauvinism, there was a lot of variation in what counts as sufficiently societally relevant to fund. One question this raises is whether academics are actually capable of evaluating societal relevance at all; we're just assuming they are. If we ask them to review the scientific quality of a typical proposal, they're probably not capable of doing that reliably; that's what the research suggests, in that inter-rater reliability is near zero there. Still, it's at least reasonable to think they could evaluate scientific quality; it's much less clear that it's reasonable to ask them to evaluate whether a proposal is important to society. There are a lot of other people who might be better at evaluating that: policy makers, or just members of the public. There are some funding systems that already use non-academic reviewers in the process, just for the societal relevance. I don't think any empirical research has been done on that, but it's definitely something that should be considered. We shouldn't just assume that academics are good at evaluating whether something is societally important, because they are vulnerable to bias: if you're a philosopher, you might just be happy to say, give this philosophy project more money, it needs to be done, it is very important.
I think it's important to check for this. And finally, I think it would be nice if funders didn't just assume that all these biases and uncertainties don't exist, because that's what they do now: they ask people to review, and then they have some more or less arbitrary way of aggregating the judgments. Either they sum the review scores in some way, or they have a committee meeting and in the end someone decides when the committee has found consensus. But these are all factors that can be modeled; they can be investigated. So why not use models like the ones discussed here to evaluate the process and try to account for these factors? I think that would already be a serious improvement. Yeah, that's it. Do you want to take five? For me it's not necessary. Okay, cool, questions.

[Audience] Well, thanks, and you are both authors, thank you. I have a lot of questions, but first: I don't know if you saw an email I sent a few days ago; there is this open call for a seminar on open science, the possibilities for open science in the humanities and social sciences. I know some of the people who organize this in the Netherlands, and I think it's the kind of stuff they are interested in. It's not about open science itself, but I think they are really interested in understanding what open science means for the humanities, so I suggest that maybe you check it out. And then I have a question, maybe more of a comment. Your model seems to work, in the sense that it shows, as you say, that it's hard to do socially relevant research in the humanities; your results seem to show that. And that's not surprising if you take into account what people take socially relevant research to be: empirical, about the present, and so on. But also because it reproduces what the funding agencies want you to do: empirically based work, close to the sciences, about health and things like that. So in a way it's not surprising, maybe also because of the relationship between peer review and societal relevance, and how it has been historically constructed. Because, not coincidentally, peer review and socially relevant research were two things that were institutionalized in the 1970s, peer review precisely as a means to make sure that science was accountable to society. I don't know if I'm explaining this well, but the model seems to work because it reproduces, in the science that is already published, the sort of accountability regime of the funding agencies.

Well, there is of course a big circle here, because we start from these causal assumptions: we're telling the model, this is how our data is generated, these are the causes. Of course the model can still say, well, this cause is not very strong, there's nothing happening there; so the model does tell us something. But it's not so surprising that our model points to these causes, because we're literally telling the model that these are the causes. You always have to be very careful about this: the causal structure you put into the model is what the model assumes is true when it gives you the values of those parameters. But you can see that document type, for example, seems to have basically zero impact, so we do seem to be picking up on something; document type acts as a sort of control. I mean, there's a big difference between being a book and being about minorities:
having a paper about minorities really increases the chance of being selected by the raters as societally relevant. But you always have to keep in mind that the model cannot point to other causes, because we haven't told it about their existence.

[Audience] I've got a lot of questions, but I wanted to ask about this idea you gestured at toward the end, because I think it's a really interesting upshot of this kind of work: trying to operationalize something like distance between fields as a variable, to get a handle on this question of bias versus expertise. One thing I was trying to crunch on during the presentation of the results especially: what would the parallel study in the sciences look like? I think it's actually really interesting to think about what a physicist would think of a paper about, you know, the community ecology of micro-arthropods in moss on a rock in the UK somewhere. The physicist is going to say: hot garbage, right? For me that's maybe the biggest elephant in the room here: how to disentangle that bias-versus-expertise effect. And related to this (let me try to turn this into a question rather than a comment), there is one negative aspect of using abstracts, which is that in abstracts you're talking to your peers. If I'm the biologist writing the micro-arthropod moss study (I was actually reading a micro-arthropod moss study this morning), I'm writing to my friends, to my colleagues, to people who will read it and understand why it's useful, because moss communities have massive diversity in their micro-arthropod communities, so it's actually a super interesting model system, and my buddies know that. So there's an interesting question here: maybe another thing to do would be to repeat this with grant abstracts, where you're trying a little harder to talk to people who won't have any idea what the hell you're talking about.

That's a good point. The reason we decided to go with mostly journal-paper abstracts is that the incentive structure is such that we spend a lot of our time writing those; they're the basic currency that drives everything, so they're a lot of what we're actually doing. And while it's true that we're not speaking to outsiders in them, that also means outsiders probably won't read them, so the chance of impact through them is limited anyway. But it's true that this misses part of the relevance of the research.

[Audience] Maybe, with regard to the comment on the distance between fields, there is some kind of bibliometric way of making sense of these things. Like, how separate are the citations? That would be something I would think of. One of the classifications, I notice, is the one Web of Science uses; the Web of Science classification of disciplines is, I think, based on some kind of bibliometric clustering.

It's hard to see how that would help here. These citation metrics are always normalized by field, so it's very difficult to compare across fields.
[Audience] Just something like: are we sometimes writing in similar journals? For instance, there is this journal called Linguistics and Philosophy (they reject my papers, but still): linguists and philosophers review each other's papers. The Venn diagram between us and literature is going to be way smaller; there's probably just a tiny little overlap. Literature people still read Lacan, maybe. Just putting some biases on the table here: I don't read Lacan.

But in all the big classifications philosophy is just a single thing, and for a lot of fields in the humanities there's just one category, so that's a problem. That's also, of course, a general problem of bibliometric analyses of this sort: if you want to bring them from the sciences to the humanities, our citation practices suck.

[Audience] Yeah, they're terrible. At the FWO, I think you have to assign some disciplines to yourself, so you could use the distribution of those over people, if you have access to the data on the research portal: how many proposals that have a philosophy field code also carry a linguistics field code? The FNRS has the same thing.

We would probably just cite papers in another field anyway. I can imagine the micro-arthropod people, with the moss and the critters, citing general ecology papers, but rarely much beyond that, I think.

[Audience] I have a question related to the idea of looking at other things than abstracts. Perhaps the humanities have a way of being socially relevant that is not through academic publications, which might also explain why the number of citations is not that great. Of course, if you take papers as your source, you cannot really run anything but bibliometrics. But I can imagine, for example (and this is a very localized understanding of the issue), that the person working on religion is also doing charity work in the evenings, and that this is related to his or her academic work, but not societally relevant in the way you've constructed it. So I'm wondering if those differences between the humanities and the sciences are hard to pick up by methods that were designed to assess the natural sciences rather than the humanities to begin with.

It's definitely true that almost all the societal relevance of the humanities is not in the papers. Those in-depth analyses I was talking about are, I think, the only way of really capturing that. There are other ways: in the UK, with the REF, all universities and departments have to submit case studies every couple of years that show societal relevance, so there are data sets you could use. But the problem remains that the main currency in academia is these papers. As a philosopher, maybe my societal impact is not through my papers, but I do spend 90% of my time writing those papers. So the problem might in fact just be that we're writing too many papers and not doing all the other things we could be doing. So yes, definitely; what we're doing here is a more meta-level point.

[Audience] It's been years since I picked up the relevant Kitcher, so I honestly forget whether he mentions something like this or not, but one way of presenting the project that might help fend off the angry response is this. One thing you might say (I don't know if Kitcher does say it) is that the ideal committee might also make the meta-level decision about what ratio of apparently irrelevant work they want, and it's actually very plausible that that's not
going to be zero, especially after you explain to the committee that lots of weird serendipitous stuff has come out of basic science and crazy humanities research; sometimes that turns out to be useful.

Right, this is the argument I always get when I talk about this. It makes sense for mathematics and theoretical physics, maybe, which are a lot like the humanities, super weird obscure stuff, but sometimes they produce something that changes everything. For philosophy, though, it's really hard to think of examples of this serendipity that really work.

[Audience] I mean, this partly depends on the time window. I'm inclined to count the positives, because I think a lot of 20th-century physics doesn't make sense except once you realize that all those guys were reading up on their neo-Kantians, you know, trying to understand space and time.

Like I said, it depends on what kind of time window you use. I'm thinking about the current paper-producing industry, which is at best probably 50 years old, maybe 60.

[Audience] Yeah, that's a different thing; like you said, you've got to limit it. But I just meant the more general point about the committee. Essentially, a softer way to sell the project would be to say: the committee can't even pick its percentage if we don't understand what the percentage currently is, and what the factors are that lead to it being what it is at a given point. And very plausibly, the percentage the ideal committee would pick would be way higher than 36, right? So your main point still stands. Even if the committee says, look, we think doing crazy stuff sometimes pays off, keep 20% for the weird science and humanities people who do "nonsense": cool, but you still have to know what drives the percentage to have the value it does. And it seems very plausible to me that no ideal committee would say 36 is the number.

Yeah, that feels correct. It would also be interesting to have research just on that serendipity argument: okay, it sometimes happens, but is it really worth the cost? Scientific research is immensely expensive, and now and again we get a lucky hit, but maybe that doesn't justify it.

[Audience] There are many cost issues, right? For instance, what if you have very wrong ideas, phrenology-style? If you were promoting scientific racism in the 19th century, that was also the sciences slash humanities.

And apart from that, these are generally smart people who work in academia; all of them could have done something useful in society, so there are opportunity-cost kinds of worries. It's a bit as if it's a sheltered workplace for the highly gifted: instead of being able to contribute productively, they are just chatting with each other in the armchair. If you want to make a cost-benefit analysis, it can't be just a benefit analysis; you have to consider the whole thing, costs included.

[Audience] I was wondering: when explaining societal relevance, do you have any ideas about how to distinguish between the topic of the research and the quality of the research? Because I guess both influence societal relevance. There can be a lot of philosophy that is irrelevant because it's
about obscure topics, but a lot of philosophy is also just very bad philosophy even when it is about very relevant topics. Inductive risk is an example of a very useful topic with a lot of very bad literature about it. Would you incorporate that?

Yeah, that's a good point. Our study basically only looks at content, because I don't think you can assess quality through an abstract. I think you can assume that most of them will be of at least average quality; the abstract is published, so in a sense the sample is already biased in one direction. But it's a good point. Another issue is that evaluating quality in philosophy is so hard. If you look at the lottery of just getting something published, some reviewers might accept a paper and others reject it, and it doesn't always seem to have much to do with its quality. So I don't think there is a scalable way to look at the quality of the research; in philosophy there just isn't one, short of really reading all the work and evaluating it, and even then it's arbitrary, because peer review is so arbitrary. For the sciences, in the best case, where you have some kind of testable outcome, you can replicate studies, like in psychology, and well-done statistics with real power would give some kind of indication. For the humanities I can't think of a way of tracking that.

[Audience] I wanted to ask how you could operationalize your model, how you could make it useful. Could you imagine, for example, a sort of software where you feed in abstracts and then select checkboxes for the kind of societal relevance you're looking for (I want something that's ethical, about the present, empirical) and it filters? What I'm trying to get at is that perhaps the device you've constructed could feed the publication machinery that academia has become, in a way.

So the most obvious use would just be to analyze the peer review process itself. If you have a committee evaluating grant proposals, and you have all these peer reviews, you can model the whole process: the reviewers, the committee members, the biases you think might exist. Feed everything through a model and control for the biases and uncertainties in the reviews. The other thing you're describing sounds like the use of AI for peer review, which people are also already looking into, but you would need something that can actually understand the abstracts; I don't see brute text searches doing it.

[Audience] You can't yet ask ChatGPT "and please don't make shit up"; that's not an available button in the model. So until we have that button, I feel like we're not going to get there.

Yeah. And this model has maybe 1,500 parameters; what you're describing is more like 30 million or so. It's wild.

[Audience] I'm also trying to think about the practical side, about what could change most immediately. If you walked this over to the FWO tomorrow and said, here's the data, what would be the best immediate impact on, say, grant review? I guess this partly weighs in favor of a lottery system?

Probably. I mean, I went to the FWO about three weeks ago with another data set, not this one, but also one that shows there are problems with peer review, and obviously I think the conclusion is to reduce peer
review, like with a lottery or baseline funding, both fine by me. But people get very defensive, and nothing happens. We've been trying for a very long time, and nothing will happen, even though the scientific evidence has been showing for a very long time that this use of peer review doesn't make sense. It's very tricky: it is very clear that there is a problem, but people within the system get very defensive. Which is understandable; it's also understandable that they are conservative, because when they make changes there are risks, and with the system they're using now, at least they know what it does. So they will be more reluctant to change things than I would be, because there are no costs for me. There are costs to the current system as well, of course. But I think it's obvious that funders like the FWO, even if they don't want to change everything immediately, should run experiments. There are already a few: in New Zealand, at the Volkswagen Foundation, and at the Swiss national funder, they all have their own lottery experiments running. I think the minimum is clearly to try these things, just to get some data. And the things I've said here aren't even the most important problems with peer review. The main problem is that it's immensely expensive: people spend so much time writing applications and reviewing them, typically professors whose time costs a lot. Even if a lottery would let in the cranks and dabblers that peer review would stop (and it's not so clear that it would), it would still be cheaper to just fund those cranks and dabblers and lose that 5%. So things will change eventually, but I don't think data like this will change anything by itself. It confirms that these rater features are so important in these evaluations that any signal you're trying to pick up about the value of the work just drowns.

[Audience] Yeah. And I guess what I was also trying to play around with: what changes when you go from retrospective to prospective evaluation? But I guess the short answer is that it only gets much worse.

Yeah, because that's not even in here. What we did is backwards evaluation, in the sense that you can assume at least somebody thought all of these papers were kind of okay. With grant funding you have to predict, which no one can. So that's just going to compound all of this.

One thing I still wanted to ask, because it was mentioned but I never followed up on it: this difference between chauvinism as a bias in favor of your field and some kind of expertise effect. Can anyone think of an empirical way of teasing these apart? I've been thinking about this, but I just don't see a way of setting up an experiment to see what's going on there.

[Audience] I wonder if there is a way to get at this with survey-type questions. For me it feels conceptual; it's not a matter of testing. It feels like the same thing under different valences: chauvinism has a very negative valence and expertise a very positive one, but as I see it, they're two sides of the same coin. You become chauvinistic by being an expert.

[Audience] But either the value is there or it is not. In one case the rater is overestimating the value that is there, and in the other case the other raters fail to see the value because they don't have the expertise.

Can we go back to the field-by-field histograms? Which ones?
[Audience] Yeah, these. One thing I thought was really interesting: philosophers really hate literature, but also look at the literature raters themselves, at how many fives they give literature, which is weird. So part of what I'm thinking (and I don't know how you get at this; maybe there is a way via a survey question) is that if I'm looking at an abstract, there are two ways I might decide to give it a zero. I can give it a zero because, look, even if you did banger research on this question, the problem is just not societally relevant. And that's the sense you wanted people to use: imagine God did the research, so it's awesome, and it's still not relevant, sorry. But when I see that five-for-literature thing, part of me wonders whether part of what's happening is more like: anything on this question would suck, whether or not it's societally relevant, just because this is a garbage part of our field, and therefore any work on it sucks too. So even though you were trying to screen off quality, you'd be getting a kind of concealed quality evaluation. Maybe (and it's not as if this is live on the internet or anything) I'm going to give a five to any philosophy paper about Derrida, just because, look, I don't even care whether you think it has societal relevance; it's not going to, because it sucks. I'd be making a kind of masked quality judgment. So how could you screen off those masked quality assessments in this kind of context? Maybe you give people a hundred abstracts without telling them they were published, you start by having them do a scientific quality assessment, and then you take the top band and say: okay, you thought all of these were good; now tell us which ones are societally relevant.

I mean, the effect you're talking about goes in both directions, right? Maybe there's some part of your field that you think is so awesome that any paper from it gets a high score.

[Audience] Yeah. I can more easily come up with examples of that in the sciences. You could think: biodiversity loss is so important that all ecology papers are intrinsically societally relevant, because we desperately need to know a whole lot more about everything pursuant to ecology, so it just doesn't matter, they're all great. But no, you're right, and my proposal would only screen off the negative end; it would leave you with an over-inflated top end.

Maybe we can also just ask explicitly, for each abstract, whether they think the quality is high, or something like that.

[Audience] Or maybe by making made-up abstracts, where you know, for example, that the quality is really bad but the abstract presents itself as societally relevant. You make up one of those abstracts from, say, philosophy, with made-up words and concepts that sound really societally relevant but are just nonsense, and then maybe you can filter out people who are assessing... you're filtering out not knowledge of the field, but they might still think
I don't know; if you play around with the abstracts and make up some vignettes, you could get somewhere with filtering. The downside for me is that you always divide your population by the number of versions you have, so you need a much bigger population to get the same kind of data.
Sorry, I forgot the scale: one means it's really good and five really bad. Five is bad? Okay, so the literature raters don't like their own field that much, but they are very consistent towards the rest, whereas the historians are full of themselves but otherwise very ecumenical. Yes. One thing also worth mentioning is that for literature we only have two raters, while the other fields have a few more, so maybe it's partly sampling noise. But it's still interesting that they are so unchauvinistic.
Religion is also extremely weird, right? They're super zen. I look at that and think the religion people are very objective, it's just so flat. Yeah, but there's a catch there: for religion, one rater was very negative, so the two raters cancel each other out. One was super harsh on the other religion projects and the other was not. Got it, so that averages out.
What did your recruitment process look like? We sent emails to all the relevant departments. We deliberately didn't go through supervisors, because we wanted to make sure we didn't put pressure on the PhD students to participate. We just sent emails saying how much we pay, how much work it is, and what the deadline is, and then people signed up.
And why paid PhD students? We wanted people who were experts, because they had to be well informed. There is no way we could get professors to do this, and probably no way to get postdocs either. Among PhD students, certainly at the arts faculty, there are quite a few unfunded ones who are always looking for jobs on the side, so practically it was fine to pay them. But we basically took any PhDs we could get; it was really difficult to find these 16. We did a pilot with four philosophers, but if we tried to replicate this, I think we would go through some kind of online platform.
I just don't know how to disentangle these effects. In a sense it's not so important: all you need to know is that the raters differ. For improving the evaluation process, it's not so important to know the causes or in which direction they're biased. True, but it's conceptually interesting.
Well, I don't think this is the right approach, but one way to really dig in your heels here would be to say that all of this is dominated by raters essentially not knowing what they're talking about. That's what makes this difficult from a policy perspective. You could look at a lot of this data, not all of it, because there are still obvious problems, and your response could simply be to retrench and say: see, this shows that the only people who understand how to read literature papers are literature people, so we need small peer-review panels of our own people to review our grants. But then the fact that there is such heterogeneity within the fields as to chauvinism is a major problem; you still have that problem.
And I suspect that if you could disentangle them, the disentanglement would also speak against that retrenchment; I suspect it wouldn't turn out to be just a matter of knowing more about the field. So disentangling the causality would be valuable, because it would also help you push back against that retrenchment, if it's true. But I'll be damned if I have any idea how.
You can probably make a very simple argument there. If there is so much chauvinism in one of the religion raters and so little, even negative, in the other, then if they really had the same view of things, the same idea of societal relevance, it would be very weird for them to diverge like that. But there are differences in expertise even within a field. Then the retrenchment position becomes very contrived, in the following sense: the panel would need exactly the right people making the judgment for these specific proposals, and that's an impossible ask.
Also, if you look at the abstracts, and we read all of them, for many of them very little background knowledge is needed to know whether they're societally relevant or not. Take the one that got the lowest score: okay, I don't know those texts, but if even I don't know them, then I know almost no one will, and it's extremely unlikely that studying them is going to be very relevant. So I think the easier explanation is just this: these are people doing a PhD on that very topic, who are clearly interested in it and committed to spending their lives on it, so they will probably find it more important than average.
Could it also help to explain things that we each have a different idea of what societal relevance is? When one rater sees yet another study of some religious text, they may already be convinced that nothing in that area is societally relevant, and I don't know how you could code in the fact that these conceptions differ. You could try to elicit what societal relevance means to specific people, or fit parameters for the causal factors per rater, to see whether, say, one rater always weighs one consideration and another rater a different one, because there are big differences there too: what is societally relevant for a historian might not be for a philosopher, or the other way around. Well, we didn't ask them what they themselves think is societally relevant; we asked them to imagine the ideal committee's deliberation. We did not ask "do you think this is important"; we really tried to emphasize that they shouldn't judge like that. Of course they did so anyway to some extent, because these differences remain.
So, just to understand more clearly: your data shows that peer review is not effective at picking out societally relevant research? Well, it shows that whatever differences in societal relevance the model says there are, they are washed out by differences between raters. There are probably differences in societal relevance, but they are so small that the process cannot capture them, because the process has these biases in it. The only way to do it with such a process is to have a good model of the process and to use that model to tease out how the various factors play a role.
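As a rough illustration of that last point, here is a minimal sketch, not the study's pre-registered model, of how large rater effects can swamp small true differences, and how explicitly modelling the raters recovers some of the signal. The design, the effect sizes, and the simple mean-subtraction adjustment are all invented; a real analysis would fit something like a crossed random-effects or IRT-style model jointly.

```python
# Toy illustration: rater leniency differences (sd 1.0) dwarf true
# relevance differences (sd 0.3). All numbers are invented.
import numpy as np

rng = np.random.default_rng(0)
n_abstracts, n_raters, raters_per_abstract = 345, 16, 2

relevance = rng.normal(0.0, 0.3, n_abstracts)  # small true differences
leniency = rng.normal(0.0, 1.0, n_raters)      # large rater differences

# Each abstract is scored by two randomly chosen raters.
rows, cols, vals = [], [], []
for i in range(n_abstracts):
    for r in rng.choice(n_raters, raters_per_abstract, replace=False):
        rows.append(i)
        cols.append(r)
        vals.append(relevance[i] + leniency[r] + rng.normal(0.0, 0.5))
rows, cols, vals = np.array(rows), np.array(cols), np.array(vals)

# Naive estimate: mean raw score per abstract. It is contaminated by
# which raters the abstract happened to get.
naive = np.array([vals[rows == i].mean() for i in range(n_abstracts)])

# Model-based estimate: subtract each rater's estimated leniency
# (their overall mean score), then average per abstract.
rater_mean = np.array([vals[cols == r].mean() for r in range(n_raters)])
adjusted = vals - rater_mean[cols]
modeled = np.array([adjusted[rows == i].mean() for i in range(n_abstracts)])

print("corr(naive, truth):  ", round(float(np.corrcoef(naive, relevance)[0, 1]), 2))
print("corr(modeled, truth):", round(float(np.corrcoef(modeled, relevance)[0, 1]), 2))
```

In runs of this toy, the rater-adjusted estimate tracks true relevance noticeably better than the raw mean, which is the sense in which a model of the process can tease out a signal that raw scores hide.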
Max's question gets at something I wanted to ask. When you were instructing the raters, could you tell, I know this is an impossible question to answer, but could you tell whether they were taking seriously the idea that they were supposed to be doing this strange view-from-nowhere evaluation?
There were some who clearly got it, but there were also a few for whom I really wasn't sure they understood it, also judging by the emails they sent afterwards. We had certain pre-registered criteria for excluding data, and there were definitely some raters who I don't think really grasped the whole scheme. We really tried to explain it and really tried to emphasize it, but after eight hours of rating...
How many did each of them have to read? 345. Oh, then I can understand. It took them about eight hours; we gave them a couple of weeks, so they didn't have to do it in one sitting. But speaking from experience, after a while people just default to whatever is cognitively easiest, and some shortcut is always going to be easier than really sustaining this view from nowhere.
Yeah, we also tracked the time they spent, and that varies a lot between raters; some took it very seriously, some didn't. We would only have excluded raters for timing if they had a couple of blocks that were too short to actually read the abstracts, and that never happened, so we think they at least read everything. We did exclude one rater, though, because he had something like eight blocks with inconsistencies between the ranking and the binary judgments; he ignored the constraint. We had pre-registered that criterion, and we just take it to mean he wasn't really paying attention. Good that you tracked the time; having the timing data and the consistency constraint at least means a rater can't be completely asleep. That's really important, and it's a big problem generally: if you're ever going to run Mechanical Turk studies, you have to come up with strong criteria for how to be sure that someone actually used more than one brain cell to complete your study.
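For what it's worth, checks like the two just described are easy to automate. Below is a minimal sketch with invented column names and thresholds, not the study's actual pre-registered values: flag raters who completed blocks too fast to have read the abstracts, and raters whose 1-to-5 ranking contradicts their binary fund/don't-fund judgment too often.

```python
# Sketch of the two attention checks described above: block timing and
# ranking/binary consistency. Column names, data, and thresholds are
# illustrative only.
import pandas as pd

ratings = pd.DataFrame({
    "rater":   ["r1", "r1", "r1", "r2", "r2", "r2"],
    "block":   [1, 1, 2, 1, 1, 2],
    "seconds": [410, 410, 95, 600, 600, 580],  # time spent on the block
    "rank":    [1, 4, 5, 2, 4, 5],             # 1 = most relevant ... 5 = least
    "binary":  [1, 1, 0, 1, 0, 0],             # 1 = "ideal committee would fund"
})

MIN_BLOCK_SECONDS = 120  # any faster and the abstracts cannot have been read
MAX_BAD_BLOCKS = 3       # tolerated blocks with ranking/binary contradictions

# A row is contradictory if a top-ranked item is marked "don't fund"
# or a bottom-ranked item is marked "fund".
ratings["contradiction"] = (
    ((ratings["rank"] <= 2) & (ratings["binary"] == 0))
    | ((ratings["rank"] >= 4) & (ratings["binary"] == 1))
)

per_block = ratings.groupby(["rater", "block"]).agg(
    seconds=("seconds", "first"),
    contradicted=("contradiction", "any"),
).reset_index()

per_rater = per_block.groupby("rater").agg(
    fast_blocks=("seconds", lambda s: int((s < MIN_BLOCK_SECONDS).sum())),
    bad_blocks=("contradicted", "sum"),
)

excluded = per_rater[
    (per_rater["fast_blocks"] > 0) | (per_rater["bad_blocks"] > MAX_BAD_BLOCKS)
].index.tolist()
print(excluded)  # -> ['r1'] here: one of r1's blocks was read too fast
```

The point of fixing thresholds like these in advance is that the flags can then be computed mechanically and raters dropped before any model is fitted, which is what pre-registering the exclusion rule buys you.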
Okay, what are you planning to do with this? Where are you going to write it up and submit it? I'm not really sure yet; I'm still considering whether we should try to replicate it with other raters first. Also, we did pre-register, but we didn't pre-register this exact model: I hadn't figured out yet how to get the ranking data and the binary data into one model, so the models we pre-registered use exactly the same ranking data, but I think this combined model is better. It would be nice to pre-register this version, then replicate, and have a really tight study.
It sounds like the logistics for this kind of work are terrible. Yeah, the payments alone: we had to pay the raters through the university, as if hiring staff, and that makes it extra work. Right, you can't just send an Amazon gift card and pretend we're all good. It would be fun to run the study where you live. Actually, based on my experience of what it took for me to pay three students to be chat moderators during an online conference during COVID, it would be no easier, unless they could just fill out our F1 form. Oh right, so when we do it, I hire your people, and when you do it, you hire our people, and then we're good to go, because then it's the external reimbursement form and that's easy. Really? Big brain move. You researchers just need to talk to each other for a moment and set it up.
During my days of despair I had a lot of time to think about this, in between the weeping. For another study, an online survey, we promised to donate one euro to charity for every participant, and that was even more miserable to get paid out. You can buy so many things, but giving money to charity is nearly impossible. What do you mean, you're not buying anything? We're just giving. What? What do you get for it?
Very cool, well done. Thanks so much.