Hi, everyone. Good morning, good afternoon, good evening. We're very happy that you have decided to join us for our session on measurement, and more specifically measurement issues in psychological science. My name is Esther Maassen. I'm a PhD student in methodology and statistics at Tilburg University in the Netherlands, and today Eiko Fried, Andrea Stoevenbelt, Jessica Flake and I will be presenting talks on various issues related to measurement that we found in our research efforts. I will not spoil the contents of these talks too much; I will say that our talks hopefully convey how crucial it is to give attention to measurement in your own research, when peer reviewing research, or when consuming research. The order of the presentations will be Eiko, myself, Andrea, and Jessica, and we would like to ask you to put your questions in the Q&A box at the bottom of your screen. We will get to those after each talk, and if we don't have time to answer all questions, I am sure our panelists will be happy to do so in that text box as well. Without further ado, I would like to introduce our first speaker, Dr. Eiko Fried. Eiko is an associate professor in clinical psychology at Leiden University in the Netherlands. He will be talking to us today about depression, more specifically about the limited progress of the research surrounding depression. If you're ready, I would like to give the floor to you, Eiko. Go ahead.

Thanks. Thank you so much. I hope everybody can hear me okay. Welcome out there, to all the screens far, far away, and thanks for organizing this symposium. I'll try to keep this general; I'm aware I'm setting up the symposium here. I will be talking about one construct, major depression, but I hope that all of you out there can think along about whether the things I mention apply to the constructs you're working on, whether that be personality, cognitive abilities, grit, or something else.

All right. I'm publishing a preprint Monday, if all goes well, together with Jessica Flake and Don Robinaugh, and this talk is based partly on that preprint and partly on the work we've been doing. Don has called this preprint my manifesto, because I've been working on depression measurement for a few years, and he has been teasing me since 2015 to finally write it up. We found a nice outlet now, so I'm going to give you the best-of — not necessarily of my own work, but the best-of of depression measurement.

First off: depression. "I'm not a clinical psychologist, why should I care about your talk, Eiko?" Well, I'll tell you why. Depression is a super common, highly prevalent disorder. It is very debilitating for many people; suicide rates are high, comorbidity is high, and so forth; it is often a severe, recurrent disorder; and it's very costly to families, to the people affected, to universities, to the economy, and so on. The second reason you should care is that I'm going to argue that depression is the most commonly assessed construct ever, across time and space, in all sciences. Why do I say this? A 2014 paper in Nature looked at the 100 most cited papers across all sciences, ever. Three of those papers are depression scale validation papers, from 1960, 1961, and 1977: the Hamilton scale, the Beck scale, and the CES-D by Radloff. I looked into this last year; they had about 81,000 combined citations — there have got to be many more today — and each of these scales has been cited in over 141 disciplines on the Web of Science.
All right, I mentioned this already: here are Jessica Flake and me, meeting a few years ago, trying to look into the validity evidence of depression measures. I couldn't make it to this photograph, but that's what I will do today: talk about validity.

For all of you out there who don't know how we measure depression: what do we do? Well, we either give people a questionnaire, somewhat like an intelligence test, or we interview people clinically. These questionnaires or interviews ask about a number of symptoms: how sad are you right now, how pessimistic have you been in the last week, are you worried about the future, suicidal ideation, these sorts of questions, and so on and so forth. So we assess the symptoms, and then we add them up to one score — 99.999% of all papers do that. It can either be a sum score (you have, say, 47 points) or a 0/1 score (you have depression or you don't have depression). And then we use that score as a dependent variable, predictor, moderator, mediator, and so forth. Common uses for such scores are many. The most common are probably diagnosing patients (yes/no), case-control studies — comparing, say, the genetics of people with versus without depression — and tracking treatment over time. But many personality psychology or social psychology studies will also just try to control for depression, adding it as a potential confounder, for example, which is another common use.

As I said, we want to talk about the issues today, so I have a box of issues for you. I'm going to talk about five things, I think, depending on how I do on time. Let's dive right in.

There are over 280 different scales for depression that have been used in the literature in the last 100 years. Many are still in use today; a recent review identified 19 different measures used as primary outcomes in 30 clinical trials for depression. It's a complete disaster. The very first sentence of the most cited depression scale, published in 1960, 60 years ago, says: the appearance of yet another rating scale may seem unnecessary, since so many already exist and many of them have been extensively used. Of course, after Hamilton, many, many dozens of further depression scales were developed, used, validated, and so forth. Studies usually use one single scale — there are some exceptions, but the norm is to use one scale, either the Hamilton, the Beck, and so forth — and yet we draw general conclusions about depression: "we found that depression is related to female gender." That relies on the assumption that scales are interchangeable: it doesn't matter which scale you use, you would find the same result. That's not the case. You see me here at the bottom right during my first postdoc, a while ago, when I looked into this question of how interchangeable scales are; this is the paper. What you can see here is that I looked at the seven most commonly used depression rating scales, in different colors, and we identified — being very conservative — 52 different symptoms. For example, for the CES-D, the green scale, you can see lots of green dots on the right side. That means these symptoms appear only in the CES-D and in none of the other six scales. So there are massive differences in terms of content. You can look at this in more detail in the paper; I'm not going to go through this graph today. 40% of all symptoms appear in only a single scale, and only 12% of all symptoms appear across all instruments.
And the summary of this — I couldn't put this in the paper, unfortunately — is: it's bad. I think that's fair, because research done with one scale may well not generalize to another scale; you really measure somewhat different constructs with these different scales. This is a diagram of the agreement on how to conceptualize and measure depression. And, of course, all these different scales open the door to questionable research practices, because you could just administer two scales and report the one where you find the effect — the scales often correlate only around .4 or .5 with each other.

"Okay, Eiko," you might say, "these are all scales, but what about the DSM, the Diagnostic and Statistical Manual of Mental Disorders, which has nine symptoms for depression? That's used for clinical interviews; surely that is better than all these scales we've been talking about?" Right. Okay, here are the nine symptoms of depression in the DSM. Those of you looking at this might notice the oddity that there are eight "or"s in there; it's always a little psychometrically weird, I think, to ask questions with "or" in them. You might also notice that three of these criteria contain literal opposites: you sleep too much or too little, yes or no? Symptom three is my favorite: increase in weight or appetite, or decrease in weight or appetite — yes or no? These are yes-or-no questions. We calculated that there are about 10,300 combinations of these symptoms that all uniquely qualify for a diagnosis of depression. So you could see 10,300 patients, no two of whom share the same symptom profile, all of whom have the identical diagnosis of major depression. We looked into this empirically — there's a new paper out, last week actually, from Lorenzo Lorenzo-Luaces and me, where we replicate this finding — and the upshot is that these symptom profiles are not just mathematical speculation; they are realized in patients. Here we looked at 3,700 patients and identified over 1,000 unique symptom profiles, just using the DSM symptoms. So there are massive differences between people, and that raises the question of what it even means to have, say, 14 points on a depression rating scale.
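To make the combinatorial claim concrete, here is a minimal sketch in R (an editorial illustration, not the speaker's calculation). A DSM diagnosis of major depression requires at least five of the nine symptom criteria to be present, including at least one of the two core symptoms (depressed mood, anhedonia); counting the qualifying subsets of the nine criteria gives 227 combinations, and it is the additional splitting of the compound "or" criteria mentioned above that pushes the count into the five digits quoted in the talk.

```r
# Count the subsets of the 9 DSM symptom criteria that qualify for a
# major depression diagnosis: at least 5 criteria present, of which at
# least 1 is a core symptom (depressed mood or anhedonia).
qualifying_profiles <- function(n_criteria = 9, n_core = 2, min_present = 5) {
  sum(sapply(min_present:n_criteria, function(k) {
    # all k-subsets of the criteria, minus those containing no core symptom
    choose(n_criteria, k) - choose(n_criteria - n_core, k)
  }))
}
qualifying_profiles()  # 227 qualifying combinations at the criterion level
```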
Then we looked at the statistics. Jessica and the other panelists will talk about this more, but you can imagine that — given that these scales were developed in the 1960s and most of them have not been updated at all; the Hamilton scale is still untouched, same items, same content, everything has remained the same — these scales have terrible psychometric properties. For example, they're all multidimensional: everything between two and seven factors has been extracted, and replication of factor structures is terrible. The scales also lack temporal measurement invariance, meaning they measure different constructs over time in the same population, which is a big no-no in psychometrics.

Another thing: if you look at data from these scales, what you do not see is the graph I show you here — I simulated these data on my computer. This is what you get if you have sick people and healthy people and the disease is a true category. Measles, for example: very few people have "a little bit of" measles; you have measles or you're healthy, and you get this distribution. For depression — and this is a real data set of, I don't know, 30,000 people — you get somewhat of a normal distribution, or a slightly skewed distribution, and so picking a cutoff here is quite arbitrary. Nonetheless, case-control studies are the norm, and we categorize people into ones and zeros, which is simply not supported by the data.

This is the Hamilton rating scale, data from two and a half thousand people. All of these items are added up, and I've simply visualized the correlations for you (a red edge means a negative correlation, green a positive one). You can see that item 17 correlates negatively with many, many other items — and it is not negatively phrased; this is how the scale was designed. And yet we add item 17 to the other items, and have been doing so for 61 years. It's really quite remarkable. Overall, psychometric quality is very low.

Scales are also not used as intended. For example, Hamilton said in his 1960 paper: do not use my scale for diagnosis; my scale is only for severity in already-diagnosed patients. Yet we use it universally today to establish diagnostic prevalence rates. Hamilton said: don't use one sum score of the scale, use separate scores for the subscales. Nobody does that today.

Transparency. Looking at the Hamilton scale: the Hamilton exists in versions with six items, seven items, nine, 13, 16, 17, 21, 24, 31, and 34 items, and these versions have been translated into dozens of languages, so there are probably, I don't know, 5,000 versions of the Hamilton scale out there. For measurement, it's really important to know which scale people used. Together with a master's student, Neumann, we randomly drew 100 papers from clinical journals using the Hamilton as the primary outcome; 40% did not provide any information on which version of the Hamilton was used, and only 8% justified why they used the Hamilton in the first place.

The last point is interrater reliability. In the DSM-5 field trials, thousands of people with mental disorders were each sent to two clinicians who were blind to each other; the two clinicians each gave the person a diagnosis, and then you calculate the agreement between the clinicians. That's the interrater reliability. There were many, many clinicians and many, many patients, and for major depressive disorder, reliability was among the lowest of all mental disorders, with a pooled kappa of .28. I don't even know what to call a kappa of .28 — I think it's below the lowest threshold in the book, so that would be "abysmal" or something like that. I'm going to skip ahead so we have plenty of time for the other talks; you can just trust me that a .28 interrater reliability is abysmal, especially when it comes to making important choices about people's lives. You can look this up in the slides I'll send around later. I'll do this in 10 seconds: I simulated data here. On the left side you have the true state — ones are depression-yes, zeros are depression-no — with a prevalence of 30%, and I simulated two clinicians with a kappa of .28. On the right, again, one is a diagnosis and zero is no diagnosis, and all the red marks are disagreements between the two clinicians at a kappa of .28. That is not where we want to be today when it comes to diagnosing mental health problems.
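For readers who want to reproduce the flavor of that simulation, here is a minimal sketch in R (a reconstruction under simple assumptions, not the speaker's actual code): two clinicians independently diagnose with some error, with the error rate chosen so that Cohen's kappa lands near the .28 from the field trials at a 30% true prevalence.

```r
set.seed(1)
n <- 100000
truth <- rbinom(n, 1, 0.30)        # true state: 1 = depression, prevalence 30%
flip  <- function(x, err) ifelse(rbinom(length(x), 1, err) == 1, 1 - x, x)
r1 <- flip(truth, 0.22)            # clinician 1's diagnoses
r2 <- flip(truth, 0.22)            # clinician 2's diagnoses; err = .22 yields kappa near .28
p_obs    <- mean(r1 == r2)                                         # observed agreement
p_chance <- mean(r1) * mean(r2) + (1 - mean(r1)) * (1 - mean(r2))  # chance agreement
(p_obs - p_chance) / (1 - p_chance)  # Cohen's kappa, roughly .28
mean(r1 != r2)                       # share of patients with discordant diagnoses, roughly a third
```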
So, the summary slide for the day — if you pay attention to only one slide today, this is the one. It looks pretty bad, unfortunately. Knowledge about depression is largely based on studies using one specific scale, which means there are huge issues of generalizability and replicability, because so many scales exist and they all differ from each other. The most commonly used scale today — the Hamilton is still the most commonly used scale in antidepressant trials; 92% of over 550 recent trials use the Hamilton as the primary outcome — is from the 1950s and 60s, and it's not a good scale. And scales routinely fail basic psychometric criteria such as unidimensionality and measurement invariance.

In the paper we talk a little about ways forward. There are many; the most important one to me is to fund and publish measurement research. And don't tell people like us who try to publish measurement research that we should publish it in small specialized journals, because people who read small specialized measurement journals already know about these problems. These are not the people we need to talk to; we need to talk to the broad field. We need to change the "measurement schmeasurement" attitude we have in the field, the attitude that measurement isn't important: measurement is the heart of science, including psychology. And we need iteration. Scale validation is an ongoing process that requires updating; the idea that you make a scale and are then just done with it after a few small validation studies is ridiculous to me, quite frankly. I strongly recommend the book by Hasok Chang called Inventing Temperature, in which he talks about the possibility of informing theory from fallible measures, which in turn informs the measurement, and so forth, in a nice positive cycle. Thanks to Jessica and Don for the tremendous help on this paper, and thanks to you for paying attention — I don't know if you're in Europe where it's late, or wherever you are, but thanks for joining us today, and I'm looking forward to the other three talks in this symposium.

Thank you, Eiko. I'm checking the Q&A; there are no open questions at this point, but I do have a question, if you're open to it — more of a general question. I was naive: I thought that if there's one psychological construct that we know how to measure, it is depression, right? I think this is a general thing that you might hear a lot. And it's not the case, apparently; it's abysmal, or, as you said, a complete disaster. My question is a bit more general: are there constructs that do better than depression in this sense?

So — I gave a longer talk on this this morning. My view is that mental disorders might not be real categories in nature. That doesn't mean they don't exist; I just mean that the categorization is somewhat arbitrary. And I think the goal in clinical psychology and psychiatry is to come up with phenotypes that are maximally clinically useful. I think specific spider phobia is maximally clinically useful, because if you tell me, "Esther, you have spider phobia," I know what symptoms you have, I most likely know the etiology, and I know that ten sessions of CBT or behavioral therapy are going to help a lot. But if you say "you have depression," I know next to nothing about you: I don't know what symptoms you have, I don't know your etiology, I don't know what treatment to recommend. So I think specific phobias are much easier to measure — much crisper phenotypes with more information behind them.

That makes sense. Thanks.
Eiko, we also have a question in the Q&A, and we have time, so I'm going to ask it. You've shown that the different scales tap different aspects of depression, possibly; but empirically, do the major correlates of depression replicate across scales?

Yeah, that's a super interesting question — Andrew, thanks for asking it. The empirical data on this are very scarce, because most folks have not administered two, three, four measures in their studies. And that makes sense: it's a lot of burden for patients to fill out all these questionnaires, or for clinicians to do multiple interviews, so of course, for feasibility reasons, we haven't done much measurement research in this area. There are some pharmaceutical companies that have started running three or four depression scales in their clinical trials — I think often not with the best intentions — and in those studies we can see that scales respond quite differently to certain treatments. That makes sense, because some scales are quite heavy on somatic symptoms, for example, and antidepressants often cause somatic complaints in patients. So as a pharmaceutical company you don't want to use a scale with lots of somatic items, because sleep problems increase, lack of libido increases, and so forth. I've been trying to get this funded, Andrew, but it's really difficult. What we would need is a large-scale study where, say, 100,000 people fill out five or six depression questionnaires plus a large number of other constructs — gender, impairment, personality, neuroticism, and so forth — so we can see how stable, from a nomological net perspective, the relations are between depression as measured by different scales and other constructs that we know to be somewhat related to depression. My wild guess is that there would be considerable variability, but we honestly don't know; there are no data at the moment. Thanks again, Eiko.

I'm going to move on to the next talk, which is mine. Let me share my screen real quick. There you go. Okay. Like I said, my name is Esther Maassen, and today I will be presenting a project that I've been working on with my supervisors and with a colleague of mine, Damiano D'Urso — and I'm giving him a shout-out right away, because he and I have spent countless hours assessing many articles in psychology on their use of and reporting of measurement invariance, and this study is just as much his as it is mine. Our study is a descriptive study, and before I delve into it, I first want to give a short introduction on what measurement invariance testing entails and why it is important.

Say we have a researcher, and that researcher is interested in compliance with COVID-19 guidelines. Compliance with COVID-19 guidelines is not directly measurable, not directly observable, and as such, to measure it, the researcher decides to use a scale composed of three self-reported items. The three items measure keeping six feet of distance, washing your hands, and using face masks. This is what you see represented in this measurement model, with the factor "compliance with COVID-19 guidelines" on top. The researcher is specifically interested in comparing compliance with COVID-19 guidelines across American and Dutch people, and to do so, they would traditionally administer a survey with these three items in both samples.
We would calculate a mean score or sum score for each subject and then conduct some statistical analysis, like a t test, to see whether the two samples differ significantly on compliance. Now, the issue is: when we add up the item scores to a sum score or mean score, we implicitly assume that the scale measures the same thing in both groups — in other words, that measurement invariance holds. There are a few ways to test for measurement invariance; for the sake of time, I will only discuss measurement invariance in the context of confirmatory factor analysis in this presentation, which also appears to be the most common way that researchers assess it.

Measurement invariance is tested in steps, of which the first is assessing configural invariance. This means that the measurement model is the same across groups, or that the same variables are measured by the same factor. Configural noninvariance occurs when the measurement model is not equal; we see here, for example, that wearing a face mask is indicative of compliance with COVID-19 guidelines for the Dutch, but not for the Americans.

If configural invariance holds, the next step is to test for metric invariance: whether the factor loadings are the same across groups. What you do is constrain each item's factor loading — displayed here by the lambdas — to be the same across groups, and then you compare the fit of the metric invariance model to the fit of the configural invariance model. The whole reason we want these factor loadings to be invariant across groups is that the factor loadings indicate the strength of the relationship between each item and the factor. If there is noninvariance, as you can see here in the third item, the relationship between the score on that item and the factor differs across groups. This can also be represented by two regression lines, as you see here: these regression lines should overlap, but the slopes are different — that is the loading difference.

Now, if all is well and metric invariance holds, we can check whether scalar invariance holds. What we do then is check whether the item intercepts are the same: we keep the factor loadings, still represented by the lambdas, constrained to be equal across groups, but now we also add the intercept parameters — represented here by the nus — and constrain those to be equal across groups as well. Again, we compare the model fit with that of the metric invariance model and assess whether scalar invariance holds. If scalar invariance does not hold — shown here with the third intercept again — it means there is a consistent over- or underestimation of the item for one entire group, regardless of that group's standing on the COVID compliance factor. So, as you can see, there is an intercept difference between the groups, even though the lines should overlap.

To summarize: if we want to compare groups on the COVID-19 compliance scale, we have to be sure that the scale functions equivalently across groups, and for that, measurement invariance must hold. If we want to estimate mean differences between groups on a scale — as you would, for instance, with a t test or an ANOVA — we want to make sure that scalar invariance holds, so that the intercepts are equal.
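To make these steps concrete, here is a minimal sketch of the configural/metric/scalar sequence in R with the lavaan package, which Esther recommends later in the talk. The item and grouping variable names for the COVID-19 compliance example are made up for illustration:

```r
library(lavaan)

# one factor measured by the three self-reported compliance items
model <- 'compliance =~ distance + handwash + facemask'

# configural: same model in both groups, all parameters free
fit_configural <- cfa(model, data = dat, group = "country")

# metric: factor loadings (the lambdas) constrained equal across groups
fit_metric <- cfa(model, data = dat, group = "country",
                  group.equal = "loadings")

# scalar: loadings and item intercepts (the nus) constrained equal
fit_scalar <- cfa(model, data = dat, group = "country",
                  group.equal = c("loadings", "intercepts"))

# compare the increasingly constrained models
lavTestLRT(fit_configural, fit_metric, fit_scalar)
```

If full scalar invariance fails, a noninvariant intercept can be freed by adding, for example, group.partial = "facemask ~ 1" to the scalar model — the partial invariance route discussed next.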
Now, if the hypothesis of measurement invariance is rejected, it does not necessarily mean that you should stop your analysis and give up on comparing the groups altogether. What you can do is use modification indices to identify which item or items function differently across the groups, and allow those parameters to vary across groups. This is possible as long as there is a sufficient number of items and the model does not become too noninvariant.

That was my short introduction on measurement invariance. Our goal with our study was pretty simple: we wanted to find out whether this actually happens. Basically, we were interested in researchers who compare groups using psychological scales: how often do they actually assess their scales for measurement invariance? This was the main research question of our study.

Our study focuses on two things. First, we were interested in studies that made comparisons between groups or over time — groups like control and treatment, male and female, or pre-intervention, at intervention, and follow-up. Second, we were interested in studies that measured psychological constructs with scales. We defined those by including any scale that adhered to the following: it was measured by at least three items; or there was a reference to a validated scale (for example, if depression was measured, we would find a reference to, say, the BDI); or psychometric properties were reported — that could be a mention of a factor analysis or, failing that, some form of internal consistency measure, usually a Cronbach's alpha. Comparisons across groups or over time using such a scale should technically have been checked for measurement invariance.

We only selected articles that had open data. So our study contains a reporting part, but we also have the ability to reproduce the results of studies that did a measurement invariance check, because they shared their data; and if they did not do a measurement invariance check, then maybe we could do it with the data they shared.

First, I will discuss the reporting part of our study. We chose two well-known journals in psychology to sample from: Psychological Science and PLOS ONE. We sampled all articles published in Psychological Science in 2018 and 2019 that shared their data — that came to 213 articles — and to keep things comparable, we also sampled 230 articles from PLOS ONE from 2018 and 2019, all with open data. Many of these articles contained multiple studies or multiple comparisons within a study, so this actually came to more than 1,800 studies or comparisons in total — but not all of them were relevant or eligible for our study. Of those 1,800-something studies, some dropped out because they used non-human data, or because they were simulation studies, meta-analyses, or theoretical papers. Some studies dropped out because they didn't make any group comparison: they used one sample, or they had multiple groups but didn't statistically compare them. And some studies didn't use a scale — that could mean they only used one item, for example, but also that they had a scale but tested the items separately.
And then, finally, some studies dropped out because they did not have a reflective scale by our definition — meaning a scale that combined multiple items into a sum score or mean score but did not adhere to our definition of a psychological scale: no reference to a previous scale, no psychometric properties, and no internal consistency measure given.

We ended up with 50 and 46 articles, for a total of 912 comparisons. We keep our sample split by the two journals, and I'm going to show you simple bar plots of the results — like I said, this is a descriptive study; we have not performed any statistical tests.

Before I show you the main result on measurement invariance, I would like to highlight another quite interesting result. We checked whether authors used an ad hoc scale or an existing scale. We defined existing scales as scales with a reference to a previous study or validation study in which the scale was used; ad hoc scales were scales either completely made up by the researchers or existing scales that had items added, changed, or removed. And, as you can see, quite a large proportion — in Psychological Science, even the vast majority — of the comparisons were done with ad hoc scales. There's nothing necessarily wrong with an ad hoc scale, of course, but if you change a scale in any way, the proper way to go about it is to do a validation study, to check whether the questionnaire — the thing that you use to measure — still measures the same thing it did before. We did not find any evidence of that in our sample.

And now, on to the main results. Again, we have 912 comparisons across groups or time measuring a construct; technically, all 912 of them should have done measurement invariance testing — or, if they could not do it, for example because of a very small sample size, they should at the very least have reported that it was not possible and how that limits the interpretation of the results. Of all 912 times a group or time comparison was made on a reflective scale, measurement invariance was reported on only 41 times, which is almost 5%. That's quite problematic, I would say. Also note that we interpreted this question quite broadly: even if authors only stated that there was measurement invariance, without any additional information, they were still included in this "yes" category here.

We then wanted to investigate that 5% of our sample that did report on measurement invariance a bit further, so we drop all the purple studies here. Taking only the comparisons that reported on measurement invariance, we are left with 34 comparisons for PLOS ONE, across four articles — so, as you can see, there's a lot of dependency: one article accounts for 30 of those 34 comparisons — and seven comparisons across three articles for Psychological Science. Looking deeper into these articles, we find that in PLOS ONE, for two comparisons it is reported that scalar invariance holds, for ten it is partial, and for 22 it is unclear — meaning the authors mention that some form of measurement invariance holds, but we do not know which level. For Psychological Science, five comparisons report metric invariance, one scalar, and one does not report the level.
This result is quite problematic, right? Even if we assume that the partial invariance results also allow for proper mean estimation — partial invariance means that some items are freed and some are constrained — only 13 out of 912 comparisons reported adequate information on measurement invariance and actually found that the appropriate level of invariance holds to make these mean comparisons. Quite damning, I would say: even though group or time comparisons occur often in research, we found that fewer than 5% actually checked or reported measurement invariance; when it is reported, it's often unclear which level was tested, and often the level reached is not sufficient. And finally, poor reporting standards were found overall: we had quite a hard time finding out whether the scale that was used was validated, whether and how many items were added or removed, how many items were in a scale, what the response scale was, the sample sizes per group, et cetera.

Okay, that was the reporting part; now the reproducibility part. Here we tried to reproduce the measurement invariance checks for the studies that did them, but also to perform measurement invariance checks for the studies that technically should have done so but did not. As you can imagine, this takes a lot of time — so much time that we did not finish before this talk. Luckily, before performing this study, we did a pilot study on both the reporting and the reproducibility part, and I'll go through those results quickly now.

In our pilot study, we had three journals: PLOS ONE, Psychological Science, and the Journal of Judgment and Decision Making. We sampled 60 articles, 20 from each, all of which indicated that they shared their data. In those 60 articles, we found 118 comparisons, of which 36 were group or time comparisons with a psychological scale according to our definition. For those 36, we wanted to do a measurement invariance check. All of them indicated open data, but for only 30 comparisons could we actually find and open the data. Of those 30, 25 had interpretable data — meaning the variables had names we could understand and the information was interpretable. And for 15 of those, we could actually run a measurement invariance analysis; for the others, either the sample size was too small or the model did not converge because it was too complex. In 11 of those 15, some level of measurement invariance held, and in six, the level of measurement invariance that held was actually adequate — meaning that you can estimate mean differences between groups and be able to say that the differences you found are not due to measurement differences in the construct.

So our conclusion for this part: even if a study states that the data are shared, the data are not guaranteed to actually be shared, interpretable, or usable; and when the data are usable, the measurement invariance checks often fail to reach the proper level of invariance, so we cannot be certain whether the group differences that were found are actual group differences, or partly measurement artifacts.

Our study has a few limitations, of course. The main one is that our inclusion criteria were quite strict, especially regarding what we classified as a psychological scale.
There were around 400 scales in our study that measured some form of human behavior but were not included as psychological scales here, because they did not adhere to our definition; many of these were behavioral tasks in experiments, for example where reaction time was measured. We also compared only categorical groups, like treatment versus control. We only used open data, meaning that our sample is not very representative — and I think one can argue that researchers who share their data are also more aware that people might analyze that data, and are more careful in their reporting, so the situation for articles that do not share their data could be even worse. And our sample was quite small: we didn't start out small, but as you saw, studies kept dropping out, and in the end we have results across seven articles, which is not a lot. The data are nested, and statistical tests here would be underpowered.

But I want to end on a more positive note, because there are things that can be done. If you're a researcher comparing groups or time points on a psychological construct, it's clear, right? Please do a measurement invariance test. It's super easy to do in R with the lavaan package. And if you can't do it, please report why. Share your data — but also, please share your code; we're very happy with researchers sharing their data, but for reproducibility purposes it is necessary to share code as well. Prefer validated scales, and don't just add or remove items; if you do add or remove items, make sure you do a validation study, check for measurement invariance, check that your construct still means the same thing it did before. For reviewers or consumers of research: please be critical of measurement reporting in articles and check whether enough information is reported; as a reviewer, ask for more information on measurement properties. Personally, I don't think there's any malice in people not reporting measurement invariance checks; I just think this is something many people don't even think of doing. They don't know that it's important or why it's important. I'm now thinking I should add another heading for educators, asking them to educate their students on measurement invariance. Making people aware will help a lot. If I can leave you today with the knowledge that measurement invariance checking is important and that you should do it, then I'm already very happy. Thank you for your attention.

There is a question from Robbie: if participants are randomly assigned to the experimental and control group, is it then necessary to assess measurement invariance? Yes, I think so. Random assignment ensures that the subjects are independent of one another, but it doesn't really bear on the measure itself, I would say. I also see that Jessica is typing an answer to this, so I'm going to leave her to it and move on. I'm going to stop my share and introduce the next speaker: Andrea Stoevenbelt. Andrea is a PhD student and a colleague of mine at Tilburg University in the Netherlands. Andrea is conducting a registered replication report on stereotype threat and will be speaking to us today about that, and about measurement issues. Andrea, please go ahead. Yeah, thank you.
Okay, I have to apologize: the neighbors are having a party, so if you hear some noise in the background, that's them. Today I'm going to add another concern to your research, and I'm not talking about scales but about the context in which we collect our data. I would like to present one of my PhD projects, which deals with bias in experimental research and the effects of time limits on gender differences. I did this together with my two supervisors, Jelte Wicherts and Inga Schwabe. Next slide — yes.

I would like to start with this quote: measurement changes the measure. To me, this means that we should take into account the situation under which we collect our data and ask whether that situation actually affects our measurement instrument. I will argue throughout this talk that yes, it does, and I'm going to motivate that with one example, namely time limits — though I can imagine there are many other factors that may also influence how your scale functions in your sample.

Time limits can be introduced in experimental research for various reasons; I just name two here. First, it is of course a practical choice: we, and our participants, don't have unlimited time, so we have to impose some kind of time limit to make our studies feasible. Second, we can impose them as a design choice. I will focus on the latter, because that's what I've been doing. Both can be valuable options, but they come with caveats: if you impose a time limit, you can create a speeded task.

In psychometrics, we generally define two types of tests. In a power test, we give participants a scale, ask them to complete all the items, and generally give them enough time to answer all of them; they are then scored on the items and judged on their number of correct answers. A speed test, on the other hand, is a test in which participants receive a lot of easy items that they can never finish within the allotted time; they are scored not on how well they do on the items, but on how far they get in the scale. These are, of course, two extremes: a scale will rarely be purely a power test or purely a speed test, but some mix in between. And that doesn't have to be a problem. But it can become a problem when we introduce a time limit and speed starts to become a factor in our experiment, while we still want to judge participants on how well they do on the scale and we're not actually interested in their working speed. In such a situation, we have what is called a speeded test. This affects, for example, reliability estimates and validity estimates; I'd like to focus on the validity point today. Namely: if we add a time limit, we introduce a speed component, which means our scale is no longer unidimensional. For example, if we have a mathematics test and give people enough time to answer all the items, we can assume — thinking about this in a very simplified way — that our test measures only mathematics. But if we introduce a time limit and a speed component, we also measure how far participants get in the scale. And it may be that these two factors are even correlated: people who are better at mathematics may also get further in the scale.
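As a toy illustration of that conflation (an editorial sketch, not part of Andrea's study): simulate examinees whose ability and working speed are correlated, let the time limit determine how many items each person reaches, and the resulting sum score picks up a speed component on top of ability.

```r
set.seed(42)
n <- 2000; n_items <- 30
ability <- rnorm(n)
speed   <- 0.3 * ability + rnorm(n, sd = sqrt(1 - 0.3^2))   # traits correlate ~ .3
# under the time limit, faster examinees reach more of the 30 items
reached <- pmax(1, pmin(n_items, round(20 + 4 * speed)))
# each reached item is answered correctly with a probability driven by ability alone
sum_score <- rbinom(n, size = reached, prob = plogis(ability))
# the sum score now reflects speed over and above ability
summary(lm(scale(sum_score) ~ scale(ability) + scale(speed)))
```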
So we no longer have a one-dimensional scale that just measures our construct of interest, namely mathematics; we have a two-dimensional scale that measures some mix of working speed and ability. And to make matters worse, gender differences also come into play: we not only have a scale that may measure two things instead of one, we may also artificially introduce differences across groups. Gender differences in speededness have been found, for example, in reasoning tasks, cognitive tasks, and university exams — and I'm going to add social psychological experiments to that list, using stereotype threat data to illustrate this later. In general, the takeaway is that we do find gender differences on these tasks when they're administered under a time limit, and that, for reasoning tests for example, the difference between genders is reduced when the time limit is removed.

So what do we do? We have two problems: a speeded test administered under a too-strict time limit, so we know we probably have two dimensions in our data, working speed and ability; and we also know that we may have some kind of gender issue going on. I would argue that we cannot just ignore these issues, fit our usual model, and analyze the data; rather, we should explicitly model this missingness mechanism. And that is what I set out to do. For that, I used a two-dimensional item response theory model proposed by Glas and Pimentel in 2008 — in other words, a model that can capture both latent traits I just talked about, mathematics ability and missingness propensity, as well as the correlation between these two traits, so we can see whether people who are better at mathematics indeed also answer more items within the allotted time. What's nice about these models is that they're very flexible: we can also add covariates. In my study I add gender, but you could add anything you want.

In a picture, it looks like this: we have an observed gender variable and two latent traits, mathematics ability and missingness propensity. Missingness propensity is a bit of a misnomer, because it reflects not how many items you miss but rather your tendency to respond to items — it's phrased the wrong way around, which is sometimes confusing. In general, the higher your missingness propensity, the more likely you are to answer an item. Mathematics ability reflects how good you are at mathematics: the higher your ability, the more likely you are to answer items correctly. And we propose that gender negatively influences both of these latent traits: a negative effect of gender on mathematics ability, based on the literature, where we certainly do observe a gender gap; and a negative effect on missingness propensity, based on the literature — mostly from intelligence research — suggesting that there is a difference in the rate at which men and women answer items. So we developed these models.
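For the technically inclined, here is a rough sketch of how such a two-dimensional model with a gender covariate might be set up in R with the mirt package. This is an editorial illustration under made-up object names, not the code used in the study, and it stands in for Glas and Pimentel's (2008) model rather than reproducing their exact estimation: the scored responses load on an ability dimension, the response indicators load on a propensity dimension, the two dimensions are allowed to correlate, and gender enters as a latent regression covariate.

```r
library(mirt)

# resp: persons x 30 matrix of scored responses (0/1, NA where not reached)
# gender: 0/1 covariate for the same persons
ind <- 1L - is.na(resp)                       # response indicators: 1 = answered
colnames(ind) <- paste0("d", seq_len(ncol(resp)))
dat <- data.frame(resp, ind)

spec <- mirt.model('
  ability    = 1-30
  propensity = 31-60
  COV = ability*propensity
')                                            # estimate the trait correlation

fit <- mirt(dat, spec,
            covdata = data.frame(gender = gender),
            formula = ~ gender)               # latent regression on gender
coef(fit, simplify = TRUE)
```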
We wanted to apply these models to the second large project of my PhD, a registered replication report on stereotype threat, with which we collected quite a lot of cool data. So let me give some background on these data. Stereotype threat is a social psychological construct that also tries to explain gender differences in mathematics. The idea is that these gender differences arise from stereotypes about women's mathematics performance — think of the talking Barbie: women can't do math, let's go shopping instead and not be concerned with anything STEM-related. Because of this, when women are confronted with a mathematics test, the stereotype is activated and their performance suffers. In the lab, that looks something like this: on the left, a stereotype threat condition, where we essentially tell women "girls really suck at math," and on the right, a control condition where we don't. The expectation is that women in the stereotype threat condition underperform compared to men, whereas women in the control condition perform similarly to men.

There's quite a lot of literature on this. But, like I said, we ran a registered replication report on it, in part because there's probably publication bias in this literature, and we tried to figure out whether we can replicate the stereotype threat effect with a large group of people. We were working very hard on this — and then the crisis hit. So currently I have four samples. We're still very happy with the data: we still managed to collect data from about 800 participants, roughly 600 women and 180 men. I'm thankful to all these participants and to my co-authors at the bottom of the slide for all their hard work in making this happen despite the corona crisis.

The crisis wasn't nice to us, but we also weren't very nice to our participants: we gave them a mathematics test with only 20 minutes to complete 30 items. That's less than a minute per question, and I heard from quite a few of them that they did not enjoy having to work under such a time limit. But it also gave us a heavily speeded test. I also want to share the results of the RRR that we have so far; these are Cohen's d values, where negative values mean that people in the stereotype threat condition scored lower than people in the control condition, looking at women only. This is what we have so far, and I hope we can add a lot of labs in the future, when we're allowed to go outside again.

But now, back to measurement. We didn't find an effect of stereotype threat on the performance of women, which is in line with the preregistered studies currently in the literature. So we decided to look for alternative explanations. We came up with the standard ones — a limited sample, maybe p-hacking, maybe publication bias — but we also thought: maybe it's our own experimental setup, because 30 items in 20 minutes is very limited, and maybe this time limit affects our participants' behavior. We found that such strict time limits are very common in stereotype threat research: in a lot of studies, participants have less than one minute per question. This also produces a lot of missing data, and larger amounts of missing data have been found to be related to larger stereotype threat effects.
So basically, the more missing data in a study, the larger the effect size. So I wondered: maybe there's some kind of double mechanism going on — maybe the women in the stereotype threat condition suffer not only from the stereotype threat manipulation, but also from this gender difference under time limits. And then we end up back at the models I discussed at the beginning: models that allow us to capture multiple latent traits — here, working speed and mathematics ability — and also to include covariates. In the picture on the right, I included the stereotype threat manipulation, which has a negative effect on the relationship between gender and mathematics ability and between gender and missingness propensity.

We tried to fit four models, but it turned out to be quite difficult: these item response models are hard to estimate and have a hard time converging. So the idea was very nice, and it's what we preregistered, but the reality turned out a bit different. First, we started with some descriptives, because that always works. Here you can already see that there's a relationship between the number of items people attempt and the number of items they answer correctly — the top of the slide shows the women, the bottom the men. The pink bars are the number of responses to an item; we didn't look at whether responses were correct or incorrect, just whether people answered the item. You can see the pink bars slowly decrease over the length of the test, which makes sense: people only had 20 minutes, so after a while they start dropping out; people answer items at the beginning, but towards the end they don't anymore. But we also see that the blue bars follow the pink bars quite closely: people who do answer items towards the end of the test also answer them correctly. This indicates that people who answer few items towards the end are not randomly guessing; they are still answering according to their ability.

We also see this in our results. In the table — you'll notice model four drops out because it didn't converge — we found a positive correlation between missingness propensity and ability, which means that more able students answered more items. We also found an effect of gender on missingness propensity, but as you can see, the upper bound of the interval is very, very close to zero. So I wouldn't put any money on it: it might be that if you drop a participant, or run the model with slightly different settings or different priors, you get a different result. But we do have an indication that something might be there.

So we have some results: we know there is this correlation between missingness propensity and ability, and there may be an effect of gender. What now? First of all, I would like to see a replication of this; my data are very limited. We know there is no threat effect on the sum scores in these data, and it would be great to apply this modeling approach to a stereotype threat data set that does have an effect in it. To my knowledge, though, there's currently no data set large enough to try this.
The effect may also be stronger in the US — we only used European samples — so it may be that we could find a stereotype threat effect in a different type of sample. And if we try all this and still don't recover it, it may be time to look beyond stereotype threat in social psychology for reasons why there might be a performance gap between men and women.

Going back to measurement, and moving away a bit from stereotype threat: I have two takeaways I would like you to remember from all of this, because I talked a lot about stereotype threat, but this of course applies to any field that has some kind of experimental task with a time limit attached to it. We find indications, in my study but also across the literature, that ability and missingness propensity are correlated, and that we cannot just ignore the missing-data mechanism. Because these two latent traits are correlated, the missing data are no longer ignorable, and we have to do something about that to avoid, for example, biased parameter estimates. In the case of IRT, this may mean underestimating people's scores or drawing wrong conclusions about how your test functions in your sample. So I would argue that measurement does indeed change the measure: time limits may affect the conclusions we draw from research, and even though we often impose them at the beginning of a study without giving them much thought, they may actually impact the conclusions we draw at the end.

I think, in the end, we have two options. You can try this modeling approach, which may be a bit post hoc: you have your data, you know there's a big caveat in them because there's a lot of missing data you can't impute or ignore, and you can use these quite difficult models to still look for the relationship you were originally looking for — which for me was a stereotype threat effect. Or we can take the other approach and account for this at the beginning of the study, while we're still designing it, and wonder — and here I'm going to contradict myself a bit — whether we actually need all the items we plan to administer. I'm not saying you should just drop items from a validated scale — and in my case, the mathematics test was not a validated scale — but maybe I could have used fewer items and still gotten the same results. Or maybe we give participants just a bit more time, so they can answer all the items. That would be my takeaway: please think about the setup of your study beforehand, and if you can't, take the choices you made into account when you draw your conclusions. And — sorry, Esther — I still want to plug my projects: they're openly available. For the RRR, the preregistration, data, code, and manuscript are available online; for my missing-data study, the preregistration, data, and model code are also available. Thank you.

Thank you so much, and sorry for cutting in quickly. I think your final note on the number of items is very interesting — being able to leave out items sounds like it could spark an interesting discussion, but I'm not going to delve into it now. I'm going to go to the question that was asked in the Q&A: on a very principled level, does it make sense to measure ability independently of time? Is ability not always achievement over time?

I have to think about this question a bit — it's a hard one, I would say. Yeah.
There's of course always a bit of a time component to ability, because the way I see it, we can never create a scale that all people will be able to complete within a reasonable time limit. So there are always choices there — that's one thing — and you want to give people enough time, but on the other hand, you also want to keep it practical. So there will always be some of this speed-accuracy trade-off. I find it difficult to answer; I think ability will always be related to your working speed, but we can try to minimize that as much as possible by creating good scales. Thank you.

Okay, I'm going to move on to our final speaker of the symposium. Dr. Jessica Flake is an assistant professor in quantitative psychology at McGill University in Canada. She will be talking today about threats to the validity of replication research — as if you haven't heard enough already. With her team, she investigated studies from the Reproducibility Project: Psychology and Many Labs 2. Jessica, if you're ready, please go for it.

So, hopefully you see screen two, which has a chat box, but I'm going to take the chat box off. Now you see the display version over there, and I'm going to give myself the extra slide. Okay, all good. Yeah — so I broke all the rules and forgot to add the thanks slide, I'm really sorry; thank you to everybody else. I'm excited to talk to you about this. I talk about these issues a lot, usually to quant folks or to social folks, but there are people here who are interested in metascience, so we get to talk about metascience and measurement at the same time. I'm going to talk about measurement practice in the time of the replication crisis. Obviously, we're all living through a different crisis right now, and we've maybe all had a little bit of a break from the replication crisis because we're living in a pandemic, so I do want to acknowledge that.

I don't have a ton of time, but my plan is to provide a little bit of background, then to focus on measurement practices and how they connect with replication research, and to close out with what I hope are heartening next steps on how we might replace large-scale replications with large-scale measurement studies.

So, background: what is construct validity, and how does it relate to methodological reform? We've talked a lot about this already — there are clearly measurement wonks here. I'm going to talk really big-picture; I'm not going to get into the details. Generally, there's this idea in psychology that we're measuring things that are inside people's heads. We don't directly know how to measure them; we don't know how to elicit the behaviors that will give us some indication of what these things inside people's heads are. We do this in a lot of different ways, but what happens is we get some number in our data set, and the goal is that the number has some meaning: if my motivation is higher than somebody else's, my score on that number is also higher. This is what construct validity is about: the meaning of scores tracking the constructs that we wish to measure. There are a lot of different ways to think about measurement and to define construct validity. There are standards for measurement in the North American tradition, and there are also European standards; there are books on construct validity theory and psychometric methods. I'm not going to explain all the details, but I would say the short of it is that we have this challenge with measurement, which is that it's very theoretical.
So, you have to think — you have to know what you want to measure before you can decide how to measure it. And once you decide how to measure it, there's all the statistical modeling you have to do, and it's quite complicated; we've been talking a lot about psychometrics here today. You don't have to know all the details of all the different ways to measure, but I think the takeaway is that there's a lot of work to do to make sure that when you interpret that number, it actually imparts the meaning you think it imparts. So, I'm going to define construct validity as the degree to which evidence and theory support the interpretations of test scores for the proposed uses of tests. There are other definitions you might use, but in general we can think of validity as pertaining to the interpretation and use of a score, and — something that's really important — it is not necessarily a stable property of the instrument. We've all been saying this throughout the session: because of measurement invariance issues, because of the context in which you measure, just because the instrument had some properties before doesn't mean it will have those properties, and give you the interpretation you want, later. So evidence should be gathered in an ongoing way, and there should be both theoretical and quantitative evidence. This is the standard construct validity material you encounter when you learn about research design and measurement, and it's relevant to replication research in a way that I think has been neglected. So, there has been a replication crisis, which spurred a methodological reform movement, and the whole focus of the crisis is whether or not the conclusions of our studies are valid. We say there's some effect — is there really an effect? Is that a valid conclusion? And there have been a lot of discussions about how to make our conclusions more valid, or more replicable, because our statistical practices were bad: people were p-hacking, hypothesizing after results were known, engaging in selective reporting and questionable research practices. Methodological reformers, myself included, swooped in and said: we can solve this problem. We're going to preregister; we're going to reduce analytical flexibility with a new publication model, the registered report; we're going to plan our sample sizes better; we're going to be more transparent; and we're going to conduct a lot of replication research. And that's all great — I'm a big supporter of it. But underneath those conversations is the more foundational, fundamental role of measurement, along with things like research design and theoretical expertise. From what everyone said before, luckily I don't have to say much about it: there are a lot of problems at these foundational levels of the research process. There are a lot of questionable measurement practices; there are a lot of instruments in use without any validity evidence. So if we're going to start replicating research as a way to understand how valid our conclusions are, we have a lot of research that is cracked at the foundation. So, the general background: construct validity evidence is necessary to interpret a study; questionable and bad measurement practices are common; and there has been a replication crisis, which spurred a methodological reform movement.
But the reform movement needs to address measurement practices, because we have a lot of problems going on, and because we need good measurement practices to ensure that our studies are valid and replicable. So let's talk about actual replication research. I'm going to share with you two systematic reviews of replication research that I've conducted: of the Reproducibility Project: Psychology and of Many Labs 2. The first is the Reproducibility Project: Psychology (RPP); perhaps many of you have heard of it. The Center for Open Science, one of the hosts of this conference, attempted to replicate 100 studies taken from prominent journals in psychology. They were direct replications, meant to be as close to the originals as possible — using the same exact materials — and there was a good-faith effort to power the studies and to make sure the materials were there. We reviewed all the original studies and all the replication reports, and we looked for things like: what were the measures, did the measures have a citation to a validity study, were there any psychometric analyses? I'm going to focus on item-based scales, like questionnaires and surveys, because best practices for their validation are straightforward and I know about them. And I will say that this paper is under review — it's a request for revision, but it's a toughy, so I don't know if it'll be published anytime soon. So we looked at all the original studies and all of their instruments. We found 362 instruments, and 53% of those were item-based scales — meaning that a primary variable of interest was measured with a survey or a questionnaire. Many of those surveys and questionnaires were only one item; one-item measurement is common. 44% of the item-based scales had no reference to a source: they appeared to come out of nowhere, or they were ad hoc, created on the fly. 56% had some citation, so you can make the generous assumption that the citation is to a validity study. 25% of the item-based scales had a reliability coefficient; 20% were reported with no information at all. So my appraisal of the situation is that in the original studies, many of the instruments didn't have any validity evidence, and they didn't have basic information about their psychometric properties. So what happens in a replication study when the original research lacks validity evidence? First, it's not standard practice for replicators to report any psychometric information in replication reports: less than half of the replication reports reported even reliability information, and only a few reported any other psychometric analyses. One way to interpret this is that replication studies have even less validity evidence than original studies. And in some of the 100 studies, the replication authors explicitly said there seemed to be a problem with the measurement. Looking across all of the studies, all of the instruments, and this lack of validity evidence, we summarized the situation into four broad challenges that replicators face from a measurement perspective: limited information about measures; no or limited validity evidence for measures; measurement differences between the original and the replication; and a more severe measurement difference, translation. I'll give you a quick example of each from the Reproducibility Project.
For example, in replication study 46, there wasn't a lot of transparency in the original study and the item wordings were unclear, so the replication study ended up using different items. The second challenge is simply a lack of validity evidence: a lot of ad hoc instruments with no validation. This is a case where the foundation of the original study is really shaky — I think of it as potentially replicating garbage. You have instruments with no validity evidence, and now you're spending resources and time to replicate those studies. And if you have instruments without validity evidence, you're going to see measurement differences. For example, in study 92, the reliability in the replication was so low as to be unacceptable. Or in study 41, two items that were significantly and strongly correlated in the original were not even significantly correlated in the replication study. Sometimes you see different factor structures. Then there is translation, which is a more extreme version of a measurement difference. 40 scales were translated in the RPP, and only eight of the translated instruments had undergone some sort of validation process. I can't say how many of those translated instruments were non-invariant or had psychometric differences across groups, but we do know that the Positive and Negative Affect Schedule was used in two of the studies, and there are published studies showing substantial language and cultural differences in how it functions. So there's reason to think that when we do these replication studies, we're going to have measurement non-invariance due to translation and/or cultural issues. So measurement introduces a lot of challenges in the conduct, planning, and interpretation of replication research. Sometimes it makes direct replication impossible, because the original measurement was so opaque you can't figure out what happened. Sometimes the replication study just inherits a bunch of flaws from the original, like a lack of construct validity evidence. Or sometimes the psychometric results really conflict with the original, and you're left with the question: what does it mean in my replication study if I didn't replicate the fact that the items are correlated? What do I do next? In the RPP, it appears that multiple measures produced scores that were invalid or not comparable to the original study. But for many it's unknown. Even though the RPP data are open, they're really limited in sample size, because the studies were powered to detect a simple statistical effect, whereas a psychometric analysis might estimate dozens of parameters. This is something we need to think about when we run replications: if we want to go back and do measurement research, we need much bigger sample sizes than we tend to use when powering a replication study. Turning to Many Labs 2: it's a little different from the RPP because it has bigger sample sizes. The Many Labs projects pick a set of effects and have a bunch of labs run that same slate of studies. So, more samples and bigger samples, but a lot of language heterogeneity: for Many Labs 2, instruments were translated into 16 languages.
And because there are all these different data-collection labs, for some instruments and some groups the data are large enough to evaluate the factor structure that was assumed by the replication study and to evaluate the reliability in the replication sample. This paper is published and was led by my graduate student Ray Shaw, so she deserves all the credit. We did, of course, look at all the measures: how many there were and whether they had any validity evidence. But what we really focused on was which instruments we could actually do a bit of psychometric testing on — testing their assumed factor structure and their reliability. The review of the instruments — what they were like, whether they had any validity or reliability evidence — was very consistent with what others have said here: a lot of ad hoc, on-the-fly instruments, a lot of instruments with limited validity evidence, and a high prevalence of one-item scales. Of all these instruments, eight had more than three items. For those, we looked at whether the replication study assumed a single-factor model, and then we tested whether that single-factor model fit the data across the whole replication sample. This involves confirmatory factor analysis and structural equation modeling, sometimes with categorical outcomes, sometimes continuous ones. You don't have to know much about that, but you do have to trust me: in general, the results of these models are mediocre. Some we might call acceptable, but there are at least three in this study that most of us would agree are really unacceptable. So the idea that these instruments measure a single factor, and that a total score would represent the data and produce a valid score, is not really tenable. When we run a single-factor CFA, we want decent model fit. We can argue about what decent model fit is, but we would all agree that an RMSEA of 0.26 is bad — you want that number to be at least below 0.10 — and you want CFI and TLI close to one; these are not close to one at all. Some are borderline, but that RMSEA is terrible. So we have multiple instruments in this large-scale replication study, for which dozens of labs contributed time and resources, where the instruments themselves don't even fit the unidimensional model that is assumed. The other thing we looked at was reliability. Even though the assumptions for reliability weren't always met, we computed the reliability estimate for all of these instruments. The average reliability is lower than we would like it to be, and reliability does influence the ability to detect an effect, so that's a problem. Labs also vary in how reliable their data are, and translated scales have lower reliability on average. Here is just one of the instruments: these are Cronbach's alphas, with the labs organized on the y-axis, translated instruments on one side and untranslated instruments on the other. The average Cronbach's alpha for this instrument is below 0.5, and what you can see is that translated instruments are more variable in their reliability, and their reliability is lower. So this is a key source of heterogeneity in the instruments — yet these data are often pooled to interpret replicated effects.
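To make the quantities in that discussion concrete, here is a minimal sketch — not the analyses from the paper, with helper names I made up and purely hypothetical chi-square values — of how RMSEA and CFI are computed from chi-square statistics, and how Cronbach's alpha is computed from raw item responses:

```python
import numpy as np

def rmsea(chisq: float, df: int, n: int) -> float:
    """Root mean square error of approximation; below ~.10 is the bare minimum."""
    return float(np.sqrt(max(chisq - df, 0.0) / (df * (n - 1))))

def cfi(chisq: float, df: int, chisq_null: float, df_null: int) -> float:
    """Comparative fit index relative to the independence (null) model; near 1 is good."""
    d_model = max(chisq - df, 0.0)
    d_null = max(chisq_null - df_null, d_model)
    return 1.0 - d_model / d_null if d_null > 0 else 1.0

def cronbach_alpha(items: np.ndarray) -> float:
    """Cronbach's alpha for an (n_respondents, n_items) array of item scores."""
    k = items.shape[1]
    item_vars = items.var(axis=0, ddof=1).sum()
    total_var = items.sum(axis=1).var(ddof=1)
    return k / (k - 1) * (1 - item_vars / total_var)

# Hypothetical chi-square values chosen to reproduce the kind of misfit
# described above: this combination yields an RMSEA of about 0.26.
print(rmsea(chisq=617.0, df=9, n=1000))                        # ~0.26: unacceptable
print(cfi(chisq=617.0, df=9, chisq_null=3000.0, df_null=15))   # ~0.80: far from 1

# Simulated 5-item scale with one common factor; alpha lands around .83.
rng = np.random.default_rng(0)
factor = rng.normal(size=(500, 1))
items = factor + rng.normal(size=(500, 5))
print(cronbach_alpha(items))
```

Running the alpha function once per lab, as in the plot described above, is what would expose the kind of between-lab reliability heterogeneity at issue here.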
Okay, so the takeaways. Large-scale direct replication research is becoming a norm, but replication research is stymied by pervasive QRPs in the original literature — questionable measurement practices. Construct validity evidence in replication research is either entirely missing or not supported by the data. Replication research isn't valid or interpretable under these conditions, and I will take a strong stance here: I don't think replication results are valid or interpretable when we have all these measurement issues going on. It's very sad, but it certainly keeps us busy. I think the good way to think about it is that these replication studies have contributed a lot to our understanding of how to do replication research, and because all the data are open and we're working together on this, we're going to be able to solve the problem. So let's talk about next steps. I think what we can do is conduct large-scale construct validation instead of large-scale replications. As I said earlier, replication research isn't valid or interpretable under these conditions. We can improve the transparency and rigor of original research — that's a long-term goal, but we should get right on it. We could select for replication only studies that have strong validity evidence for their measures; basically, we shouldn't waste our time trying to replicate garbage. That's my view, but there is an important counter-argument: replication can help us identify garbage, and failed replications are a more effective way to correct the literature than commentaries or methodological critiques. Does anybody care when we write a paper saying an instrument was bad? It doesn't seem like it. But when there's a massive failed replication, people listen. So let's compromise and say: why don't we conduct replication studies differently, and add a psychometric and construct validation aspect to the replication process? I'm working on this with the Psychological Science Accelerator (PSA), which is a distributed laboratory network that runs large-scale, collaborative studies, often replication studies. Anybody can become a member, so go to the website and join. We'll run a really big study and then pool the data to estimate effects, or we'll do replications where we can estimate effects on a more global scale. The other thing I've been working on is using a PSA study as a test case to develop a construct validation pipeline that we can adopt when we run large-scale replication studies. The pipeline goes something like this (and I am aware of the time): translate the instruments and get qualitative feedback from the translators; do basic item-level descriptive analysis; look at the factor structure and reliability within groups — this could be at the lab level or the language level, and I've been focusing on the language level. Then conduct measurement invariance testing, particularly comparing translated versions against the original version, and determine the level of invariance. Use the qualitative information from the translators to identify partial measurement invariance models and anchor items. And if there's configural invariance — meaning the configuration of items onto factors is the same across groups — use a multiple-group model and alignment analysis.
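A hypothetical skeleton of the per-group portion of such a pipeline might look like the following. This is my sketch under stated assumptions, not the PSA's actual code: the function names, the language column, and the simulated data are all invented for illustration.

```python
import numpy as np
import pandas as pd

def cronbach_alpha(items: pd.DataFrame) -> float:
    """Cronbach's alpha from a respondents-by-items DataFrame."""
    k = items.shape[1]
    return k / (k - 1) * (1 - items.var(ddof=1).sum() / items.sum(axis=1).var(ddof=1))

def validity_report(data: pd.DataFrame, item_cols: list,
                    group_col: str = "language") -> pd.DataFrame:
    """One row of basic psychometric evidence per instrument version."""
    rows = []
    for version, d in data.groupby(group_col):
        items = d[item_cols].dropna()
        rows.append({
            "version": version,
            "n": len(items),
            "lowest_item_mean": items.mean().min(),
            "highest_item_mean": items.mean().max(),
            "alpha": cronbach_alpha(items),
            # Invariance testing against the original-language version
            # (multiple-group CFA, alignment) would slot in here.
        })
    return pd.DataFrame(rows)

# Hypothetical usage: simulated Likert responses with no common factor,
# so the alphas printed here will hover near zero by construction.
rng = np.random.default_rng(1)
df = pd.DataFrame(rng.integers(1, 6, size=(300, 4)), columns=["i1", "i2", "i3", "i4"])
df["language"] = rng.choice(["en", "de", "tr"], size=300)
print(validity_report(df, item_cols=["i1", "i2", "i3", "i4"]))
```

The design point of the report step is exactly what comes next in the talk: attaching this per-version evidence to the open data so that anyone reusing it can see it.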
And importantly, since open data is something we always have as a result of these studies, we can produce a validity report for each version that runs through this process, so that when people reuse the data or interpret the replications, they have all the psychometric and validity information they need. The goal is to produce a psychometric pipeline that we can all use, so that we can improve the evidence we get from large-scale replication studies. So my takeaways are that measurement practices in the original literature are questionable and bad, and this is really problematic for replication research. Large-scale replication research is going to require new methods and new research practices — I can't even figure out how to do some of the things we want to do. Right now it seems like replication studies have even less validity evidence than original studies. The operationalization of constructs and measures is a key issue for methodological reform, and it's part of how we can understand the replicability of findings. It would be worthwhile for future large-scale studies to focus more on the constructs, and less on replicating specific instances of experiments. That's what we're working on at the PSA. I'm going to shut it down there because I think I'm over time. Thank you, Jessica. I really like the idea of a psychometric pipeline — very appealing. There is one question in the Q&A, about translated scales: do you have any ideas why translated scales are less reliable? Could it be that the scale was not translated in consultation with members of the community? So, translation is handled differently across these studies. There are backward, forward, and review translation techniques used in some of them, and in Many Labs you could think of that process as better than in the RPP — a little more rigorous. But I think translated scales are less reliable because translated items are poorly understood by the people responding to them. Even with really rigorous translation methods, items can still not quite translate for the people who are reading them. We have seen in the PSA, with really rigorous forward and backward translation and item review, that when we go and ask a native speaker to just read it again and give us feedback, they will identify problematic items. And even if the translation makes sense, it might not be psychometrically equivalent to the original item, which may have carried some cultural meaning. So I think that's what's going on. Actually, my more negative take is that it's not possible to translate all instruments, because some constructs don't exist in the target culture the same way they exist in the language and culture where the instrument was created. We want to measure happiness over here in Canada, and then we go and try to measure it in other cultures, and it's just not even the same thing — so it doesn't matter how we translate the items. But I do think that for constructs that have a global or more broadly cultural relevance, it's possible to make translated, validated items that will be interpreted equivalently. That takes a lot more work than just having a couple of people translate it, though. This is what PISA and TIMSS do, right? They spend a lot of time making sure the items are equivalent, and it's a mixed-methods process. Should I look at Wes's question? Okay. Hey, out there. So: have you considered the role of fitting propensity, as advanced by your colleague Carl, within this work?
Some models will yield good fit in original studies and replications because those models have an inherent tendency to fit well to any possible data. I don't know the fitting propensity work well, so I'm not actually sure what you're referring to here — maybe I'm not familiar with these models. Bifactor models tend to fit really well on most data, for example; there are just SEM models that overfit the data. I think that's the idea. Yeah — I mean, if that's happening, it's a little different from none of the models fitting at all, not even a little, which seems to be the modal outcome of these analyses. Models that fit really well for some reason — because they're very complex, or because they have an inherent tendency to fit any data — that doesn't appear to represent the bulk of what's going on here. It could be a downstream issue we could think about, though. And of course, whether or not a factor model fits doesn't necessarily mean that the scores are valid, that the construct makes sense, or that people interpreted the items the way you think, so there's definitely a lot of room for issues there. But I would say that's the advanced stuff. Right now we're working on very low-level things: has this instrument ever been validated, has anyone ever even thought about it, did they publish all of their items and some reasonable background information? Maybe the whole field should think about it this way: all the advanced stuff we work on almost seems irrelevant, given the state of reporting. So maybe we can advance to that later. Thanks, Wes. Thank you so much, Jessica. I want to thank all of our speakers, the organizers, and the attendees here. If we missed your question, or if you want to know more or interact with us, feel free to reach out via other channels — I think we are all very easily findable on the internet. There's also the Slack and the Remo available to you, of course. There are no more questions here, and we're also running out of time, so I would like to close this session and thank everyone again. We'll see you around, maybe later during a session, and let's thank Esther for organizing us. Thanks for having us here. Thank you, thank you. Thank you, Esther.