Just to recap before I start: we saw there are different types of measurement error. The main one is random error, which means people are not consistent. We ask them the same thing and the true score is the same (say, their attitude towards immigrants), but somehow their answers vary: sometimes slightly more positive, sometimes slightly less positive. That is what we call random error. Then we talked about systematic error, or "validity", a term Daniel and I don't really like. That is when, for example, I ask you using one method, say a 10-point scale, and you tend to always answer higher, more positively, than when I use a 5-point scale. That is systematic, and it comes purely from the way I ask, from the response scale. So that is how we usually conceptualize the method in survey methodology: a systematic effect due to the response scale. For Tobias that is quite different: what they call the method is actually the rater, while the instrument used to collect the data stays the same. So for us, you can think of the different ways of asking a question as analogous to different raters; they are different methods of collecting the data. And we don't use interchangeable methods the way you do; all our methods are fixed, so that is one difference. Another nice similarity with what Tobias was talking about is that this work, which is joint with Daniel Oberski, also goes in the direction of designing the questions, manipulating them, in order to estimate measurement error. So it is really design-based, and the model is built on top of the design, of the experiment.

OK, so we have some concepts we want to measure. For example, in this cartoon: how was your pain today?
And when I say "method effects" in survey methodology, I mean that if I used a different scale instead of the smiley faces, I would get different results. But as you look at this image, you realize that this is maybe not the main issue with asking this question. There might be other reasons why you would not give your true score. For example, there might be some kind of peer pressure to answer in a particular way. That is what we call social desirability in survey methodology: we want to present ourselves in a positive way, so we don't tell the truth. A nice example is asking people how many sexual partners they have had: men tend to overestimate and women tend to underestimate, and you would think the true score should be somewhere in between. That would be systematic error due to social desirability.

We also have other types of measurement error. For example, there is something called acquiescence, or yes-saying. Say you don't really want to think about the questions anymore, so you just agree with whatever you are asked. Again, that is not very good measurement: it is unrelated to what I ask, and it is systematic, because you always agree, but it is not what we want to measure.

Now, the methods we have seen so far, like the multitrait-multimethod (MTMM) design, make an important assumption: that the only systematic bias in our answers comes from the method. But we know that is not true. There is a huge body of research in survey methodology showing that social desirability is an issue and that acquiescence is an issue.
There are other things as well, like extreme response styles and so on. So the question is: are these methods correct if they assume all of that is absent, when we have this huge literature saying it is present?

OK, so that is what we are trying to do: get rid of two main assumptions of the existing methods for dealing with measurement error. The first is that they look at one type of measurement error at a time. The second is that they often look only at means or only at variances. MTMM is quite good because it can do both, but test-retest, for example, only looks at random variation, and other designs that are common in survey methodology, like split-ballot designs, can only look at mean differences; they cannot tell you how much of the variation is systematically biased. So our aim is to figure out how to move beyond the existing methods and bypass these assumptions.

We are going to develop something on top of the MTMM in which, instead of assuming there are only two sources of correlated variation, we try to estimate multiple sources of correlated, systematic variation. And one issue with a lot of multitrait-multimethod studies, which Melanie didn't really have time to mention, is that the order of the forms is not randomized, and we saw that memory effects might be important. One way to get rid of that, since we have two measures from the same individuals, is to randomize the order of the questions. That removes the assumption of non-random presentation that the MTMM makes.

So what I am going to do is walk you through an example of how we developed this new method, which we call the multitrait-multierror (MTME) model. First of all, I want to thank Understanding Society, because they collected this data for us for free. As you will see, we have 56 experimental groups; I think they regretted agreeing to this once they started programming the experiment. But we are really
grateful, because, as you will see, we can do things with measurement error that I don't think anybody has done before. What we have is one wave of the Understanding Society Innovation Panel. It is a representative household sample of around 2,000 people, used mainly for methodological research: a lot of the things that go into the main Understanding Society survey are first tested in the Innovation Panel.

In that wave we have our MTME experiment, which I will explain in a minute. We are trying to measure attitudes towards immigrants, and we chose this topic because it is very difficult to measure. We know attitudes are biased by really small things, like the way you ask, and we know it can be a sensitive topic, so social desirability might be a problem. So we deliberately went for about the most difficult thing you can measure.

Let me explain how we ended up with 56 experimental groups. As Melanie mentioned, we need two measures of each question from the same individual. So people are asked the questions at the beginning of the survey and again at the end, in slightly different ways, with around half an hour of survey in between. There is some limited research saying that 20 minutes is enough to avoid memory effects, but that is one of the assumptions of the method, so keep it in mind. One way to deal with it, as I mentioned, is to randomize the order of the forms.

These are the questions, which are very popular: they appear in the European Social Survey, the EVS and a number of other large surveys. First there are three questions like "The UK should allow more people of the same race or ethnic group as most British people", where the italic part changes to people of a different race or ethnic group, and so on. Then there are three more questions: do you think that immigration in general is good for the UK's economy, for the UK's cultural life, and so on?
So these are the six questions we want to ask, and they measure attitudes towards immigrants. So far we have seen that in order to estimate an MTMM model you have to randomize the method: for example, use a two-point and a five-point scale, see whether they agree, and from that estimate method effects. But then we thought: why randomize just one thing? Why not randomize multiple characteristics of the questions at the same time and estimate multiple types of systematic error?

So, in addition to manipulating the method, which we did by using either a two-point or an eleven-point scale, we manipulated two other things. For social desirability we changed the stem of the question. For example, here the stem is that the UK should allow more people of the same race, and we randomized whether respondents got "allow more" or "allow fewer". The idea is that the stem implies a social norm: we are in effect saying that most people think this is the norm, and respondents who answer in a socially desirable way will tend to agree with the rest of the people. That is quite subtle, so we said: OK, we will see whether it works or not. The other thing we manipulated is whether the response scale runs agree-disagree or disagree-agree. The idea is that with an agree-disagree scale it is easier to agree, so you tend to get more acquiescence.

If we combine these manipulations, we get eight ways of asking the same question. Here is just the first question: "The UK should allow fewer people of the same race". So for the first wording, social desirability pressure is higher, it has a two-point response scale, it runs agree-disagree, and it is negatively worded.
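The combinatorics of the design can be sketched in a few lines; this is a hypothetical enumeration for illustration, not the actual survey programming. Three binary manipulations give 2 × 2 × 2 = 8 wordings, and giving each respondent an ordered pair of two different wordings (one at the start of the survey, one at the end) gives 8 × 7 = 56 experimental groups.

```python
from itertools import product, permutations

# Three binary question characteristics manipulated in the design
scales = ["2-point", "11-point"]                    # method
stems = ["allow more", "allow fewer"]               # social desirability
directions = ["agree-disagree", "disagree-agree"]   # acquiescence

wordings = list(product(scales, stems, directions))
print(len(wordings))   # 8 wordings per question

# Each respondent sees one wording at the start and a *different* one
# at the end; order matters, so count ordered pairs (permutations).
groups = list(permutations(wordings, 2))
print(len(groups))     # 56 experimental groups
```

Using permutations rather than combinations reflects the randomized form order mentioned above: which wording comes first is itself part of the experimental assignment.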
It has "fewer" in it: these respondents were asked the negative question, and the response scale was agree-disagree with two points. The second wording is again two points and agree-disagree, but now positively worded ("should allow more"). Wording three is eleven points, agree-disagree, negative. And so on: these are all the different ways in which we could ask the same question.

So these are the wordings, and each person got one wording at the beginning of the survey and one at the end, with the assignment randomized. If you take all possible combinations, you get 56 experimental groups. That is what we did in our design.

OK, so, do I have one of these graphs? Oh yes, of course I have one of these graphs. I won't go into it in detail, and this is not even the entire model: this is just one question. So we have question one, wording one; question one, wording two; and so on up to wording eight. It is similar to what we saw before. We have a true score, which basically says these answers have to be consistent somehow: they are all the same question about the same topic, so they have to measure the same thing. Then we say that, for example, these two and these two are measured with an eleven-point scale, so they have something in common because they share the same response scale; that is what we call the method effect. Social desirability and acquiescence work the same way: their loadings depend on whether a wording is, for example, positively or negatively worded, and, for acquiescence, on whether the scale is agree-disagree or disagree-agree. I won't go into the details; if you want to discuss the model I am very happy to, but I have a feeling you might have had enough of these graphs by the end of this session.

OK, so now on to the results. We can estimate both mean bias and variance bias.
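The measurement model just sketched can be mimicked with a toy data-generating simulation: one observed answer is the sum of a trait component, the three systematic error components, and random error. All variances here are invented round numbers for illustration, not the model's actual estimates.

```python
import random

random.seed(1)

# Invented variance components for one observed item (sum to 1.0);
# these are NOT the estimates from the actual MTME model.
VAR = {"trait": 0.60, "method": 0.10, "sd": 0.05, "acq": 0.05, "noise": 0.20}
N = 100_000

def draw(var):
    """One draw from a zero-mean normal with the given variance."""
    return random.gauss(0.0, var ** 0.5)

# Observed answer = trait + method + social desirability + acquiescence + error.
ys = [sum(draw(VAR[k]) for k in VAR) for _ in range(N)]

mean_y = sum(ys) / N
var_y = sum((y - mean_y) ** 2 for y in ys) / N
print(round(var_y, 2))  # close to 1.0, the sum of the component variances
```

Because the components are independent here, the total variance is just the sum of the component variances, which is what makes the decomposition in the next slides possible.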
Being able to estimate both is one advantage of this method. The way to read these numbers is: if I give you an agree-disagree scale instead of a disagree-agree scale, the mean of the responses shifts by 0.25 standard deviations, purely because of the scale direction. The same for social desirability: if I change one word, saying "fewer" instead of "more", the mean of the observed item shifts by 0.18 standard deviations. And the same for method: with an 11-point scale people tend to be less extreme than with a two-point scale, by 0.47 standard deviations on average. These effects can go in different directions depending on how the question is asked, but at one extreme they could all go the same way, and you could have a shift of close to one standard deviation just because of the way you asked, which is slightly concerning for measurement. There you go.

The next thing we did, which is kind of exciting, is to decompose the variation. We have the total variance of a question, and we want to know how much of it is what we actually want to measure (what we were calling the true score, or trait, before) and how much is these other types of error. Averaged over all the questions and all the different ways of estimating, the proportion of valid variation we get is around 60%, and all the rest is bias. The biggest part is random error, around 20%, which is similar to what we have found in other studies in survey methodology; the rest is the systematic errors. Again, we were never able to do something like this before: until now we only had method, and we assumed the other errors were not there. The nice thing is that now we know how important method is compared to the others. And it also gives us an idea.
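The arithmetic behind the worst case mentioned above is just the sum of the three reported mean shifts; the variance shares below are illustrative round numbers in the ballpark of the averages quoted in the talk (about 60% valid, 20% random error), not the actual estimates.

```python
# Mean biases reported in the talk, in standard deviations of the item:
shifts = {"acquiescence": 0.25, "social desirability": 0.18, "method": 0.47}

# Worst case: all three shifts happen to point the same way.
worst = sum(shifts.values())
print(f"worst-case mean shift: {worst:.2f} SD")  # prints 0.90 SD

# Illustrative variance decomposition (round numbers, not estimates):
variance = {"trait": 0.60, "random error": 0.20, "method": 0.08,
            "social desirability": 0.06, "acquiescence": 0.06}
for name, share in variance.items():
    print(f"{name}: {share:.0%} of total variance")
```

The same decomposition can then be computed per question or per wording, which is what the next comparison does.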
What are the best ways to ask a particular question? For example, I have six questions; if I want to find out which is the best one, I can do the same decomposition for each question separately. We see, for example, that the worst question is the first one, where more than 50% of the variance is biased variation, and that the last three questions are actually better than the first three. That would be an indication that maybe we should just use the last three questions, and maybe do something about the first three.

Another way to look at this: we had eight different ways of asking each question, so I can do the same decomposition by wording, which gives me an idea of the best way to ask these questions. Here we see that the main difference is between wordings one, two, five and six and the others, and what separates them is that the first group are two-point scales and the others are eleven-point scales. So our finding is that the two-point scales are better than the eleven-point scales. One thing that is hidden here is that the two-point scales have less variation. So that is the hidden trade-off: a two-point scale gives less variation but is more reliable; an eleven-point scale gives more variation but more bias.

And then we can look at combinations and ask whether some wordings are better for some questions than for others. I won't go into details, but this is an example of what you can do.

OK, so the next thing I will do is chat a little about the longitudinal aspect. I am not even going to try to show you the graph for that model, but what I did is basically fit the same model in three waves. It is really nice, because the Innovation Panel is a panel.
So we have multiple waves for the same people, and we can actually look at what happens longitudinally. What I did is fit the same model in all three waves and see how the systematic error changes over time; again, I don't think this has really been done before. So let's look at some of the results. On the left we have the means of the different types of systematic error (acquiescence, method and social desirability), and on the right we have the variances; the variances are what we use in the variance decomposition, and the means are what we estimated earlier as mean bias. It turns out that none of the differences between waves are systematic: there is no systematic change in any of these, which means they are stable over time. So that is one of our findings: none of these goes significantly up or down over time. That somewhat contradicts panel conditioning, which says that people get better at answering questions over time, so we would expect maybe some of the systematic biases to go down.
That is not really what we found. Another way to look at this: we have the questions, with the decomposition by wave underneath, and we see that the sizes of the systematic errors are similar across waves, but the sizes of the random errors tend to go down. So this, on the other hand, supports the panel conditioning hypothesis: people become more reliable over time; the more you ask the same question, the more reliable they are. And then you can do more complicated things: you can look, as I said, at combinations of wording and question by wave, and see whether there are different patterns for different combinations.

OK, so this is the conclusion. Daniel and I proposed this method, called the multitrait-multierror model, and we showed that it is possible to estimate multiple types of measurement error at the same time. And this is just one example of the design you can use: what we are saying is that if you think there is a type of systematic measurement error in your question, manipulate it and try to estimate it. It does not have to be this complicated; it can be simpler, as long as you have a hypothesis about the measurement error you expect. Then we saw that you need this kind of within-person experimental design: the same people have to answer multiple times, and you assume there are no memory effects. One way to make that more plausible is to shift the order of the forms, so there is no systematic bias.

As for the substantive conclusions, we saw that random error is the largest source of non-trait variance. That is kind of encouraging, because a lot of the methods we have for measurement error only estimate random error.
So if we can estimate and correct for random error, we are already halfway there. We saw that the correlated, systematic errors have an impact on both the means and the variances, and that they can be quite big: up to a one standard deviation shift in the mean, and up to 50% of the variance, which is concerning for the social sciences, I would say. Then, although I didn't really talk about this, we saw that the method, the scale you use, influences both the reliability and the amount of social desirability: we actually found more social desirability with an 11-point scale than with a two-point scale, but I didn't go into depth on that. And the final conclusion was that systematic errors do not change over time, although random errors seem to decrease, so there is some partial support for the panel conditioning hypothesis.

So, did we solve all the problems in the world using these super complicated models? Probably not. But I think we do get some insight about the questions, about the best ways to ask them, and about the trade-offs: we can see how big the method effect is compared to the other errors. There are disadvantages too: this model is quite hard to program and estimate, so it will be hard to convince other people to use it. But again, we argue that you can do a smaller version of this, something like the MTMM or a little more complicated than the MTMM, in order to really understand your data and the measurement error. OK, that's it. Thank you.