This is joint work with Joe Sakshaug, and this is a slightly different talk from what we saw before. Before, we saw some presentations about how to compensate for missing data. What I'm more interested in is understanding the process that leads to the collection of the data, and which aspects might influence it. That's important because only if we understand the process can we correct for missing data. I'm looking at one specific thing, nurse effects, and I'll explain what I mean by nurse effects and why they're important.

If you're here, you're probably already convinced that biosocial data is really exciting, so I'm just going to mention briefly why this is important and kind of new. Social data has some advantages. First of all, typically, or we like to think, it's representative, which means we can make inferences about the entire population. It's also rich in context. If you're a social scientist, you appreciate that we measure things like values and attitudes, and we know those are important for how people behave. So we have very rich context, including, for example, what their parents did when they were children, which might influence health outcomes, and lots of information from the social world. We also have unobservables, and these are the social science kind of unobservables, like attitudes and values, that we can't really measure with anything else; surveys are really good at measuring those.

On the biological side, first of all, we like to believe the data are objective. That's an advantage often cited for biological data versus self-reports, for example. They're also rich in detail: we have lots of indicators that tell us in great detail about different stress levels, and that could be very useful for looking at health outcomes. And we have other types of unobservables, like genetic data, that could be used in different ways. So there's
a sweet spot in between, where we have a combination of the two. I think that's why there's such a big push to invest in collecting biological data in representative surveys. We have surveys like Understanding Society, the English Longitudinal Study of Ageing, the cohort studies, the Health and Retirement Study; there are lots of studies that are surveys but have also started to collect biological data, and the ideal is to bring together the advantages of each.

Like I said, what's exciting is that we can answer new questions that we have never been able to answer before. For example, if you have a longitudinal study, you could see how people change over time, you could find out the employment status of their parents when they were children, and then you could look at how those influence their biology, their stress hormones for example. We can also do exciting things like life-course analysis in which we use both social and biological measures. For example, ELSA has multiple waves of biological data collection, so we could look at change over time in, for example, stress indicators. This is really exciting, and again, I don't think we were able to answer these kinds of questions with any other type of data.

With new data always come new difficulties. One of the new things we have to figure out is how to collect this new data. Surveys have traditionally been done very differently: we have interviewers going to homes, or maybe phoning, to collect data. But now we need to collect things like blood pressure, or actually take blood, so the question is how you do that in a representative sample. There are different models: you could ask people to go to a clinic, you could send interviewers, or you could send nurses. So there are different ways to do the same thing, and we need to figure out which is best.

Second, how do we analyze it correctly? We could assume that if it's all in one database, everything is
measured in the same way, but that's not really true. As some of my colleagues already mentioned, there are different stages at which data go missing when you collect biological data, and that's different from what happens with the survey, both because the person who collects the data is different and because what we ask people to do is different. So we might have different missingness mechanisms, which are really important.

What I'm going to talk about in this short talk is the new actor in data collection. I say new actor because I'm a survey methodologist, so I talk from the perspective of people who collect data, and until now we didn't have to deal with nurses. Well, you could, but privately, I don't know, when you need to go to the hospital; we didn't really need to train them, work with them and convince them to do a good job. So this is really new, and I'm going to talk about how this new actor actually influences data collection.

The nurses have to do a couple of things. First, they have to convince a participant to take part. Usually there's a main survey, and then a few months later the nurse goes and collects biological data. There are different models: sometimes respondents are contacted by telephone and the nurse just goes there and does her job, but sometimes she actually has to convince people to participate. Again, this is very new for them. Most nurses sit in an office and people are desperate to come to them and get their help. It's not usually their job to go to people's homes and convince them to give blood; that's a very different thing from what they were initially trained to do. So again, survey methodologists need to train them to do this new task that they never had to do before.

After they convince people to participate, what typically happens is that they collect some biomarkers, for example lung capacity, blood pressure, height, weight and things like that. And then
comes the most difficult part, where they collect blood. The first step is to get consent for blood. There's this big form that respondents have to sign in which they say, okay, it's okay to take the blood, and it's okay for you to keep it for a long time and do whatever you want with it. So there are actually different things they have to consent to. After that there's another stage where the nurses actually have to collect the blood. Even if people consented, they might not give blood because, for example, they might be obese, or it might be hard to find the vein and collect the blood.

In all three of these stages the nurses are very important, and what I'm interested in is whether the nurses actually influence the non-response pattern. Are some nurses better at convincing people to participate? Are some nurses better at obtaining consent? If that's true, then they are influencing the non-response patterns, and we need to take that into account when we correct for non-response later. It also gives us insight into whether we need to change, for example, the training of our nurses.

To give you an example of how the nurses are kind of new and maybe not used to doing surveys: recently Joe and I were trying to look at paradata, which is data collected in the process of a survey. We were looking at the paradata of nurses, which I don't think anybody had looked at before, and basically the quality of the paradata was horrible. It wasn't useful: most of it was missing, and there were really strange things happening. The reason is probably that they're not used to doing that and they don't think it's important, and that's very different from an interviewer, who is trained to do this. So that's one thing: they have less interviewer training and less experience. Also, the task is very intrusive, although I know there are economists who would say that giving income is more intrusive than
giving blood. But still, giving blood is quite intrusive, so it's a very different task from the normal tasks that we ask interviewers to do. These are the reasons why I think the nurses are important. Again, this is an empirical question; I don't know of anybody who has looked at it before. So the question is: do they matter? If they don't matter, then it's fine and we can move on. If they matter, we need to do something about it.

This is an example of the process of data collection, from Understanding Society. You have wave 1, where we have some respondents in green and some non-respondents in red. Then we have wave 2, and again we have some more non-respondents. In wave 2 a sub-sample was selected for the nurse visit, and then we have these three stages: the nurse visit, the consent to blood, and the actual blood collection. We see that with each stage we lose some people, and in all of those stages the nurses are involved. So my research question is: do they differentially influence these stages or not?

How do we look at this? First of all, I'm going to separate the three stages, nurse visit, blood consent and blood collection, and the reason I do this is that the missingness mechanisms might be different. For the nurse visit, where you have to convince somebody to participate, you might have non-contact, so it might be related to whether people are working or not, or the kind of jobs they have. On the other hand, once you have the nurse visit you need to get consent, and that again is a different process, because it might be related to whether people trust the state, or the nurse, or the agency. So it might be related to things other than the mechanisms that lead to missingness for the nurse visit. Finally, the blood collection might also have different mechanisms. For example, it might be because, once they try to collect it, they
can't really get the data because the respondent is obese, or maybe just because the nurse is not very good at collecting such data. So we have three stages.

Statistically, to do this we need to do a couple of things. First of all, nurses are not randomly distributed around the country. For example, maybe the better nurses are around London and the nurses in a different region are less good. To avoid bias from that, we need to separate the area effect from the nurse effect. Ideally, one way to do this would be to randomize nurses all over the country, but that's very expensive and will never happen. So one statistical way to do it is to try to separate them using a multilevel model. We use a cross-classified multilevel model in which we estimate effects for areas, which are LSOAs, and for nurses, and we also control for characteristics of the respondents and some characteristics of the nurses. That's how we go about it statistically.

The data I'm going to use come from Understanding Society. First we use Understanding Society wave 2: we have around 25,000 people who were eligible for the nurse visit, and out of those around 10,000 gave blood. So our question is what happened to these 15,000, and whether the nurses influenced the process of non-response. Then we look at wave 3 of Understanding Society, where they have a new wave of the British Household Panel Survey (BHPS). This is an older version of Understanding Society that started in '92, if I'm not mistaken. There we have 9,000 people eligible, and around 3,000 gave blood. So these are the two datasets that we're looking at. Data collection in both was done by NatCen, with more or less the same nurses. In principle we might expect differences because we have different people: in BHPS we have people who are older and also very compliant. They've been in the survey for 20 waves, so they're really nice respondents, while in Understanding Society we don't have that; we have more new
respondents who might drop out soon.

First, I have the three stages here: nurse visit in blue, blood consent in red, and actually giving blood in green. On the left I have some characteristics of the respondent. This is for Understanding Society. The scale here is odds ratios, so if a value is bigger than one, there are higher chances of participating. For example, women have higher chances of participating in the nurse visit compared to men, but lower chances of actually giving blood compared to men. This is interesting because it shows that the missingness mechanisms might be different, and it's one of the reasons why we might want to separate the stages.

Then we have some things that we expect. We expect older people to participate more than younger people, which is something we find often. We find, for example, that having a partner is important for having a nurse visit but maybe not for the other two stages, and so on. So there are different patterns. We find that if you're in London it's harder to do the nurse visit, as you would expect, and also that if you have a long-standing illness you're more likely to participate. So these are the respondent characteristics. We could use information like this in multiple imputation or weighting to compensate for missing data, and again, I think one insight is that we have these different missingness patterns and we should maybe take them into account when we compensate for missing data. This is the same thing for BHPS; most of the patterns are quite similar, so I won't go into details.

The more interesting part is how the nurses influence data quality, and there are two parts to this. One is the amount of variation: how much of the variation in non-response is explained by the nurse. The second is whether it is systematic: for example, are more experienced nurses better at this than
non-experienced nurses? That's also interesting because we can use it both in the correction and in designing data collection, so we could offer more training to those nurses who have less experience.

So what do we find? Again, here at the bottom we have the three different stages, and we decompose the variation in non-response into three parts: the nurse part, the area part, and the unexplained part. What we're interested in is this red part. The assumption would be that the nurses have no influence; if that's true, this should be zero, so they would explain none of the variation in non-response at each stage. We see that that's not really true. They influence non-response for the nurse visit, they also do that for consent, and the same for giving blood. For Understanding Society we see that the biggest effect is actually for getting the blood, which kind of makes sense because that's where their skills are important. If you have somebody who, let's say, is obese, then the skills of the nurse are really important in actually collecting the data. For BHPS the effect is similar at all the stages. So we see that around 10% of the variation in non-response comes from the nurse. One way to think about it is as a cluster effect in complex sample design, so that's also one way in which you could correct for this, or you could use something like a multilevel model in your analysis to correct for nurse effects.

The next step was to look at the nurse characteristics. Unfortunately we don't have many nurse characteristics. We have gender, which is not very useful because 98% are female, and then we have three others. Actually, I'll go to that in one moment; first I want to show you these graphs. These are predicted probabilities from the multilevel model, and the way to read them is that each blue point is a nurse, with a probability of getting interviews and a confidence interval. We expected all
of them to be here, but actually what we find is that some are better than expected and some are less good than expected. This is Understanding Society and BHPS, and this is for the first stage. We can get the same for consent to blood, and the same for whether they actually collect the blood or not. This is quite interesting because we could use this information in different ways. First of all, when we design or collect data, we could identify, for example, this group of nurses and offer more training, or understand why they're underperforming. We could also look at why some nurses are better than others and take that into account. And we can include these indicators in a different model: if we want to correct for non-response, we could make an indicator saying this is a better nurse than average, or this is a worse nurse than average, and put that in our models for non-response. I think this is interesting information that we can get just after data collection.

Next, the nurse characteristics. Here again I have just two stages, blood consent and actually getting the blood, for Understanding Society and BHPS. For female, not a lot is happening here. For experience, again nothing significant. The only significant finding is for white British: if the nurse is white British, they have lower chances of collecting the blood, in both Understanding Society and BHPS. But that's the only significant effect we found with nurse characteristics.

So, a few conclusions. First, we found low to medium effects of nurses, so, like I said, an ICC of around 10%, and that might influence standard errors. So I would argue, as a survey methodologist, that we should take that into account, at least as a sensitivity analysis. And we should do that because we have it in the data: before publishing, you could just control for it and see if you get different results. We saw that the biggest impact, at least in Understanding Society, is
actually on the blood collection. So again, we might think about why that is the case and maybe offer more training, especially for cases that might be difficult, to improve this in the future. The nurse characteristics explain only a small part of the variation: between four and ten percent of the nurse variation. So clearly there are other things happening that we don't know about, and I think it would be interesting to, I don't know, do a survey of nurses or get more information to understand what is really causing these differences. Again, I think that would be useful both for correction and for improving the training or the data collection.

So what does this mean? Should we stop collecting blood in surveys? The answer would be no. Nurses are clearly doing an important job. Like I said, the data is really valuable and we can do new things with it, but we can probably also improve the way they do their job, either during data collection or by training our users to use that information to correct for missing data, and by convincing people like, I don't know, Georgia, to just include it in their models for non-response. The data is there, it's basically free, and we should use it. I think that was it from me. Thank you so much.
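
To make the roughly 10% nurse share mentioned in the talk concrete, here is a minimal sketch of how variance shares can be computed in a cross-classified model with nurse and area effects. The talk does not say how the variance was partitioned. Because the results are reported as odds ratios, the sketch assumes a logistic model and the common latent-variable approach, which fixes the respondent-level residual variance at π²/3. The function name `variance_shares` and the variance values are hypothetical, chosen only for illustration; they are not estimates from the study.

```python
import math

# Assumption: latent-variable approach for a logistic multilevel model,
# where the respondent-level residual variance is fixed at pi^2 / 3.
LOGISTIC_RESIDUAL_VARIANCE = math.pi ** 2 / 3


def variance_shares(nurse_var: float, area_var: float) -> dict:
    """Split total variance into nurse, area and unexplained shares.

    nurse_var and area_var are the random-intercept variances (logit scale)
    of the two crossed classifications.
    """
    if nurse_var < 0 or area_var < 0:
        raise ValueError("variance components must be non-negative")
    total = nurse_var + area_var + LOGISTIC_RESIDUAL_VARIANCE
    return {
        "nurse": nurse_var / total,
        "area": area_var / total,
        "unexplained": LOGISTIC_RESIDUAL_VARIANCE / total,
    }


# Hypothetical variance components, NOT estimates from the talk.
shares = variance_shares(nurse_var=0.40, area_var=0.15)
for part, share in shares.items():
    print(f"{part}: {share:.1%}")
```

With these made-up inputs the nurse share comes out at about a tenth of the total, which is the order of magnitude described above. If nurses had no influence, the nurse variance, and so the nurse share, would be zero.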