Okay, hello, good morning. Thanks to Nasser for the nice introduction, so I don't have to say much. We will now tell you about our simulation study, which tries to find a good way to compensate for the various missing-data patterns we find in biosocial research. This is becoming important because many British, European and worldwide studies have already included biomarker data or are planning to do so in the very near future. Data from biological samples is becoming accessible in ever larger numbers, which allows for statistical analysis.

For our study we are using the English Longitudinal Study of Ageing, ELSA, which in its first wave had more than 11,000 men and women over the age of 50. The sample was originally drawn from the Health Survey for England. In addition to the main questionnaire, which happens in every wave, waves 2, 4 and 6 included health examinations with a nurse visit and blood samples, and in wave 6 we also have hair samples. So this biomarker data is available repeatedly for the same individuals.

Missing information, and Natalie already touched on this, can arise at different stages; for ELSA there are three main ones. It can happen in the main interview: people simply do not participate, for instance in wave 2, the first wave in which biomarker data were collected, so there is already attrition from wave 1 to wave 2. Then there is the burden of asking for consent to a nurse visit for the health examination, and people may also be ineligible for a nurse visit because of poor health. Finally, people can drop out at the last step when they do not consent, or are not eligible, to actually provide a blood or hair sample. We are dealing with three kinds of missing data: missing completely at random (MCAR), missing at random (MAR) and missing not at random (MNAR).
To visualise the difference between them, suppose we analyse the relationship of an explanatory variable X with an outcome variable Y, and there is a missingness pattern. If that pattern is related to neither Y nor X, the data are missing completely at random: the missingness has nothing to do with either component of the relationship we are trying to observe. If the missingness pattern is related to some of the explanatory variables, this is called missing at random. It is still independent of the outcome variable, so if we take all the explanatory variables on which the missingness depends into account, by putting them into our model, we largely adjust for the bias that would otherwise be introduced. It becomes really problematic when the missingness pattern is also related to the outcome variable. That case is much more complicated, and merely including explanatory variables in the model will not compensate for the bias produced by the missingness.

This is not a new problem; what is new is applying it to biomarker research. Missing data has occurred for as long as there has been survey data, and several methods exist to deal with it. The trouble is that there is some concern and confusion about how effective those methods are, which unfortunately often leads either to ignoring missing-data bias completely, or to simply assuming an MCAR or MAR situation and applying no further compensation. This study therefore first tests and evaluates several analytical approaches to compensating for missing data, using a simulation study based on real data from ELSA wave 2, the first wave in which we have biomarker data.
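To make the three mechanisms concrete, here is a minimal, self-contained sketch in Python. Everything in it is invented for illustration (the toy variables x and y, the rates, the thresholds); it is not our actual simulation code, only the logic of MCAR, MAR and MNAR masking:

```python
import random

random.seed(42)

# Toy data: x is the explanatory variable, y the outcome (y depends on x).
n = 1000
xs = [random.gauss(0, 1) for _ in range(n)]
data = [(x, 0.5 * x + random.gauss(0, 1)) for x in xs]

def make_missing(data, mechanism, rate=0.3):
    """Return y-values with None where the observation is 'missing'.

    MCAR: missingness is pure chance, unrelated to x and y.
    MAR:  the missingness probability depends on the observed x only.
    MNAR: the missingness probability depends on y itself.
    """
    out = []
    for x, y in data:
        if mechanism == "MCAR":
            p = rate
        elif mechanism == "MAR":
            p = rate * 2 if x > 0 else rate / 2   # driven by x
        else:  # "MNAR"
            p = rate * 2 if y > 0 else rate / 2   # driven by y itself
        out.append(None if random.random() < p else y)
    return out

means = {}
for mech in ("MCAR", "MAR", "MNAR"):
    observed = [y for y in make_missing(data, mech) if y is not None]
    means[mech] = sum(observed) / len(observed)
    print(mech, "observed mean of y:", round(means[mech], 3))
```

Under MCAR the observed mean of y stays close to the full-sample mean, while under MNAR it is systematically pulled away, which is exactly the bias that conditioning on x alone cannot remove.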
We compare five analytical methods for compensating for missing data, and we test those five methods on the three missingness patterns: missing completely at random, at random and not at random. This is done in six fairly simple steps. We are only interested in how effective the methods are against the missingness patterns, but to run the study we need a substantive model. So, step one: choose the substantive model. Step two: go to ELSA, choose the variables for that model, and expand the sample a bit, so we have a nice big sample to run the simulation on. Step three: run the baseline model on the full data set to obtain the baseline values of our true model. Step four: from the full data set, create data sets with missing information in the three categories, again completely at random, at random and not at random. Step five: test our five analytical approaches on those data sets with missing data. Step six: compare the results with the true model from step three, to see how close each approach gets to the truth.

First, the substantive model. We analyse the relationship between socioeconomic status and the level of a biomarker. For socioeconomic status we look at education and occupation as well as wealth quintiles. For the biomarker we chose CRP, C-reactive protein in the blood, which is usually a good indicator of chronic inflammation. Going through the literature on exactly this kind of substantive model, we chose quite a variety of confounding variables, shown below, to isolate the effect of socioeconomic status on the biomarker. Then we go to ELSA and look at which data we have: 8,780 core members in wave 2, which is our base level.
The situation is that only about 70% of those have eligible blood samples, and over 30% have none in this data set. This can be broken down further by the stage at which, and the reason why, we have no blood sample: people refuse the nurse visit or refuse to have blood taken, no blood is taken because they are not eligible, or something else happens to the actual sample, so we still end up with no information. We end up with 5,899 eligible blood samples, which we boost by bootstrapping to a sample of 10,000 individuals. On these 10,000 we run our baseline model, a regression on the logged CRP values, which gives us the baseline coefficients of SES on CRP.

It looks roughly like this. For education, where the reference group is high education, we see an upward gradient: the lower the education, the higher the CRP value. For occupation, an ordinal classification, we see a bit more variety. And for the wealth quintiles we see a very nice pattern: the higher the wealth quintile people are in, the lower the CRP value, and most quintiles differ significantly from the reference category of the lowest quintile. This is what we define as the true model, and all the outcomes of our simulations will be compared against it to see how close the different approaches can get.

Now we create missing data, which was always fun. Completely at random means we created missingness that has nothing to do with our outcome, CRP, and nothing to do with any of the explanatory variables.
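The bootstrap expansion mentioned above, from the 5,899 complete cases up to 10,000, is just resampling with replacement. A toy sketch (the record structure and values are invented; only the resampling idea is from the talk):

```python
import random

random.seed(1)

# Hypothetical complete-case sample: 5,899 respondents with a valid
# (logged) CRP value. The dicts here just stand in for real ELSA records.
complete_cases = [{"id": i, "log_crp": random.gauss(0.8, 0.5)}
                  for i in range(5899)]

# Bootstrap: draw with replacement until we reach the target size.
target_n = 10_000
boosted = random.choices(complete_cases, k=target_n)

print(len(boosted))  # 10000
```

The baseline ("true") model is then fitted on the boosted sample.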
For the MAR situation, we created 16 strata based on age, sex, wealth and health, each used as a dichotomous variable, and those 16 strata have different amounts of missingness according to what we observed in the real ELSA data. To create not-at-random missingness, we used those 16 strata and added a further split: people above a benchmark CRP level have higher rates of missingness than those below it. That creates the connection between the missingness pattern and the outcome variable, CRP. For each mechanism we created one hundred data sets with missing data.

These are the five approaches we test. The first is rather simple: complete case. We run the models only on the individuals who have all the information available, including valid CRP values, which, as we have already seen, is slightly less than 70% of our sample. The second approach is inverse propensity weighting. Before the substantive model, we run a probit model that uses most of the explanatory variables, plus some extra variables, to predict whether people have a valid CRP value or not; this is converted into an inverse propensity weight, which is then used in the substantive model to adjust for the missingness. Third, we use a selection model based on the inverse Mills ratio. And then we have two forms of multiple imputation. Multiple imputation in general uses the pattern and distribution of the explanatory variables across all the data we have to impute values where data are missing. This is usually done several times, hence "multiple" imputation, and the substantive model is then run on each of those imputed data sets and the results combined. Finally, our most sophisticated version combines the Mills ratio with multiple imputation.
This is done in three steps. First, we run a multiple imputation on the explanatory variables only, to fill their gaps, but not on CRP, which keeps all its missingness. We then use the completed set of explanatory variables to calculate a Mills ratio, which is used in the imputation stage of a second multiple imputation to fill the gaps in the outcome variable, CRP. A rather complicated procedure, and a lot of computation. So, in the end, we have our five approaches, from complete case to the combination of Mills ratio and multiple imputation, and our three kinds of missingness.

Our expectations looked like this. We thought that if there is no bias from the missingness, the MCAR situation, then not adjusting for anything is probably the best solution, because anything else would introduce patterns of missingness that are simply not there; hence the green field. And we thought that the more complicated and sophisticated the compensation method, the more bias would actually be introduced into the data set, so for MCAR we expected very bad results from the more elaborate approaches. For the MAR and MNAR situations we expected exactly the opposite: not compensating at all, just using complete cases, would not remove any of the existing bias, whereas using the compensation methods would improve the model compared to using none. That is what we expected, and we will come back to how well our expectations line up with the results. As you can see, we have three socioeconomic status variables (education, occupation, wealth), five approaches and three missingness patterns; if I showed you all the results, we would still be here this afternoon.
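Both selection-model variants above hinge on the inverse Mills ratio, lambda(z) = phi(z) / Phi(z), evaluated at the linear index predicted by the probit selection equation. A small standard-library sketch (the function names are ours):

```python
import math

def std_normal_pdf(z):
    """phi(z): density of the standard normal distribution."""
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)

def std_normal_cdf(z):
    """Phi(z): standard normal CDF, built from the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def inverse_mills_ratio(z):
    """lambda(z) = phi(z) / Phi(z): the truncation correction that is
    appended as an extra regressor in the second-stage model."""
    return std_normal_pdf(z) / std_normal_cdf(z)

# At z = 0, i.e. a 50/50 chance of selection, lambda = phi(0) / 0.5.
print(round(inverse_mills_ratio(0.0), 4))  # 0.7979
```

Adding this lambda as a regressor in the substantive model is the classic Heckman-style correction for selection; in our combined approach it instead feeds into the imputation model for CRP.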
So I am sticking to the wealth quintiles, because they showed such a nice gradient. Here we see the results for MCAR, MAR and MNAR, which I have put in the same graph so we can compare them easily. These are box plots of the 100 simulations, the 100 coefficients for each wealth quintile, so you can see their distributions. The dotted line is always the true-model coefficient, as our reference. In most cases the inner 50% of the box plot lines up nicely with the true model, and the whiskers are not too wide, so the distributions are reasonably tight. The results for MCAR are very close to the true values even over 100 simulations, which is what we expected. What we did not expect is that complete case also works rather nicely for MAR, and not too badly for MNAR either. For the fourth and highest quintiles under MNAR it moves slightly away from the reference, so it fits worse than under the other missingness patterns, but it still does not do too badly. So complete case as an approach does not perform badly at all.

For inverse propensity weighting we see a very similar picture: it compensates nicely, or at least does not introduce much bias, in the MCAR situation, still produces coefficients rather close to the reference for MAR, and is not much worse for MNAR. We see something similar again for the Mills ratio, which does rather well; the box plots move slightly away under MCAR, but not too much, and the MAR and MNAR situations do not look very different from the other approaches. We would definitely have expected larger differences.
Then we have multiple imputation, of which we had very high hopes for removing most of the bias under MAR and MNAR. It produces rather nice results, but again not very different from the other approaches, so there does not seem to be much advantage in using multiple imputation over the others. And then came our brilliant idea of combining multiple imputation with the Mills ratio. You can see that it actually introduces quite a lot of bias, especially in the third quintile, where the inner 50% is nowhere near our reference from the true model, and this happens in all three missingness patterns. The worst case is MCAR, which is expected, because there was no bias to begin with.

Compared with what we expected to happen, the results look very different. First, our approach of combining multiple imputation and the Mills ratio performed badly regardless of the missingness pattern: worst under MCAR, but still quite badly under the other patterns, so it is definitely not the solution for compensating for missingness. What we also did not see was the model getting much worse under MAR and MNAR with no compensation whatsoever, the complete-case approach, which was rather surprising. In total, we can unfortunately only say at the moment that there is no very clear pattern: the clear patterns we expected simply did not appear. The approach combining multiple imputation and the Mills ratio did not have the effect we hoped for and introduces bias under all three missingness patterns, so we do not recommend using it at all. We also saw that the MCAR and MAR situations gave rather robust results regardless of which approach we used: the distributions of the hundred simulated coefficients were rather close to each other.
And the bias already present under the MNAR missingness pattern could not be compensated by any of the approaches we tested, so we still have to work on that, and we will. To improve this paper we have two more ideas. One is to extend the number of simulation iterations: for a simulation study, one hundred runs is not a very big number. Simulations are usually run far more often, but it takes us a lot of computing time because the intermediate steps are complicated; still, we are thinking of running another hundred just to see whether the situation changes. We are also thinking of simplifying our substantive model, because we are only interested in the effectiveness of the analytical approaches against the missingness patterns. Using a much simpler model with fewer confounding covariates, or isolating the effects of our three SES factors, which at the moment all sit in the same model, might let us see clearer patterns. So we are still considering changing a couple of these things and hope the results look a bit different after that.

After that, we leave this simulation study behind and want to test how much the effectiveness of the approaches we have now seen working well, or not so well, changes when we change the proportion of missingness. At the moment we run the simulation with missingness levels very similar to those in the ELSA data set, but what if, instead of 30% missingness, we bump it up to 60%? Do the approaches that produced rather nice results still do so, or does it all depend on how many missing values your data set actually has? That is one paper. And the last one: we want to extend this work to a longitudinal approach, which means including the biomarker data from waves 4 and 6 as well.
And we will run some growth curve modelling to see what happens with missingness over time. That is all from me. Thank you for listening; if you have any questions, we are very happy to hear them. Thank you.