The second part of this talk is about some of the data quality issues researchers should keep in mind when they're analyzing biosocial data. I'll be talking about two kinds of issues in particular. One is the mode or collection condition: whether it is important, and whether it should be taken into account when analyzing biomarkers. I'll also be talking about some of the quality control (QC) processes that social science researchers should at least look at when they're working with biological data sets.
Biosocial studies tend to use two kinds of methods to collect biomarker data. The gold standard is clinic collection, because it's a very standardized process. Participants are invited to come to the clinic, and the blood samples are collected, processed and stored immediately. Because that's a very controlled environment, the biomarkers obtained are thought to be of a very high quality. That is how a number of studies in the UK have collected their biomarkers: the Avon Longitudinal Study, for example, the 1946 birth cohort study when the participants were age 63, the Hertfordshire Cohort Study and the Whitehall II civil servants study.
The other way of collecting blood-based biomarkers is when participants are visited at home and their blood samples are collected and then posted to a laboratory. That's the case with Understanding Society, the Southampton Women's Study, the 1946 birth cohort study when the participants were age 53, the Health Survey for England, and the English Longitudinal Study of Ageing.
Now, when people are invited into a clinic, as I said, it's a very standardized environment. The temperature is standardized, for example, and the collection conditions are standardized. The blood samples are usually drawn through venepuncture.
So they draw blood from a person's veins, and that blood sample is immediately processed and then either stored in a freezer or the blood analytes are measured then and there. In contrast, in population surveys there's a delay in the processing and storage of that blood sample.
Here we've got some examples of the different conditions in which blood samples are taken and stored. In the top left-hand corner we've got a nurse in a clinic setting and, as I said, that is a very controlled environment. The blood that the nurse draws is immediately processed and stored to a very high standard. In the top right-hand corner we've got a nurse taking blood pressure, but she could be taking blood as well, from a participant at home. There the environmental factors are much more variable. For example, the room temperature could be highly variable from one home visit to the next. The time of day is particularly important as well. In a home visit the nurse may not have complete control over what the participant did just prior to the visit. The person could, for example, have smoked or had a lot of food, which could influence their biomarker levels. In a clinic visit there is, to some extent, some control over the respondent's activities just before the samples are taken.
The blood samples taken by the nurse in a home visit are usually stored in particular ways and can be sent by post, for example in a Jiffy bag. The environmental conditions these blood samples are exposed to in the post are, as you might imagine, quite different from a clinic setting, where the blood samples are stored immediately in freezers.
In the bottom right-hand part of the slide you see some pictures of other conditions that affect both home visits and clinic blood sample collections: whether it's a weekday or a weekend, and what month of the year it is. All of these factors do tend to affect some of the blood-based biomarkers, so it's important to keep these considerations in mind when analyzing biosocial data.
In the top graph I've shown you the distribution of the times at which nurses visit people at home in Understanding Society, and we see that it's a bimodal distribution. Nurses tend to visit people just after 10 o'clock in the morning or, if people are working, around 6 or 7 p.m. So you can imagine that if these are the times at which blood samples are taken, and time of day has an effect on particular analytes, then we really need to consider what time people's biological samples were taken.
In the bottom picture we've got the diurnal distribution of the stress hormone cortisol. In general people have a very marked diurnal pattern: as people get up their cortisol levels shoot up, and then over the rest of the day the cortisol levels come down. So you can imagine that a nurse visiting somebody at home and collecting cortisol data in the morning is likely to record very different levels of cortisol compared to a nurse visiting somebody else later in the evening.
I'd also like to talk about some of the quality control issues that researchers should be aware of when looking at biomarker data, and these are largely down to the labs that process the blood-based biomarkers.
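Before turning to quality control, the time-of-day point can be made concrete with a small sketch. It groups biomarker values by a coarse band of collection time and compares the group means, which is one simple way to see whether the time of collection matters for an analyte. The function names, the cut-points for the bands and the numbers are all assumptions made for illustration; they are not taken from Understanding Society or from real cortisol data.

```python
from collections import defaultdict
from statistics import mean


def time_band(hour):
    """Assign a collection time (hours since midnight) to a coarse band.

    The cut-points at 12 and 17 are illustrative assumptions.
    """
    if hour < 12:
        return "morning"
    if hour < 17:
        return "afternoon"
    return "evening"


def mean_by_band(records):
    """Average a biomarker within each time band.

    records: iterable of (collection_hour, biomarker_value) pairs.
    """
    groups = defaultdict(list)
    for hour, value in records:
        groups[time_band(hour)].append(value)
    return {band: mean(values) for band, values in groups.items()}


# Made-up values, purely to show the mechanics (not real measurements).
toy_records = [(10.25, 400.0), (10.5, 380.0), (18.5, 150.0), (19.0, 130.0)]
summary = mean_by_band(toy_records)
```

In a regression model the same information would more usually enter as a covariate (the time band, or the collection hour itself) rather than as a separate table of means.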
They're divided into internal and external quality control processes. Before that, though, some biomarkers can have impossible values. Take height and weight, for example: if you have somebody in your data set who is 10 meters tall or 1,000 kilograms in weight, you can imagine it would be hard to anonymize that person, because they would be a completely unique individual. So you can pretty much rule such values out as impossible. It's good to look at the distribution of your biomarkers to see whether there are any impossible values and to treat them as outliers.
Independently of that, when the blood-based biomarkers are processed in a laboratory, the laboratory goes through the procedures that derive the blood-based analytes and then repeats them on another day, and hopefully there's a very strong correlation between the analyte values it gets on one day and on another. That's called the intra-assay coefficient of variation. Ideally we want there to be a very small amount of variation, so less than 5% is within acceptable limits. That is comparing how one particular biomarker within a lab compares to the same biomarker when it's processed on another day.
The external quality control measures compare how the lab does in relation to other labs in processing the same analyte, and that's measured through the standard deviation index. It's a measure of total error in analyzing a particular biomarker, in comparison with the whole range of labs that have analyzed that analyte. Once again we're looking for low values of the standard deviation index: we want the analyte measured by this particular lab to be close to the overall levels measured by all the labs. So a score below one on the standard deviation index is generally very good.
And I'd also like to talk about specific biomarkers.
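The two laboratory measures just described can be written down in a few lines. This is a minimal sketch using the usual textbook definitions: the coefficient of variation as the standard deviation of repeat measurements divided by their mean, expressed as a percentage, and the standard deviation index as the lab's result minus the peer-group mean, divided by the peer-group standard deviation. The function names and the example numbers are assumptions for illustration; a real lab report may define or summarize these differently.

```python
from statistics import mean, stdev


def coefficient_of_variation(repeats):
    """CV in percent: sample SD of repeat measurements divided by their mean."""
    return stdev(repeats) / mean(repeats) * 100


def standard_deviation_index(lab_value, peer_mean, peer_sd):
    """SDI: how many peer-group SDs the lab's result lies from the peer mean."""
    return (lab_value - peer_mean) / peer_sd


# Made-up repeat measurements of one control sample (illustrative only).
repeats = [4.9, 5.0, 5.1]
cv = coefficient_of_variation(repeats)

# Made-up peer-group comparison: this lab reports 5.2, peers average 5.0, SD 0.4.
sdi = standard_deviation_index(5.2, 5.0, 0.4)

# Thresholds mentioned in the talk: CV under 5%, SDI below 1 in absolute size.
cv_acceptable = cv < 5
sdi_acceptable = abs(sdi) < 1
```

The sign of the SDI says whether the lab reads high or low relative to its peers, which is why the check uses the absolute value.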
So far I've been talking about biomarkers in general, but when we get to specific blood-based analytes we should keep in mind that each one of them has a different meaning and a different significance. This slide and the next look at one particular measure called C-reactive protein, CRP for short, which is a measure of systemic inflammation. Usually somebody who has CRP values between 3 and 10 milligrams per liter is taken to have systemic inflammation. However, somebody could have CRP values greater than 10 milligrams per liter, which denotes current or recent infection. So the meaning of high levels of CRP is completely different when it's over 10 compared to when it's between 3 and 10. When it's between 3 and 10 it's considered a measure of cardiovascular risk. When it's over 10 it's a measure of infection. So very often, when people are analyzing CRP in relation to cardiovascular risk, they delete values greater than 10, because they're not interested in whether or not somebody has been recently infected. CRP is also very strongly influenced by people's medication: anti-inflammatory medications, statins, contraception and hormone replacement therapy.
In this slide I show the distribution of CRP for men and women, and one thing you'll notice immediately is that the distribution is highly skewed. So when we're analyzing CRP as a dependent variable in a regression model, for example, you might want to think of ways to normalize this distribution in order to make the assumptions underlying regression models more plausible.
So, to sum up the data quality issues to keep in mind when analyzing biological data sets: we need to consider the normal ranges of the biological variables, if they are available. We need to be able to identify outliers and do something about them.
We need, hopefully, to identify whether the respondent has taken any relevant medication, and either control for it or perhaps exclude people on particular medications from the analysis if that is not central to the research question. We need to consider statistical transformations for highly skewed biological dependent variables if we're looking at them in a regression modeling context. We definitely need to keep in mind the context of the blood sampling, like the time of day, the room temperature, whether or not somebody had a recent operation, and whether the person had recently smoked or had food or alcohol. And we also need to keep in mind the laboratory-based quality control processes involved in producing the biological data. Is it a good lab that has produced these biomarkers?
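As an illustration of how a couple of these points might look in practice for CRP, here is a minimal sketch. It labels each value using the 3 and 10 mg/L thresholds mentioned earlier, drops missing, non-positive and over-10 values, and takes the natural log of what remains. The log is only one possible transformation for a skewed variable and is an assumption here, as are the function names and the example values; the talk does not prescribe a specific transformation.

```python
import math


def crp_category(value_mg_per_l):
    """Label a CRP value using the thresholds mentioned in the talk."""
    if value_mg_per_l > 10:
        return "possible current or recent infection"
    if value_mg_per_l >= 3:
        return "systemic inflammation"
    return "below 3 mg/L"


def prepare_crp(values_mg_per_l, infection_cutoff=10.0):
    """Drop unusable values, then log-transform the rest.

    Removes missing (None) and non-positive values, and values above the
    infection cutoff, then returns natural logs of what is left.
    """
    kept = [
        v for v in values_mg_per_l
        if v is not None and 0 < v <= infection_cutoff
    ]
    return [math.log(v) for v in kept]


# Made-up CRP values in mg/L (illustrative only, not survey data).
raw_crp = [1.0, 4.5, 12.0, None, 0.0]
log_crp = prepare_crp(raw_crp)
labels = [crp_category(v) for v in raw_crp if v is not None]
```

Whether to exclude people with values over 10, or people on particular medications, depends on the research question; an alternative is to keep them and adjust for these factors in the model.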