Welcome back everyone! We'll start the second session. Our next speaker is Mr. Josh Bet. Josh is a research associate in the Biostatistics Center at the Johns Hopkins School of Public Health and the director of the Analyst Corps in the Cochlear Center for Hearing and Public Health. Today, Josh will be discussing how the analytic approach to data from nationally representative surveys such as NHANES may differ from the approaches taught in standard statistics courses. Josh?

Thank you very much, Jungjoo, and thank you for the invitation and kind introduction. And thank you to the previous panelists for setting up this talk. So I'm going to be talking broadly about analyzing data from complex surveys, using NHANES as a particular example. First, what I'm going to do is review the statistical models that hopefully you all know and love before we go into what makes these complex survey designs different. We'll talk about what features might be included in the survey design and how they need to be adjusted for in the analysis of the data. We're going to talk about what these survey weights actually mean, and when and how we should use them. For the most part, I'm going to stick to the concepts behind design-based inference and the analysis of survey data, focusing on the what and the why, not the how, because there are a lot of excellent resources I'll point you to for actually carrying out these analyses in SAS, Stata, or R.

So as George Box famously said, remember that all models are wrong; the practical question is how wrong do they have to be to not be useful? Whenever we're going beyond the data at hand and performing statistical inference, there's some statistical model in play. And statistical models are like other types of models that you may have encountered in practice. They're useful fictions where we strip away the complexities of real life in order to get a tractable solution that hopefully still allows us to do some important work. You may have heard that in physics there are ideal gases and point masses and things like that. None of these things actually exists in reality, but that doesn't stop us from using them to solve important problems. There's a joke that an engineer will assume that a horse is a perfect sphere if it makes the math easier. Well, statisticians are the same way: we use statistical models. They may have some silly or unrealistic-sounding assumptions, but in many cases they can still give us useful information. It's important to remember, though, when those assumptions might be violated and give us results that are not useful.

And the power of statistical models is based on the rules of probability. So what do I mean by probability? Here I'm going to be talking about frequency probability, which concerns the relative frequency of an outcome in a large number of experiments. This could be a toy example like taking a coin and tossing it 10 times and counting the number of times that the coin comes up heads. Not very useful scientifically, but this also covers things like taking a random sample and then measuring height, weight, blood pressure, and environmental exposures. So frequency probability is something that we can empirically assess: we can toss a coin or roll a die and assess its frequency properties. And it has its basis in, for example, the physical properties of things. Is this a fair coin? Is this a loaded die? Or the properties of measurements.
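To make that long-run idea concrete, here is a minimal sketch in R; the number of tosses and the fair-coin probability of 0.5 are just illustrative values, not anything from the talk.

    # Frequency probability as a long-run relative frequency:
    # toss a fair coin many times and track the running proportion of heads.
    set.seed(1)
    tosses <- rbinom(10000, size = 1, prob = 0.5)    # 1 = heads, 0 = tails
    running_prop <- cumsum(tosses) / seq_along(tosses)
    running_prop[c(10, 100, 1000, 10000)]            # settles near 0.5 as the number of tosses grows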
And this is something distinct from what's called belief probability, which is based on a degree of belief given the evidence. That is connected to Bayesian statistics, and we won't be talking further about it, but a lot of confusions about statistics come from this frequency probability versus belief probability distinction. The reference at the bottom, the Ian Hacking book, is an excellent book for learning about probability. But if you're ever tempted to make these sorts of confusions, ask yourself: are we talking about the relative frequency across experiments? So if somebody talks about the probability that a team is going to win a football game, they're not imagining multiple replications of a football game and counting how many of those times a particular team won.

So here's an example of a probability model in practice, the binomial model. In a statistical model, we're trying to learn about some feature of a population that we're going to call a parameter. For a binomial model, we're looking at a binary outcome, so the parameter or feature of interest is the proportion, or the probability, or the prevalence, whatever you want to call it. Suppose we know the population prevalence is, let's say, 20%, like this binomial model on the right, and we take a sample of 100 individuals and look at the sample prevalence. Even though the population prevalence is 20%, you may see a sample prevalence as high as 30%, or potentially as low as 10%. So the binomial model tells us the probability of getting a particular sample proportion based on the population proportion.

This is useful whenever we've got a fixed sample size that's set in advance, say collecting data on 100 individuals; the outcomes are binary, let's say heads or tails on a coin flip, where the coin can't land on its side; the outcomes are independent, meaning the outcome of one coin toss doesn't influence the outcome of another coin toss; and they all have the same probability or prevalence in the population. The beauty of these models is that they can be applied whenever these conditions are met. So it could be that we're sampling from a population and looking at the prevalence of a particular phenomenon. You may see one event referred to as a success, but that largely comes from the games-of-chance origin of probability. A success is just the event of interest; it could be a good thing like surviving versus not, or it could be a bad thing like malignant versus benign.

And you can see that as the sample size gets larger and larger, the probability of getting a particular sample prevalence of, let's say, 30% when the true population prevalence is 20% gets smaller and smaller. But usually we're in the world where we have, let's say, 100 observations and we've observed 28 prevalent cases in those 100 observations. And we want to know: did this come from a population where the prevalence is 19%, and maybe it's an overestimate of the population prevalence? Or did it come from a population where the prevalence is 38%, and maybe it's an underestimate of the true population prevalence? Or maybe it actually came from a population where the true prevalence was 28%. So this gets at the idea of maximum likelihood that you've probably encountered in practice.
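As a minimal sketch of that likelihood idea, here is the example in R, using the numbers just mentioned (28 prevalent cases out of 100, with candidate prevalences of 19%, 28%, and 38%):

    # Binomial probability of observing exactly 28 cases out of 100
    # under each candidate population prevalence.
    n <- 100; x <- 28
    candidates <- c(0.19, 0.28, 0.38)
    setNames(dbinom(x, size = n, prob = candidates), candidates)
    # The probability is highest at p = x/n = 0.28, the maximum likelihood estimate.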
So the binomial model tells us that if we've seen a certain number of successes out of a certain number of trials, or a certain number of prevalent cases out of a certain number of individuals, we can get a point estimate, which is our best estimate of the prevalence in the population, as well as its standard error, which tells us how precisely we know it. Is this within a margin of error of 10% or 5% or 1%? We need this because there's inherently going to be sampling variability: we're taking a random sample from a population, and the sample is going to imperfectly reflect that population. So we need to take into account the precision of the estimate itself. The thing is, we typically only get one of those samples from the population. So we use the statistical model of the sampling distribution, like the binomial distribution, to infer from a particular experiment to a particular population. There's an excellent reference here that gives a visual and interactive approach to concepts in probability and statistics.

Confidence intervals are something that's often misunderstood but incredibly useful, because they tell us which values of the population feature, such as the population prevalence, are most consistent with the sample we've observed, taking into account uncertainty due to sampling variability. Over on the right, you see 100 studies, each of which involves sampling 100 individuals from a population where the prevalence is 20%, and the population prevalence is indicated by that vertical line. Because each sample is going to vary from the population proportion, we want to know which values of the population prevalence are consistent with the data we have. And you can see that in the long run of studies, if we were to consider a large number of them, when applied correctly, a 95% confidence interval will correctly cover the population prevalence 95% of the time.

But this is 95% confidence, not 95% probability, because once we've actually calculated that interval, it's no longer random in the frequency sense; remember, we're thinking about sampling repeatedly from a population. Many people find this counterintuitive and not very satisfying, because once we calculate it, either the interval covers the parameter or it doesn't; we just don't know which. To clarify this, think about a lottery ticket. If you go to the store and buy a scratch-off ticket and it advertises that one in a thousand will win, what they mean is that if you repeatedly buy many, many tickets, on average one in a thousand of those will be winners. But it's only random until you actually choose a scratch-off. Once you've bought it, you've either won or lost; you just don't know it yet. So that's the difference between probability and confidence.

So in the binomial statistical model, we're assuming that data are generated from some random process, and by using statistics to infer about the features of that population, or the features that govern that process, we can generalize to wherever that process is in play. But there are many sorts of questions that these models can't give a really satisfying answer to, such as how many people in a particular population or subgroup have a risk factor or disease. These are infinite-population models, and they don't necessarily give us a satisfying answer to those questions.
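Going back to the coverage idea for a moment, here is a small simulation sketch in R that mirrors the figure just described (100 studies, each sampling 100 individuals from a population with 20% prevalence); the seed and the use of an exact binomial interval are choices made only for illustration.

    # Long-run coverage of 95% confidence intervals for a proportion.
    set.seed(2)
    p_true <- 0.20; n <- 100
    covered <- replicate(100, {
      x <- rbinom(1, size = n, prob = p_true)
      ci <- binom.test(x, n)$conf.int          # exact 95% CI for the proportion
      ci[1] <= p_true && p_true <= ci[2]
    })
    mean(covered)                              # close to 0.95 over many studies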
So we need a different framework for official, administrative, or vital statistics, and this is called design-based inference, as opposed to the model-based inference approach you may have been taught. In design-based statistics, instead of sampling from an infinite population, we're sampling from a finite population, such as non-institutionalized individuals in the US. In this setting, we consider our data fixed: everybody in the US has a height, a weight, a BMI, and so on. But the selection, participation, and observation of that data are where randomness comes into play. The goal of design-based inference is to infer about counts, means, and proportions in this finite population. We're not trying to use the US to infer about some other population. In design-based inference, we're trying to learn what the population parameter values, the means, proportions, totals, and so on, would be if we were to observe everybody in the population.

Features of these designs include probability samples, where individuals have a known, non-zero probability of selection. We may divide up a region into different clusters and randomly sample those clusters, which makes it more feasible to carry out these surveys. And then there's stratification, where we divide the population into more homogeneous groups and then subsample those strata. This allows us to do, as others have talked about, oversampling, so we can ensure adequate representation in studies in order to give stable or precise estimates. Importantly, if we ignore that the data were generated in this way, it can result in serious bias in our estimates and our standard errors. So not only can it give us the wrong answer, it often makes us more confident in those wrong answers.

Simple random sampling is usually what's assumed in the world of model-based statistics, where any sample of size n is equally likely. But think about it: if we were to try to do that in the US, we'd be going back and forth, coast to coast, to try to cover a simple random sample. It's just not feasible. That's why we have multi-stage sampling designs, where we sample counties, then segments of counties, then households within those segments, and then individuals within households. And stratification, the reason we do that is that it gives us greater precision of estimates, especially when the variability between the strata is large and the variability within the strata is small, which translates to what might be called a high intra-class correlation coefficient. When strata membership is correlated with the outcome, this can give us a big boost in the precision of our estimates.

So survey weights, this is something that always comes up in discussion: what do these actually mean? I talked about how everybody's got a non-zero selection probability, and that means that each person represents a certain fragment of the population. So for example, if we've got 35 million people in our population and we do a simple random sample of 3,500 individuals, everybody has an equal probability of selection, and that probability of selection is one in 10,000. That means that each individual in the sample represents 10,000 people in the population. So if 400 of those 3,500 individuals have hypertension in our sample, we'd estimate 4 million cases of hypertension in our particular population.
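To make that weighting-up arithmetic concrete, here is a tiny R sketch of the same hypothetical example (35 million people, a simple random sample of 3,500, and 400 observed cases):

    # Weighting a sample count up to the population.
    N <- 35e6; n <- 3500
    prob_selection <- n / N                 # 1 in 10,000 under simple random sampling
    weight <- 1 / prob_selection            # each sampled person represents 10,000 people
    cases_in_sample <- 400
    estimated_total <- cases_in_sample * weight   # 4,000,000 estimated cases
    estimated_prev  <- estimated_total / N        # dividing by the population size gives the prevalence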
So this is called the Horvitz-Thompson estimator of the population total, and we can get the mean or prevalence by dividing by the population size, the sum of the weights. But we only know the probability of selection at the beginning. What if people don't respond, or they respond but refuse parts of the data collection, or let's say there's a loss of data once it's collected? We need to estimate the propensity of responding, and that's not something we can know in advance. And all of the estimates that we make are going to make some assumptions about the nature of the missing data.

So if we want nationally representative estimates, we need to incorporate these weights, because they account for over-sampling and non-response. And weighted estimates are typically going to give us higher standard errors, which translates to lower precision. But there's kind of a tug of war with the effect of stratification: stratification can allow us to gain precision. With cluster sampling, we actually lose a little bit of precision because of spatial correlation. Two people within the same zip code are going to be more similar to each other than two random people selected from the same state, and they're going to be more similar to each other than two random people selected from the entire country. This leads to the design effect, which tells us how precise our estimates are under the current design relative to a simple random sampling design. Design effects are typically greater than one, indicating that if we assume simple random sampling, we're overestimating the precision. And this relates to the concept of the effective sample size: even though we have, let's say, 10,000 individuals in our sample, that may only be the equivalent information of, let's say, 8,000 individuals who were sampled at random.

Andrew Gelman opened a paper with the quote, "Survey weighting is a mess." So what's actually in these weights? Well, it's not just the probability of selection. It also includes non-response adjustments, so the probability that people respond given that they were selected, and then post-stratification factors. Because of non-response and the random nature of sampling, we need to adjust things so that they line up with population totals from, for example, census data. That's what post-stratification is about: making sure things are representative.

So you need to know which weights to use. Read the documentation; it varies by study question and by the study you're looking at. In some studies you may have multi-level data: let's say states, and within states you sample counties, within counties you sample schools, and within schools you sample students. There may be weights at multiple levels of the data, and there may be separate weights for, for example, longitudinal analyses versus cross-sectional analyses within the same survey. And other kinds of weights exist in software. Just because there's a weight argument in your software doesn't mean that it handles survey weights. So read the documentation carefully, because those weights may be frequency weights, which basically tell the software to make multiple copies of an observation, or they may be precision weights, which are related to the variance of the observation, for those who are familiar with weighted least squares. Rates of participation vary within the population, so we need to adjust for non-response.
And the greater the degree of that non-response, the greater the potential for bias, and the more important the assumptions are in our analyses. When talking about missing data, we talk about missing data mechanisms. Missing completely at random means that the missingness is unrelated to both observed and unobserved factors about the individual. This may occur in a random sub-sample: let's say something is too costly or invasive to do in everybody, so we only do it in a sub-sample. But because that sub-sample now needs to represent everybody in the population, we need to give those individuals higher weights to accomplish that. So make sure that you know which weights to use for a particular outcome. Missing at random means the missingness may be related to things that we observe, but not to things that we don't observe. Let's say you're surveying an area and one person has a device failure. That may be related to the area in which you're collecting the data, but because the failure of the device doesn't care about the characteristics of the people or their health status, that might be a missing at random outcome. Then there's not missing at random, which is the sticky situation where maybe people aren't participating or responding because of some characteristic that you don't observe, like, let's say, a health condition they're afraid of disclosing due to stigma.

So when to use weights: you should definitely use them if you want nationally representative estimates of totals and means. But as other people hinted, this is more complicated for regression models, because when you take a model-based approach to survey data, you're assuming that selection is ignorable, that participation is independent of the outcome given what's included in the model. So if you've got in your model all of the features that were included in the sampling design, you might still get estimates that are okay. The problem is that not all of the data going into the weights are publicly available, due to disclosure risk, and there may also be interactions and other features that are modeled in the survey weights. So you have to think carefully about this. The thing is, weighted estimates give us lower precision, that is, higher standard errors, so it may sometimes be advantageous not to use weights. At the very least, you should be comparing the weighted and unweighted estimates for qualitative differences in the estimates and the standard errors; there's an excellent example of this in the literature. But even if you're not using the weights, you still need to account for the geographic features like clustering, because of the correlation; otherwise you're going to underestimate your standard errors and overstate your precision. The intra-class correlation is going to depend on the outcome, and it will have a larger effect when things are more spatially correlated.

So just to sum up, the model-based approach is when you're sampling from an infinite hypothetical population and you want to learn about some data-generating mechanism and generalize to wherever that data-generating process might similarly hold; the randomness comes from that random process. In design-based inference, we're sampling from a fixed, finite population that we want to characterize, and the data exist but are randomly observed. Okay? So just to sum things up, design-based inference is something very different from what you may have encountered.
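To show what those design-based tools look like in practice, here is a rough sketch using the R survey package (Lumley); the data frame dat and the variable names (psu, stratum, wt, hypertension, age) are placeholders for illustration, not actual NHANES variable names.

    # Declare the complex design: clusters (PSUs), strata, and survey weights.
    library(survey)
    des <- svydesign(ids = ~psu, strata = ~stratum, weights = ~wt,
                     data = dat, nest = TRUE)

    # Weighted prevalence with its design effect; a design effect above 1 means
    # less information than a simple random sample of the same size.
    svymean(~hypertension, des, deff = TRUE)

    # Qualitative comparison of unweighted (model-based) and weighted (design-based) fits.
    unweighted <- glm(hypertension ~ age, data = dat, family = binomial())
    weighted   <- svyglm(hypertension ~ age, design = des, family = quasibinomial())
    cbind(unweighted = coef(unweighted), weighted = coef(weighted))   # compare side by side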
It's got different aims, different language, and different tools, so be aware of them. Know the survey and the instruments inside and out. The documentation is very comprehensive and very useful, so read it not just once, but twice, before beginning your analysis. I would advise people to stick to the recommendations from the methodologists; a lot of thought has gone into those, so follow them unless you've got a really good reason not to. Make use of the tutorials generously provided by NCHS and others. And then for all of you who are reviewing papers, I have some homework for you: I'd like you to review those papers very critically and pay attention to the methods, because if they don't mention anything about the complex survey design, chances are they didn't properly account for it. It's not a mistake that those who know better would have made. So hopefully I have kept us to time. I'm happy to take any questions if you have them. Let me leave up my references just for a second.

All right. Are there any questions I can take? So I don't see any questions in the queue. I wanted to open it up to any of the panelists who might have questions.

A common question that I get as a statistician is: what if my software package of choice, or the routine in my statistical package, doesn't handle survey weights? I would, number one, advise collaborating with a statistician. We're generally friendly people, and we are very helpful when it comes to analyzing and interpreting data. We can help point you to the right statistical package. And for complex things like, let's say, structural equation modeling and multiple imputation, if you thought these were complicated in and of themselves, when you bring surveys into the mix they become even more complicated. So if, for example, you're analyzing longitudinal data and you can't find a routine for analyzing longitudinal survey data, I would highly recommend collaborating with a statistician who can advise you on the appropriate software and ways to use that software, because there are a lot of ways in which things can go wrong.

So a couple of questions have come in from participants, which I think get a little more into the nuts and bolts of analysis. One relates to: I'm struggling to find survey procedures to test correlations, particularly Spearman correlations, in SAS. Could you recommend any resources or software? So you'll see in the further reading and resources there are a couple of links. The first is the website that goes with the first book in the references, Heeringa, West, and Berglund. There are excellent examples in all sorts of statistical packages there, so hopefully that should answer your question. I am not a SAS expert per se, but SAS, Stata, and R, for the most part, can all handle the types of analyses you usually do. For things that are more complex, like, let's say, longitudinal data, structural equation modeling, or fancier stuff, you have fewer choices. For that, I would check out the Lumley and Scott reference, which is an excellent overview in terms of software and modeling. And then the UCLA site is highly recommended; they've got a lot of information on not just model-based statistical approaches but design-based statistical approaches, with worked examples as well.

So here's another question, which is scenario-based, although it's both general and refers to a specific scenario.
The question states: I understand the most accurate way to work with survey data is to use the entire data set available, and then specify your subpopulation of interest using the subpop command so as to get correct standard errors. But are there exceptions to this? And then it goes into a specific scenario: for example, if we want to limit to people who had Medicare coverage for a certain time period, say the first six months of 2016, wouldn't we want the standard errors to only be estimated within the subsample, or would we still want to use the whole sample?

So this is where you use the subpopulation command, because this is the special area of what's called domain estimation, estimating within subsets of the population. You might sometimes get appropriate point estimates, but if you don't account for how the subpopulation fits within the design, you're not going to get appropriate standard errors. As for getting into the specifics, I think that's a little bit beyond the scope, but I would recommend looking into the Heeringa, West, and Berglund book. It's a fantastic resource that covers a broad range of statistical packages and particular applications, including domain estimation.

Okay. And then we have a series of questions all related to weighting. After conducting both weighted and unweighted analyses, how can we decide which set of results to report?

So that's an excellent question. Number one, a lot of people confuse statistics with a toolbox of tests. When you're comparing results, the most useful thing is not usually a statistical test; it's usually a more qualitative look at the estimates, the standard errors, and the differences between them, paying particularly close attention to things that are related to the way in which the data were collected. Since, for example, NHANES samples differently according to race, ethnicity, age, and other factors, you want to pay close attention to those factors in your model when you're comparing weighted and unweighted analyses. The Lumley and Scott reference has an excellent side-by-side example of a case in which the unweighted approach may be okay versus one where an unweighted approach is probably going to lead to bias.

Thank you. And somewhat related: if we have the weighting for the whole survey, and then we apply inclusion and exclusion criteria, resulting in a changed sample size, should we use the same strata, cluster, and weight variables as in the original whole survey?

So I would highly recommend following the analysis guidelines. NCHS has done a lot of work putting all of these things together, and they've got worked examples of how to do this in all major statistical packages, so they'll show you how to appropriately produce estimates for, let's say, subpopulations or subsamples of the survey.

So Josh, I'm sorry to interrupt. Josh, why are you talking to me?