Hello, and salutations. Welcome to today's lecture, working with NHANES data. Welcome back if you've been here before, and welcome if you're new. I'm Monica Wahee, a public health data scientist, which is another way of saying an epidemiologist and biostatistician, and I like to do data science with my epidemiology and my biostatistics. And so today, I'm going to talk to you about the NHANES data set. So thank you for coming. I usually just talk a little bit at the beginning to make sure that everybody has a chance to show up. So I'll just start by talking a little bit about the NHANES data set.
So NHANES, I'll get into what NHANES stands for, but just to say it, I'll probably get this wrong. Let me see. National Health and Nutrition Examination Survey. I think I did it right. Because sometimes I say nutrition, and it's not nutrition, it's national health and nutrition. But this was named in like the 60s. So National Health and Nutrition Examination Survey. So this is a United States cross-sectional surveillance data set. It's a survey with an examination, and we'll return to that E. So I'm going to talk about this data set because it's not as popular in the U.S. for data wonks like me to analyze as the BRFSS, which is the Behavioral Risk Factor Surveillance System. And that's because the BRFSS is done by phone. And if you can do a cross-sectional surveillance study by phone, you can get like 455,000 people, right, which is big. That's a lot. That's a lot of people per state. It's really big. But remember the E in examination. If you examine people and it's not over the phone, like you're actually in person, you can't have a big N like that. So part of the reason why BRFSS is more popular than NHANES is that NHANES is a smaller data set. Okay. But as you probably guessed, well, wait a second, that E is pretty cool, right? Examination. I mean, you don't have that data in BRFSS. Won't that be cool data? And we're going to totally get into that.
But curb your enthusiasm. Okay. If you're like, whoa, there could be lab data. There could be smoking data. It could be all kinds of stuff, you know, height that they actually measured, not like a BMI where I just lied about my BMI, like they actually measured it. Yeah. That's definitely in that data set. Excuse me. But just curb your enthusiasm. All right.
So before I continue, I just want to put a plug in for my free online workshop. It's coming up this Saturday and Sunday. It's a two-day workshop now. I was holding it over three days during the week. But this is the first time it's being held on Saturday and Sunday. So maybe that's a little more convenient. We'll see. It's in two group sessions. And the topic is application basics. Like, what are the basics of applications? Because I know, you know, data scientists like me, I don't really build applications. I do a lot of application design, but why am I designing? Because I want the data from the application, right? Because I'm a data scientist. And so I had to learn about application basics. And I want to teach you about it. And the theme is integrating application pipelines, which, if you're not really sure what that means, well, a thing that you should really care about is that if you don't have an analytics platform in your application pipeline, you might not be able to get the data out and do your analysis. So you probably want to be empowered with this knowledge. And I admit that this workshop is fun because we all talk, we all discuss different applications we use. But what you really learn about is terminology. Terminology we use in application development. You learn what it means and how to use it. And I know when I learned epidemiology, one of the big things was teaching us terminology, and then we were so powerful with our language. And that's what I want you to get out of this workshop, to learn that terminology.
The workshop is based on an actual course, an online course, which is part of my public health data science online group mentoring program, where you make portfolio projects. And that sounds exciting. Like if you get excited about the NHANES data and you're like, oh, I want to do a portfolio project, then I really think you should show up at this workshop, because then you could see what it would be like if you were in the mentoring program. And also talk to me if you think you want to be in the mentoring program, because that's what it's for, right? But if you're not sure and you just want to have a fun little networking and learning experience this Saturday and Sunday, please sign up for this workshop.
All right, so back to our regularly scheduled program. We're talking about the NHANES data set, which is available online. You can just download it and use it. It's pretty well documented. We'll get into that. And it's a cross-sectional health data set that includes E for examination. And so that's kind of where we're going with this. Okay. So, oh, and you want to download these slides. Oh, I forgot to put that link in here. Later I'll put in the link for the slides, but you want to download these slides because I've got three links for you. They're actually just blog posts that I have, but they're blog posts about NHANES. The first one just explains NHANES a lot, kind of in a technical sense. The second one has a list of data sets that I've curated that you can use for portfolio projects, and NHANES is in there. And my third link is ideas for portfolio projects. And again, if you were in our mentoring program you'd probably find those particularly relevant, but some people are do-it-yourselfers, and you can do your own portfolio project. So, NHANES, as I said, is in the United States, and it actually is our oldest surveillance effort. It started in the 1960s.
Now, that is good and bad at the same time. There are some limitations to the fact that it started in the 1960s. If you think about it, there's a lot of stuff we didn't know in the 1960s, so we didn't really think to ask. And also there's a lot of stuff we asked in the 1960s that we want to keep asking the same way, so we can trend, but then it's old-fashioned. And also things just changed, like how people use tobacco has changed, and how people exercise and what they do for exercise has changed.
Since the 60s, how they do NHANES is they go door to door and they examine respondents in a mobile examination center, which I neglected to put a picture of on the slide, but it looks kind of like a big bus. Or if you've ever given blood, it looks like one of those bloodmobiles. And what happens when you go inside, well, I've never actually done NHANES. In fact, I don't even know where they sample from, to be perfectly honest. I know in North Carolina some people have done it, I heard about that, but I'm sure it has to be all over the United States because this is a stratified sample. But you go inside and they'll do the different examinations. Like my colleagues who are dentists, they're very interested in the oral health data, because there's an oral health examination where they measure the teeth. They actually probe the six sites on each tooth, if you know anything about dentistry, which I didn't until I had those friends. Another thing they do is they have exams and then they have questionnaires. So they have this osteoporosis questionnaire where they ask all these osteoporosis questions. And then they have this oral health exam where they examine the oral health, but then they might also have a questionnaire where they ask you about oral health. So it's pretty complicated. They have a lot of different measurements.
Like honestly, if you went through all of them, you'd probably be exhausted if you were a participant. But anyway, one thing that I want to highlight about NHANES is that a lot of the questionnaire questions on the BRFSS are the same on NHANES. So you might have heard of the general health question, like how would you rate your general health? Excellent, very good, good, fair, and poor, I think, are the choices. Those questions are identical in BRFSS and in NHANES, but you get a lot more in NHANES because you've got this examination and you also have other questionnaires.
Okay, so what's in NHANES? Look at this big diagram. There are just so many questions they ask. They ask about comorbidities. They have a big diet module. They talk about alcohol consumption, reproductive information, you know, things like women's reproductive health. And they also ask men about prostate. They ask about certain medications, like for osteoporosis: are you prescribed any medication? They ask you about your leisure activities, and they even, I think, do labs. They have some labs. We'll look at this in a second. But if you're salivating, going, oh my gosh, I didn't know that they actually have this data available online, that's kind of why I'm presenting this, because it's kind of a curb your enthusiasm thing. There's kind of an issue with it. Okay, so don't get too excited, but I'll end on a happy note, because I can tell you what you can do with this data that nobody's really doing right now, that would be very interesting from a data science perspective.
So you might go, okay, so this data doesn't have a lot of rows. It's not long data. It's wide data, it has a lot of different variables. Okay, fine. But it's really well measured data and it's run by the government. So why aren't more people using it? Right. Well, there's just a lot of issues. First of all, they have documentation.
Well, they have documentation, which I'm going to show you, that's laid out in a sort of easy to navigate way. But the problem is that it's not really that complete. So you'll find that you'll get to some variable and be like, oh, wait, this is not everything I need to know about this variable in order to use it in analysis. And so then you end up having to look up more, or just not look up more and give up, or whatever. You know, it's very challenging to design with the NHANES data.
If you're used to BRFSS, the way BRFSS works is there are these core questions that are mandated by the federal government, and every state has their own phone rooms where they call people in the state to ask the questions. This is BRFSS, the phone version. Okay. And what happens is, like in Minnesota, where I'm from (I'm in Massachusetts, but I'm from Minnesota), there's a very large Native American population. And so I just remember there were some issues where we wanted more knowledge in Minnesota about that. So we used a Native American issues module, which adds on to the questions, but it's just a module. It's maybe even one or two questions. It's not very big. So if you study the BRFSS, you'll see that there's the core data, which is what we normally download and use. And then the states will use different modules. And if you actually run a state department and you've got to do this, you've got to pick your modules each year and you've got to figure out how to do the weighting and all that stuff. So if you're using BRFSS core data, it's just one big flat table. You just download it. If you actually work at a state department and you've got the data from the modules, they tend to be stored in separate data stores, even though they have a one-to-one relationship with the people who were called on the phone.
Usually, I mean. They might have just done certain modules with a sample of people, or maybe one module just for women or something, but then it becomes fragmented and it becomes a little hard to manage. Well, in the NHANES data, everything is a module, like literally everything is a module. Every set of questions. Remember on BRFSS, they'll have diabetes questions and cancer questions, whatever, but they're all in that big flat table. They're all in that big core question table. Well, here, all of these questions are in different modules, and those are different data sets. So the technical term for that is a federated data set, meaning that they all can be linked, theoretically. There is this SEQN, which stands for sequence number, which is like the primary key. It's kind of like the study ID. And in every table, there's only going to be one occurrence of the SEQN. So in other words, let's say there's somebody whose SEQN is 12. In any of these little modular tables, there's only going to be one record for number 12, participant 12. But what if there isn't? What if there are zero records for participant 12 in that table? That's what the problem is, some missing data, right?
So let's go over to the docs here. Here's the documentation. If you go to that link on the survey, well, first of all, this is the blog post that I made. And you'll see it's actually kind of long, right? And you'll see that there's a lot of explanation, pretty detailed explanation. And actually, if you come to my next two events, like lectures, I'll be going into more detail about some of the operations in here, like some of the things you have to do in R, if you're going to use R, or SAS, whatever, the manipulations you have to do with the data in order to get an NHANES data set together so you can actually answer a question. But for today, I'm just going to go over how you use the documentation and how you can try to even make a question out of it. All right.
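To make the SEQN linking idea concrete, here is a minimal sketch in plain Python. The two tiny tables are made up for illustration, and the variable name `LBDHDD` for HDL is an assumption that was not named in the lecture. The point is only that a left join from the demographics table keeps every participant and makes the "zero records for participant 12" case show up as an explicit missing value.

```python
# Toy stand-ins for two NHANES module tables, each keyed by SEQN.
# All values are invented; LBDHDD (HDL) is an assumed variable name.
demographics = {
    12: {"RIAGENDR": 2, "RIDAGEYR": 54},
    13: {"RIAGENDR": 1, "RIDAGEYR": 31},
    14: {"RIAGENDR": 2, "RIDAGEYR": 67},
}
hdl_module = {
    12: {"LBDHDD": 61},
    14: {"LBDHDD": 48},  # participant 13 has no record in this module
}


def left_join_on_seqn(base, module, module_columns):
    """Keep every SEQN in `base`; fill None where `module` has no record."""
    joined = {}
    for seqn, base_row in base.items():
        module_row = module.get(seqn, {})
        row = dict(base_row)
        for column in module_columns:
            row[column] = module_row.get(column)
        joined[seqn] = row
    return joined


linked = left_join_on_seqn(demographics, hdl_module, ["LBDHDD"])

# Participants who are in the study but absent from the module.
missing_seqns = [seqn for seqn, row in linked.items() if row["LBDHDD"] is None]
print(missing_seqns)
```

Starting from demographics rather than from the module is a design choice: it keeps the full denominator in view, so the size of the gap is visible before any analysis decisions are made.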
And so that's over here. So let's just go to this. So this is the CDC. And one thing I just have to tell you is it looks like they mushed together a few years, like 2017 to 2020 got mushed together in the same data set. And the last time I looked, I guess, they didn't do that. But if you read a lot of NHANES publications, you'll find almost every publication uses more than one year together. And that's because of this lack of data issue. First of all, it's smaller than BRFSS to begin with, because it just has to be. And secondly, because of this thing I'm going to show you right now. Okay.
So I'm going to look at the documentation for this file. And if you go to the main page, you'll see that there appears to be documentation for a more recent file that ends in 2023. But if you go there and you scroll down, see all this stuff? This stuff is grayed out on that 2023 one. So this is like the live one, right? So if you scroll down from up here, you'll see data documentation and codebooks, using the data, contents at a glance, and contents in detail. Okay. And honestly, where I always start is in this quadrant up here. I feel so smart. Okay. So what we're going to do today is just go down the rabbit hole over here. And as we do that, what you're going to go is, oh, I kind of get why there are the other quadrants, right? Because if we look at, for instance, dietary data, we'll look at their documentation and have a million more questions. And so we'd have to go down here. Or for instance, let's say we look at the lab data and have any questions, we would have to go look here. And so basically, this is kind of your beginning of trying to design your project. And if you have any questions at all, you have to really read about it over here. Now, first, I just want to point out this says demographics, and then dietary, examination, laboratory, questionnaire, and limited access.
So we're not going to care about limited access. But remember how I said everything is sort of separated into these different questionnaires, like diabetes questions and osteoporosis questions, whatever? So when I click on this, we're going to see a list of data sets. Okay. Also, I said that there are different labs that they do. And so when I click on this, we're going to see a list of the lab data sets. And remember how I was talking about those examinations, like they weigh you and they measure you, and then that oral health thing? When I click on this examination data, we're going to see that. And there's a whole bunch of dietary measurements, and when I click on this, we're going to see a list of those data sets. So already you're like, oh my gosh.
But where I'm going to start is this top one, demographics data, because I really want you to start there. Okay. And the reason why I want you to start there is that's kind of like the denominator. This is the most people in any data set. This has one row per person in the whole study. Right. So you can kind of see the universe of people who could be in your data set. Right. So let's click on this demographics data. Now, you know how I said I could click on laboratory, I could click on whatever. All of those are set up the same way this one is. At the top, you have all these links to different procedure manuals and instruments. As you can see, because they mushed this together, they have to put different years here. And I usually don't really go into these at the beginning. I might have to go into these later, but this is just me figuring it out, right? So here's the one demographic data set. But remember how I said if you click on, like, examination, you're going to see this blue thing and you're going to see a whole bunch of data sets. Okay, but I just thought we'll start here. We'll do the one.
And so what do you see here? You see years, the name of the data set, and then you see this P_DEMO Doc, which it says is the doc file, and then you see this data file, which is XPT. Now let's say you're a SAS user. You might have heard of XPT. XPT is SAS's transport format, kind of like SAS's version of a zip file in the sense that it packages a data set up so you can move it around, although it doesn't actually compress anything. So normally when you have a regular old SAS file, let's say I had a SAS file called hospital data, it would be called hospitaldata.sas7bdat, right? It has the longest extension. Okay, let's say I needed to hand that file to somebody on a different system, like on a CD or DVD or something, and I wanted to pack it up. I can use SAS to turn it into an XPT file. XPT stands for export, I guess, or transport, and it packs the data set into one portable file. And then I can put that on the DVD. Okay. The issue is that when you use SAS and you download an XPT, you have to unpack the XPT into a sas7bdat. And I know I cover that in my SAS course on LinkedIn Learning. And I know I cover it in my book on SAS. It's a little hairy, but it's not that hard. What I actually do, and if you show up to my subsequent events you'll see, is I use the foreign package in R, and it just turns it into an R data set. You just read it in as an R data set and jump over all that SAS business. But anyway, if you wanted to download the data set you'd click here, but I'm usually not downloading anything at this point. I'm just shopping for data.
So let's just click on the doc and see what happens. Okay, I just want to sort of show you around what happened here. So those of you who are older, go back to Windows XP. Windows XP, SAS, let's go back 20 years, yeah. All right, now you're in the right mode. So this looks like it's about 20 years old. It's got two panes. There's a pane on the right that has this, and then a pane on the left here. So, remember what we just did. We just clicked on the documentation for an actual data set. This happens to be demo.
If you wanted oral health examination, you're going to get another page like this, right? So you just kind of have to get used to this format. So what does this page say? It has some header stuff and some description and, you know, some information in here about the different variables. But this is more like qualitative stuff that you can read. That's deep. Okay, but let's say you just want to shop for variables. Then you want to look over here at the right. You can see these kind of look like variables, like SEQN, the sequence number, and then you have data release cycle, but like here, RIAGENDR, that's gender, and then RIDAGEYR. Everybody's used to RIDAGEYR, it's like from BRFSS, and if you scroll down, if you're an old government wonk like me, you're kind of used to some of these names. But this is a demographic data set, so there's nothing really more than demographics: marital status, pregnancy status at exam.
And in fact, let's just look at marital status. So I'm going to click on this, and I want you to see what happens. See that? It just basically scrolled down here. See, this is marital status, pregnancy status. I could just scroll up by hand over here. That's not to put down that list on the right. It basically is a lot like a PDF, you know, when you have a PDF with the bookmarks, but those are on the left. And this is kind of like a PDF with the bookmarks on the right. So I'm not putting it down, I'm just saying, wow, this is really old-fashioned. Okay, so we get to here. And those of you who are used to SAS, you're sort of seeing that all this looks like SAS codebook output, right? Which is not my favorite format for documenting data, but it's got the information in it. So we go and we see that this is the variable name, so if we wanted this, after I downloaded that demographic data set, I'd have to go look for this variable. And it says target: both males and females, 20 years to 150 years.
You know, I hate it when we have to go to my mom's 150-year birthday. 150 years, what the heck is that? Okay, so here's what I have learned about this data set. Each of these questions is not asked of everybody, even in the demographic data set. Obviously pregnancy status at exam is not asked of men. Okay, so the issue is that you often do not know which questions are skipped, because this codebook is not that clear. If you look at the actual questionnaire, or even the codebook for the BRFSS stuff, you can totally tell where all the skips are. There's no question. Here, the documentation is just not that clear.
And so sometimes you see something like, let me look at the pregnancy one. See this? So these are the values of the answer. So if you looked up RIDEXPRG, you'd see 1, 2, 3, or missing in the variable. One would mean yes, positive lab pregnancy test or self-reported pregnant at exam. And see the count? This is the codebook count for the whole data set. There's only 87, which kind of makes sense. You know, how many people would be lucky enough to be pregnant when they went into the MEC? And the second one is the participant was not pregnant at exam. So it looks like there's some exam that this came out of, even though it's demographics. Again, if you're going to use this, you'd have to go down the rabbit hole reading about how that got there. This is just a little quickie summary. And so apparently these people were examined and they were found not to be pregnant. And the third one is cannot ascertain if the participant is pregnant at exam. I guess they had a crappy test, or I don't know why that happened. But then we have this missing, and you just kind of assume all the men are in there. And so you assume that there was some sort of skip, and that makes you want to go back and do some more research before you use this variable.
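One quick way to test the "all the men are in the missing category" assumption is to cross-tabulate the variable against gender before using it. The sketch below does that in plain Python on a handful of made-up rows. The coding 1 = male, 2 = female for `RIAGENDR` is the usual NHANES convention but was not stated in the lecture, so treat it as an assumption here.

```python
from collections import Counter

# Invented rows for illustration. None stands for a blank (missing) value.
# RIDEXPRG: 1 = pregnant, 2 = not pregnant, 3 = cannot ascertain.
rows = [
    {"SEQN": 1, "RIAGENDR": 1, "RIDEXPRG": None},
    {"SEQN": 2, "RIAGENDR": 2, "RIDEXPRG": 2},
    {"SEQN": 3, "RIAGENDR": 2, "RIDEXPRG": 1},
    {"SEQN": 4, "RIAGENDR": 2, "RIDEXPRG": None},  # a woman with a blank value
    {"SEQN": 5, "RIAGENDR": 1, "RIDEXPRG": None},
]


def crosstab(records, row_var, col_var):
    """Count records for each (row_var value, col_var value) pair."""
    return Counter((r[row_var], r[col_var]) for r in records)


table = crosstab(rows, "RIAGENDR", "RIDEXPRG")
for (gender, status), count in sorted(table.items(), key=str):
    print(f"RIAGENDR={gender} RIDEXPRG={status}: {count}")
```

In this toy data, one of the blanks belongs to a woman, which is exactly the kind of thing that points to an undocumented skip (an age restriction, for example) and sends you back to the documentation.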
And again, if you show up at my subsequent events, I'm going to show you a little bit about curating this data set or other data sets like this. I mean, they're all kind of like this. This one's a little more fragmented than most of our data sets. So getting back to marital status, I want to go over that one because that's more logical. You would ask everybody about it, right? But here is the issue, right? Here are the answers. So 1 is married or living with partner, which is kind of nice. I liked it that they grouped these together. And then 2 is widowed, divorced, or separated. I like that they grouped those together, because I end up grouping those together. They're separate in the BRFSS. And 3 is never married. And you can see these are the frequencies. This is the cumulative frequency and this is the frequency in the data set. Now, remember the denominator, right? So let's say you decide to study osteoporosis, and that's more likely to be women, and then you're trimming off some of the data. You might not have this distribution.
But then here's what's particularly annoying about this data set: you have 77 for refused, 99 for don't know, and blank for missing. And I've had epidemiologists go crazy telling me how important refused is. It's different if somebody says refused than if they don't know. It's super important. I've been the recipient of that rant so much in my life, but it stopped in about the early 2000s. People stopped saying that after SQL, after there was SQL. I don't know why people were really big into refused is different from don't know is different from missing. To me, from an analytic point of view, it's the same, because first of all, you never get that many. And second of all, missing? Why is anything missing? Who did this data collection? So again, it depends on your research question. Do you even care about marital status?
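Treating refused, don't know, and blank as one analytic category comes down to a small recode step. Here is a minimal Python sketch of that recode, assuming the 77 and 99 codes described above. The sample values are invented. The sketch also keeps a tally of the original reasons, in case someone later does want to tell refused apart from don't know.

```python
from collections import Counter

REFUSED = 77
DONT_KNOW = 99


def recode_missing(value, special_codes=(REFUSED, DONT_KNOW)):
    """Return None for blank, refused, or don't know; otherwise the value."""
    if value is None or value in special_codes:
        return None
    return value


# Invented marital status answers: 1, 2, 3 are real categories.
raw_values = [1, 2, 3, 77, 99, None, 1]
recoded = [recode_missing(v) for v in raw_values]

# Keep an audit trail of why each value ended up missing.
reasons = Counter(
    "refused" if v == REFUSED else "dont_know" if v == DONT_KNOW else "blank"
    for v in raw_values
    if recode_missing(v) is None
)
print(recoded)
print(dict(reasons))
```

One caution about this approach: the special codes have to be applied per variable. On a numeric variable such as age in years, 77 and 99 could be legitimate values, so a blanket recode across the whole table would destroy real data.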
And so this is again kind of a problem, because if you take my courses on LinkedIn Learning, especially my study design courses, you'll see it's difficult, because you want to make a hypothesis but you don't know what you have to work with. So you kind of have to review the literature and get an idea, and then go review your data set and try to figure out what's going to work, and sort of meet in the middle. Okay, so if I was going to say, okay, I read the literature and I want to say being divorced is bad for your osteoporosis or something, right? Maybe this is okay because I've got a bunch of divorced people in here, but maybe it's not okay because widowed and divorced are different and they're all put together. You see what I'm saying? And so I might have to go back to the literature and deal with that. Okay, but this is just basically how to read one of these pages, and if you have any questions I'll look at the chat and tell you.
Okay, so now I'm going to just back up. And we're going to go back. So remember, this was the main page. I'm backing up on purpose because it's so darn confusing, right? So now we're back. So we went over demographics data, now let's go and do dietary data. Ready? Okay, here we go. And I also want to add something when I bring up this dietary data. My understanding after reading about a lot of this is that these dietary data from NHANES are used to calculate stuff about the American diet, like how many carbs we have and how many fruits and vegetables we eat. Really, the stuff I've read has led me to believe that this data set is where they get that from. Okay. So here we have it. Now another thing I want to be clear about, and I haven't worked with the diet data, is that you might see different variables on the questionnaires in the data collection than what is served up here. Okay, so for example.
I was looking at a data collection form, and it asks if the person ever ate sushi, or how much sushi they ate, but I cannot find the answer to the sushi question in any of this documentation. So what I feel like they've done is taken the data and transformed it into something else, and now I don't have the raw data, and so for some of the questions on there I don't know the answer. So here we have dietary interview, individual foods first day, individual foods second day.
So just to remind you, there are different ways to measure diet. The way that's considered probably the most valid, but it's still kind of messy, is with a food diary. A food diary is where you literally write down everything you eat in a day. You keep a lot of notes. And now we're in the land of apps, right? If I was running a diet study I'd get an app where people were taking pictures and putting in exactly what they were eating. And I could calculate it really well, but in the olden days it wasn't that easy, right? I mean, we'd send them cards and say, okay, how many beans did you have, or whatever. But the participants are really awesome. You know, if people are in these diet studies they really apply themselves, so they're cool. So we get this food diary data. And then the other way you can do it is with a food frequency questionnaire, which is actually based on food diary data. But, you know, I knew somebody who did a study of Japanese Americans. And so maybe a food frequency questionnaire for people who are not Japanese Americans isn't going to work in Japanese Americans, but a food diary will. So I'm not sure what was really going on here, but it looks like they did a bunch of food diaries. And as you can see, they made a whole bunch of data sets out of this. And I won't go into these because I don't really understand them, but you might be like, well, why did they do that? And I think it's because they're feeding that official American diet thing.
So one of the things I realized when I worked at a state health department for a short time (mostly when I've worked for the government I haven't worked at a state health department, but I did for a very short time), one of the things that kind of hit me over the head, is that the government collects this data for the government. The government has needs, right? And so the government sort of prioritizes its own needs. So if you were like, I don't really care about diet, why do they go through all this food diary stuff? Well, the government needs data, so sorry, that's going to take precedence over what you want, right? And if the government's not getting the data, the government can't make policy. The reason why I just said that is I've noticed in the BRFSS they cut the core by like half the questions. They don't have good measurements of a lot of things, like marijuana use and e-cigarette use, in my opinion. And I don't think they have good mental health measurements. Look at the epidemics we're having in the US. Shouldn't they be doing a better job of measuring that? So when the government's not measuring stuff, the government needs to watch out, right?
Okay, so now I'm going to click on the examination data one. This is one that you're probably more likely to use. Unless you're a nutritionist, you probably wouldn't use the other one. But I want you to just see. Now remember, this is examination, so a clinician or somebody measured these people. So we have, like, audiometry. So if you're into hearing, you could study that. You'd have to know how to unpack the data and figure it out. Here we have blood pressure. That could be very interesting. Right. We have body measures, and I'll click on this one because this is probably something we would really want, because we have BMI in here.
There's some weird stuff in here, like we have head circumference, and we have standing height, upper leg length, upper arm length. So I didn't know what was going to be in here, but now I see that there's hip circumference, waist circumference. You know, waist-to-hip ratio is a thing. So if you were studying waist-to-hip ratio, you could get it out of this, right? So this is kind of interesting stuff. And that was just this body measures. The reason I clicked on it is I've clicked on it before, so I kind of knew what to expect. Here, I don't really know. So this dual-energy X-ray absorptiometry or whatever, I don't even know how to pronounce it. This is a lot of that bone mineral density stuff, I think, that is related to the whole osteoporosis thing. But as you're probably realizing, even I don't really understand all these data sets, and I'm really kind of good at this stuff. Each time I use a data set in NHANES I have to really study it and study the domain.
But I'm going to literally click on this one, this oral health dentition, because I've pretty much memorized this one, because I did studies with it. Okay. So I'm not an expert at dentistry, but one of the things that's really important to dentists and public health people is how many teeth people have lost, because you don't want to lose your teeth, and you can lose your teeth in two main ways. One is through infections. You know, there are different ways you get the infections, and then the teeth fall out. And the other way is trauma. You just get hit with something and your teeth are gone. But the idea is that the more common way you lose your teeth is through these various conditions that are infections, essentially. And so it's nice to just know how many teeth a person has, and that can kind of be a proxy for how healthy their mouth is. But of course, that's not even in here, right? How many teeth the person has is not in here. But for each tooth, we have, see, there are 32 teeth.
The clinician in NHANES filled out, for each tooth, its status, right? Like primary tooth present, you know, that's your baby teeth, right? Or permanent tooth present. See, that's why this is zero, because these were adults. And then dental implant; I guess nobody had a dental implant at tooth number one. Tooth not present, so this would be like, okay, they don't have the tooth. And permanent dental root fragment present, that's kind of a weird one. Could not assess, and missing. So I would have to go through each of these. If I remember correctly, we didn't do that study. My colleague and I were working with NHANES, but we were looking at different things. And I think we just didn't have the patience to go through and add up the number of teeth from this. And so many were missing. See how this is 494, 494, 494? I would expect that these 494 are the same 494, you know, and that those are 494 people we'd have to take out of the demographic data set if we were really going to count the number of teeth and wanted a valid number. So that's why I'm saying, remember how I said, curb your enthusiasm. These data sets are not served up in a very easy-to-use way, and I totally don't know why, because the government is using them for their own stuff. You know what I mean? Okay, so that was the examination data. Let's go into the laboratory data. I've never used the laboratory data, but I have wanted to. Okay, but you can already see what the problem is. Albumin and creatinine from a urine sample is separate from arsenic from the urine sample. Here's HDL. And here's LDL and triglycerides. So if you're a lipids person, and in my olden days I used to be a lipids person, you know, in your data set you're going to want HDL, triglycerides, LDL, and total cholesterol. You want that in one data set. Okay.
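Going back to the tooth count for a second, here is a minimal sketch of what "adding up the number of teeth" could look like. This is an illustration, not the way my colleague and I did anything. The numeric status codes below are assumptions that just follow the order the categories appear in the codebook, so check them against the actual codebook before relying on them, and the choice of which statuses count as a tooth is an analytic decision, not an NHANES rule. The records are made-up toy data.

```python
# Assumed status codes (verify against the codebook before use).
PRIMARY, PERMANENT, IMPLANT, NOT_PRESENT, ROOT_FRAGMENT, NOT_ASSESSED = 1, 2, 3, 4, 5, 9

# Analytic choice for this sketch: only natural teeth count.
COUNTS_AS_TOOTH = {PRIMARY, PERMANENT}


def count_teeth(statuses):
    """Count natural teeth across 32 positions.

    Returns None when any position is missing (None) or could not be
    assessed, because a partial count would not be a valid tooth count.
    """
    if len(statuses) != 32:
        raise ValueError("expected one status per tooth position (32)")
    total = 0
    for code in statuses:
        if code is None or code == NOT_ASSESSED:
            return None
        if code in COUNTS_AS_TOOTH:
            total += 1
    return total


# Toy records keyed by a hypothetical respondent ID.
records = {
    1: [PERMANENT] * 28 + [NOT_PRESENT] * 4,
    2: [None] * 32,  # nothing recorded for this person
    3: [PERMANENT] * 30 + [IMPLANT, ROOT_FRAGMENT],
}

tooth_counts = {rid: count_teeth(statuses) for rid, statuses in records.items()}

# People with a valid count; everyone else drops out of the analysis.
valid = {rid: n for rid, n in tooth_counts.items() if n is not None}
```

The people who come back as None are the ones you would have to take out before linking to the demographic data, which is the same idea as those 494 missing records.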
Now, if this HDL says August 2021 and this LDL says October or whatever (and that's not when it was measured, that's the date published), I don't know if I can get these measurements on one person. Like, how many people have all three of these? And that's sort of what I'm going to cover in my subsequent events: how do you get a data set of these people where it's not missing? So let's just look at this really quick. See this codebook? So cute. Which one did I click on? This is the HDL. Okay. So cute, and only a few things. So here is the actual HDL measurement. And again, the code or value is 5 to 189, but you don't know the distribution. And look, there are like 1,370 missing, right? So remember that number, 1,370. We're going to back up. So that was HDL. Let's look at LDL. Okay. Well, this has got a little bit more. Here's triglycerides. Okay. Missing is 440. So it's like, okay, well, that's actually better than the last one. But if you're like, "Well, Monica, can you even do lab studies?", you actually need a lot of different labs, and you need to look at how they're correlated. I hate to say this, but I don't know if you could put together a real, actual analysis with an exposure and an outcome in a subpopulation, and be able to put all the confounders in and do everything you need to do. But, and this is the fun part that I promised you, I think you could do a lot of really interesting data studies, like portfolio projects, like visualizations, especially of this lab data. So the problem, if you're an epidemiologist, is you're seeing a lot of selection bias, right? Why are those 440 missing? Why is the number different in each one? Who's different in each one? So that's selection bias.
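To make the "how many people have all three" question concrete, here is a small sketch under stated assumptions. The three dictionaries stand in for the separate lab files, keyed by a respondent ID, with None marking a missing value. The IDs and values are invented for illustration, and the names hdl, ldl, and trig are placeholders, not the actual NHANES variable names.

```python
# Toy stand-ins for three separate lab files: respondent ID -> value.
# None marks a missing measurement. All numbers are made up.
hdl = {101: 55, 102: None, 103: 48, 104: 62}
ldl = {101: 110, 103: None, 104: 95}
trig = {101: 150, 102: 90, 104: 130, 105: 200}


def non_missing(table):
    """Return the set of respondent IDs with a recorded value."""
    return {rid for rid, value in table.items() if value is not None}


have_hdl = non_missing(hdl)
have_ldl = non_missing(ldl)
have_trig = non_missing(trig)

# Only people present, and not missing, in all three files can be
# analyzed together as one lipid panel.
all_three = have_hdl & have_ldl & have_trig

# One row per person: (HDL, LDL, triglycerides).
panel = {rid: (hdl[rid], ldl[rid], trig[rid]) for rid in sorted(all_three)}

print(len(have_hdl), len(have_ldl), len(have_trig), len(all_three))
```

The point of printing the four counts side by side is that each file's missing count looks tolerable on its own, and it's only the intersection that tells you how many people you actually have.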
So for those of you who are pragmatic, who say, "Well, why don't I just take the demographic data, and then this from the exam data, BMI from this, and labs from that?", let's say I start out with, you know, 5,000 people. And just to get the filled-in variables, I get it down to about 200 people. They're the ones with all the filled-in variables. And I'll be like, those are probably the 200 healthiest people. Those are the 200 A students. That's what I say. That's basically what selection bias is: you're trying to do a study on a whole bunch of people, but you only get complete data on a few people, because they're the A students in your study. So I'm like, okay, I don't feel comfortable about any of this, and then plus there's weighting. How can you put the weights in for the states and all that on top of all this, right? And so I don't really trust NHANES data myself. I don't trust myself to use it. Maybe somebody else can use it. But I do trust myself to use it as real-life data where you can take a look at relationships. You can look at relationships between the different labs, you know, within the selection-biased data set. You can study it descriptively, right? That's what is kind of cool. So if you're like, "Okay, I want to do a portfolio project in health data science, and I want to have fun joining data sets and looking at selection bias," NHANES is perfect. And then after that, you could probably make some cool visualizations. You could probably look at the correlations in this lab data. Let me get on to this questionnaire data. That's sort of the last tab I was going to go over. Again, this is all fragmented. There's a different number of people in each data set. Some of the questions are kind of lame, like you would not want that question. But some of the questions are really cool and interesting, and not many people are doing anything with them. Like look at this early childhood stuff, health insurance, hepatitis, access to care.
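Here is a rough sketch of that "5,000 down to 200" idea, assuming toy data. It tracks how many people survive each join, and then compares the people who made it through with the ones who got dropped, which is a simple descriptive check on the "A students" problem. The IDs, ages, and the function name attrition are all made up for illustration.

```python
from statistics import mean

# Toy demographic file: respondent ID -> age. All values are invented.
demo = {1: 34, 2: 71, 3: 45, 4: 29, 5: 63, 6: 52}

# IDs with a non-missing value in each of the other (hypothetical) files.
has_bmi = {1, 2, 3, 4, 6}
has_labs = {1, 3, 4}


def attrition(start_ids, steps):
    """Intersect the starting IDs with each step in turn.

    Returns the final set of complete cases and a report of how many
    people remain after each step.
    """
    remaining = set(start_ids)
    report = [("demographics", len(remaining))]
    for label, ids in steps:
        remaining &= ids
        report.append((label, len(remaining)))
    return remaining, report


complete, report = attrition(demo, [("exam: BMI", has_bmi), ("labs", has_labs)])
dropped = set(demo) - complete

# Descriptive comparison: do the complete cases look like the dropped ones?
mean_age_complete = mean(demo[rid] for rid in complete)
mean_age_dropped = mean(demo[rid] for rid in dropped)

for label, n in report:
    print(f"{label}: {n} remaining")
print(f"mean age, complete: {mean_age_complete}; dropped: {mean_age_dropped}")
```

In this toy example the complete cases are a lot younger than the people who were dropped, which is exactly the kind of difference that tells you the filled-in subset is not a stand-in for everybody. This sketch ignores the survey weights entirely.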
I mean, look at this. All this stuff is in here, and people aren't really doing much with it. And I believe it's because of the selection bias problem, because you just really can't do a lot of inference, right? But you can, like I said, look at relationships. You can say, okay, this is a biased sample, but let's just look at it. It's cross-sectional, so you can't really say this caused this, because you're measuring it all at once. You can make some sort of deductions, and maybe you're just kind of saying this is an example of some American phenomenon, the way their nutrition is. I don't know. But yeah, so this is mainly what I wanted to teach you today: how to use that codebook, how to think about those different data sets, and how to approach designing a study with NHANES data, knowing that the first thing you're going to have to do is a whole bunch of joins and see where your missing data are. And you're probably going to have selection bias. It's probably not a good data set for making any big inferences, but it can be a great data set for just looking for relationships. If you weren't here at the beginning, let me introduce you to my free online workshop, Application Basics. Well, you just got done seeing me navigating an application, right? How did that get there? Why does it look like Windows XP? What should it be nowadays? Well, if you want to answer all those questions for yourself and have a lot of fun doing it, please show up on Saturday and Sunday for my workshop. It's for data scientists who want to learn more about applications, and the theme is integrating application pipelines. It's about how the applications fit together, how they're linked together. Kind of the way I was just describing this NHANES data set: how does it all come together, right?
And the reason why you want to learn that as a data scientist is because of the data you're probably going to be analyzing. This NHANES data set is totally old-fashioned, where we go around doing surveillance. Probably the data you're going to be analyzing comes out of apps, right? Health apps or medical records apps. So you really want to have a basic understanding, a literacy, about how they're designed, how they run, and how they're pipelined together, so you know how to get your data out and you know how to analyze it and interpret it. So the Application Basics workshop is based on an online course that I developed that is one of the core courses in my public health data science online mentoring program. So if you're interested in that, you know, just contact me and I'll talk to you about the mentoring program. We'll see if it's a good fit for you. But in any case, this is my contact information, and hopefully you'll download the slides for this. "Monica, this is helpful. Any thoughts on using BRFSS compared to NHANES?" Okay, I will give you my summary. If the BRFSS core questions have good measurements for what you're trying to study, use BRFSS. If the data are not available there, because maybe you need measured BMI or you need a lab or whatever, try to use NHANES. Okay. So in other words, if you have a research question that you think can be answered with the variables in BRFSS, go ahead and answer it with that. Don't play with NHANES. But if you can't answer it with BRFSS because you need variables from NHANES, like LDL cholesterol or whatever, then you have to go down this rabbit hole I'm describing. And if you show up at the subsequent events and also read that blog post (the blog post basically goes over what I'm going to go over in the subsequent events), you'll see that you have to really study what variables are present in a whole row of data.
Like, how many records can you get with all your variables filled in? And do you feel comfortable enough with that data set to actually answer any questions? So when you approach NHANES, if you're forced into NHANES because BRFSS doesn't have what you want, you then have to be sort of forced into a contingency plan, because the data is not going to be that stable or that big. I hope that answers your question. Any other questions before we leave? I just started a company page on LinkedIn. You know, I'm teaching about applications, and I don't even understand them myself. Like, LinkedIn keeps changing on me. But I found it's better to run events from a company page, and also I can post the videos on it, you know, when I finish the videos. So I'd appreciate it if you would follow my company page and check back, because I'm always going to have these events on there, and workshops, and any learning goodies that you want. All right. Well, thank you very much for showing up today to learn about how you can use NHANES data for something, or at least a portfolio project, even if it doesn't have everything you want in it. And I hope to see you at my future events, and I especially hope to see you at the workshop. All right, have a good week.