We're live. Hello, everybody. Happy weekend. Happy Saturday. Hope I drew you in with your Saturday dose of statistics. Har, har, har. Well, it's Monika Wahi here and today we're gonna be talking about causal inference. But I always start the stream about five minutes early so everybody can join. And today I'm first gonna talk a little bit about causal inference in my field of public health. So what I mean by causal inference is trying to figure out what causes what, especially in the context of a study where, when you're doing the analysis at the end to try and decide if your hypothesis is supported or not, or whether or not to reject the null, you've got this regression equation and you've got all these slopes in it and you're like, okay, what do I do with all these slopes? I mean, you have a study design. You're trying to do something. I remember shortly after graduating with my MPH and trying to do that, I was just like, whoa, okay, I did it right, but what do you say? How do you figure this out? Oh, hi, CJ is here. Hey, everybody. CJ is early. Here he is in the chat. He is my friend. I'm in Boston. He's also in Boston. So if you've been listening to my videos lately, you'll notice some of them have techno music in them, but some of them have this really beautiful guitar music. That's CJ. He's been really nice to me because he's been giving me some music, because he just produces so much music. Actually, let me put a link in the chat to his Spotify, because, like, you think I'm prolific? He's very prolific. And so yeah, go to that link and just see, he posts songs all the time. In his songs, he's singing, he plays all these instruments. For my stuff, I requested some instrumentals to go behind my talking. And already, if you go to my YouTube channel, those of you from LinkedIn, you'll see that people like his music and they like my videos.
So that's another story. But he's also a writer. So go ahead and look up his book. I don't have a link to that, but I read his book. I thought it was really good. But anyway, we're two businesses in Boston. If you need data science, it's me. If you need beautiful music, paintings, poetry, prose, it's CJ. Yep, yeah, you're welcome CJ. I really appreciate all you've done for my business. Alrighty, so that was CJ, and I'll probably bring him up a little bit later too because, again, he's given me this great music for my videos, so I just wanna pay it back. But anyway, one of the videos I just released is just me running ggplot2 code. It can be helpful for the beginner. Like, if you use ggplot2 all the time, you'll be like, okay, well, this is kind of basic. But if you're new, it kind of shows you how you build up a ggplot2 plot. It's not like base R, where you just put all these options in these arguments and parentheses and just run the whole thing. No, it's like you run something, and then you add an annotation, and then you add another rectangle or whatever you're trying to make with your plot. So that was the whole point of the video, and also to just have something running in the background if I needed it. All right, so let's see here. You know what I could do is I could just share the beginning of the presentation here. I made a presentation for us. Yeah, there it is. So it's big up there. Just because I wanted to sort of give the background before I actually go into it, because I was realizing, I was talking to other people. There you are. I was just thinking of you, Joe. Hi, so Joe's here. I'm really glad you're here, because when I was talking to you, Joe... I was talking to Joe. Joe is not in public health, but he makes regression models. And so I was asking him, have you ever heard of these causal criteria?
And he said no because he's in automotive and then I forgot what you said when I asked if you used other criteria. I think you said you used different study designs or something. It wasn't something like obvious to me because he's from engineering and I'm from public health. So even though we're using the same software and doing the same statistics, we just use different tools for interpretation. So okay, so here's, I'm gonna start our little presentation here. Probably I should, does this do any better? I don't even know if this looks good on the screen. No, this doesn't even look good. Let's see here. I'm trying to get it so that, let's see here, place settings, swap here, maybe this looks better. I'm sorry, you're still my guinea pigs here. Let's just look like, oh, it doesn't really matter what it does on the restream. I'm trying to get it to like actually project but I'm gonna not. I'm gonna give up. I'm just gonna go like this and just make it big like that. Does that work? I already can see that. Okay, sounds good. All right, so I'll go back to this. So applied causal inference sounds very sexy, right? Using Bradford Hill criteria. So CJ, if you're still here, you probably know some of these criteria from some of the work we've done together. So first I just wanna set the stage. What are we doing in causal inference? So in epidemiology, which is what I'm in, we, this is the terminology we use. We say we have an exposure, which is something we think causes or prevents an outcome. Now when epidemiology was invented, they didn't use the term outcome. They use the term disease because what they were trying to do is figure out like if you're smoking and you're eating bad and you're not exercising, what's causing your heart attack? Well, obviously all those things cause it, right? And so what are the relative importance of this? 
Like if you only have one public health dollar because you're in, I don't know, Florida, and you can only spend it on one exposure, which of those is it? I mean, I hate to be so blunt, but that's public health training for you, because you gotta break some eggs to make an omelet. You can't make everybody's risk factors go away. So you'll pick one of these exposures to study, whether it's not eating right or not exercising or smoking or something. And then we'll pick on that as the hypothesis for a study, okay? So here, I put exposure: lack of exercise; outcome: heart attack. So obviously we're gonna have to measure more than just lack of exercise. We're gonna have to measure all the confounders. Now, I was talking to Daniel, my other friend on LinkedIn, I have all these friends on LinkedIn, and he's already really good at SAS. He already does regression, but now this is more like a way of thinking about interpreting the regression. And he told me he was a little challenged by the concept of confounding, and actually everybody is, it's really a hard concept. Oh, Joe says, for everybody's benefit, I use Pareto to study events over 12 months of trailing data. That's really interesting. You know, I might get back to what he said about Pareto because I have a funny story about it. It's what we were talking about. But anyway, let me go back to my presentation. So like I was telling Joe, confounders are the things you adjust for. They're associated with the exposure and associated with the outcome, but they're not on the causal pathway. Causal pathway always throws people, but an example of a really common confounder is being low income. If you're low income in the US, you don't really get an opportunity to exercise. Exercise takes time, and you have to have the right kind of space.
Even going walking in a neighborhood can be bad if you're in a low income neighborhood that also happens to have a lot of crime, where it's dangerous outside, or there's just no sidewalks. I'm making fun of Florida, but seriously, Florida doesn't have sidewalks except in very old places. And of course that's where I went to live in Tampa, the old part of Tampa, so I could walk around. Otherwise it's too dangerous. You'd be walking in the street, and if you ride a bike on the street, you'll get hit. So, not surprisingly, Florida has a higher obesity rate than places like Massachusetts, where I am now, where it's very walkable. So when you have the exposure of lack of exercise, if you think about Florida, well, if you can't really go on the streets and walk and use your bike, what else is going wrong down there? There's a lot of exposures, a lot of causes of disease floating around in Florida. A lot of causes of heart attack, like drug use is high. And so how are you gonna study it? If you say, well, I have this hypothesis that lack of exercise among Floridians, let's say, is the cause of heart attack, you're going to have to measure other things. Like, for example, low income. So if you are higher income in Florida, maybe you can afford to go to the gym and you have time to go to the gym. But if you're lower income, you probably don't, and you probably can't go exercise outside in that state. And if you're higher income, maybe you can go to Publix and get their organic food and really eat well and therefore also prevent a heart attack. You can do all these things when you're high income that you can't do when you're low income to prevent your heart attack. So that's why low income, imagine a little arrow going towards heart attack and a little arrow going towards lack of exercise. Low income causes kind of both of them independently.
But it's not like, oh, somebody just stops exercising and then they become low income, and because of that, they get a heart attack. That's the causal pathway thing. So what would be on the causal pathway, and therefore not a confounder, would be like if lack of exercise causes you to increase some biomarker in your body that leads to heart attack. I'm trying to think, I think Lp little a, Lp(a), is a bad thing. I don't know if it's increased by lack of exercise, but let's say lack of exercise causes this lab, this protein in your blood, to go up, and that causes heart attack. Then that protein would be on the causal pathway. The only thing is, you wouldn't adjust for it in the regression model. Like if you for some reason took that lab and you said, oh, that's high, well, then I'd be like, yeah, but you can't adjust for it because it's on the causal pathway. So, you know, somebody might come to me and say, well, what if you are healthy and you exercise and then you stop, and that causes you to lose your job and become low income, and that causes a heart attack? Then I'd sit and stay up all night and think, well, what if maybe it is on the causal pathway? So we spend all our time trying to figure out, well, what do we even put in the model? What's on a causal pathway and what's not, right? And usually what you end up with on a causal pathway is some sort of biological or medical thing that happens. Like if lack of exercise causes diabetes and that causes an amputation, and you're studying amputation, you wouldn't adjust for getting diabetes because it's on the causal pathway. Let me just go, this is a hard concept, let me see if anybody's got questions here. Oh, triglycerides, that's a good one. Oh, Daniel is here. I was just talking about you. Hopefully you heard only good stuff, only good stuff.
I was saying how you were asking about confounding because it's hard, right? It's a hard concept. And I was just going over it here and giving some examples, hopefully you got it. So basically, in causal inference in public health, what we're normally doing is we've measured these exposures and we've measured these confounders. We've measured anything we think might be a confounder. Even if later we don't put it in the model, or later we decide it's on the causal pathway and we can't have it, just measure all your confounders. If you think it's a confounder, measure it. Please, just for me, because I can't put it in the model, I can't even try it, if you don't measure it. Okay, so that's one of the exercises I do with people: I'm like, okay, what's our exposure? What's our outcome? And what are all our confounders? And if you're wondering how you find them, what I often do is just look at the literature. So I start with the literature, but if somebody's coming to me and talking about actual patients we're gonna measure or something like that, they often tell me what the confounders are. Maybe there's something local that happens. One of the things that I didn't really realize, because I don't drink a lot of alcohol, is that there are people in the world who don't like to smoke tobacco regularly, but they do sort of crave it the few times they go out drinking. So maybe they'll go out drinking once or twice a month, and that's when they'll smoke. This is a pattern I didn't even know about. And so if you were studying drinking and heart attack, you could have confounding by smoking. And so you'd have to adjust for that.
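Her low-income example can be sketched numerically. This is a hypothetical sketch, with all counts invented, showing how a confounder that drives both the exposure and the outcome can manufacture a crude association that vanishes once you look within income strata:

```python
def odds_ratio(a, b, c, d):
    """Odds ratio from a 2x2 table: a=exposed cases, b=exposed non-cases,
    c=unexposed cases, d=unexposed non-cases."""
    return (a * d) / (b * c)

# Invented counts: within each income stratum, lack of exercise and
# heart attack are unrelated (OR = 1.0), but low income raises the
# chance of both the exposure and the outcome.
low_income  = (40, 60, 20, 30)   # more non-exercisers, more heart attacks
high_income = (5, 45, 10, 90)    # fewer of each

# Collapsing the strata gives the crude (unadjusted) table.
crude = tuple(x + y for x, y in zip(low_income, high_income))

print(odds_ratio(*low_income))        # 1.0
print(odds_ratio(*high_income))       # 1.0
print(round(odds_ratio(*crude), 2))   # 1.71 — spurious, pure confounding
```

The crude odds ratio of about 1.71 is entirely an artifact of mixing the two income groups, which is exactly the "little arrow to each" picture she describes.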
So it comes to a point where, if you're studying heart attack, we already know everything you at least have to measure as a confounder. And you're probably wondering, well, what if I do a study of heart attack and I measure lack of exercise, and then I also measure other things we know cause a heart attack, like tobacco? Well, if I wanna study tobacco, then I'd say a priori: my exposure is associated with, or causes, or prevents my outcome, after controlling for confounders. If I was studying tobacco, I'd say among, we'll just say Floridians, among the group I'm studying, tobacco use causes heart attacks, or is associated with heart attacks, and it depends on your study design, which I'll get to, after controlling for confounders. So in my a priori hypothesis, I'd say, okay, I'm fingering smoking, after adjusting for confounders. But then I could take the same data and do the same analysis. I could even just basically take the same analysis. People don't really do this, and I'll tell you why. But I could take the same regression model and just make a new a priori hypothesis. Of course you can't do that, right? Because the model's already there. So it can't be a priori, it would have to be a posteriori. But I could take the same regression model and just look at the lack of exercise slope and say, oh, that's my exposure instead of the tobacco one. So this is why a priori is really important: you want to make this hypothesis. You want to write down, what is your exposure? What is your outcome? Who is your subpopulation? Which usually is sort of structurally determined. And what are all my confounders? And you've got to be able to measure all that or get all of that. Like if you're analyzing data from a health insurer, you're going to have to get all of that data. And that's where a problem can often come when you're using administrative data.
Like, one of the things people fantasize about is medical record data. They're like, oh, this is really good. But one of the problems with medical record data is there are a lot of confounders that I need to know about when I'm doing a public health analysis that they don't do a really good job of collecting in medical records, because they don't need them. A lot of times use of tobacco, drugs and stuff, that's not very well characterized in a medical record, and that's fine for medicine. It's just not good for public health. So you have to be really careful about your data, that you can make sure to operationalize all these pieces. If you're like, whoa, I've heard this before, it's because I say this a lot in my LinkedIn Learning courses on study design, because obviously that's what we're talking about. All right, so does everybody understand what we're talking about here? We've got this a priori hypothesis, and we've designated an exposure, the thing we think causes it. We've got all these confounders. We're going to put them in a regression model, and now our main question that we're trying to answer is: did the exposure actually cause the outcome, after adjusting for everything, right? So let's go to the next slide. Okay, so just to remind you, in your statistical model, you make your outcome the dependent variable. So if it's heart attack, what is it going to be? It's going to be binary. So what are you going to use? Probably logistic regression, but if you have time-to-event data, you might use a survival analysis. That's a whole nother topic. For anybody out there who ever does that with survival analysis, I encourage you to always compare it to a vanilla logistic regression model.
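The setup she's describing, one binary outcome, one exposure slope, plus confounder slopes, is just a logistic model. Writing it out with the variable names from her running example:

```latex
\operatorname{logit}\big(\Pr(\text{heart attack} = 1)\big)
  = \beta_0
  + \beta_1\,(\text{lack of exercise})
  + \beta_2\,(\text{low income})
  + \cdots
  + \beta_k\,(\text{confounder}_k)
```

Here \(e^{\beta_1}\) is the adjusted odds ratio for the exposure, the one slope the a priori hypothesis is actually about; the other \(\beta\)'s are the confounders she says you "beat up" the exposure with.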
Just fit your survival analysis however you want, and then take the same independent variables and put them in a logistic regression, and just look at the difference in slopes. Because if you're using a Cox proportional hazards regression, you're looking at hazard ratios, and think about it: does anybody really understand hazard? Like, that's weird. People don't have hazard. People have odds, we see that every day, but people really don't have hazard. I just think that's weird. And so I often see, when I do this with logistic regression, I feel like the hazard ratio seems to overstate a lot of the slopes. And I've seen that happen elsewhere, like there's other ways I think you can blow up these slopes. I don't know why, maybe it's a problem and maybe it's not. Like, you could say, well, your logistic regression slopes are too small, but the slopes I see from logistic regression I just feel are more logical, which I'll get into, because that has something to do with interpreting your model. Like, one of the things you should just realize is, if you do a regression model and it looks goofy, it is goofy. So what does goofy mean? Either you screwed up, there's something screwed up with your data, something wrong. Or there's nothing wrong and there's something really weird going on in the world. And it's your job to figure out which of those two is happening, okay? And most of the time it's the first one, but every once in a while there's something goofy going on in the world. So you really have to do like a forensic investigation. But, so, simplistically, here we go. You've got some regression model. The outcome, if it's binary, it's some survival analysis or logistic regression. If it's continuous, you're probably doing linear regression. And then you make the exposure an independent variable.
So it can be categorical, it can be multi-level categorical, it can be continuous. I don't like continuous exposures, you know how I am, but you can use them. And then the other independent variables you put in your regression model are your confounders. And that's what you're using to fit the model. The exposure is kind of like, you had to buy the exposure with your a priori declaration, you're like, okay, I'm stuck with this exposure. So you put the exposure in the model and it just gets beat up. You just beat up that exposure. You keep throwing, you know, I called it forward step-wise, but some people say that's not the right name, it's bidirectional. And I'm trying to remember what Hosmer calls it, you know, Dr. Hosmer, the statistician, selective or step-wise selection, I don't know. You just put in these confounders, you've already decided which confounders are candidate confounders for the model, and you've got them ready. And you just keep putting them in and taking them out, and you'll see my modeling process, I write about it. But you know what I can do is, when I go through my live streams I usually annotate, like I put something in the description, I'll link to this article I use. It's actually by a guy named Bursac, but Hosmer's at the end of the author list. And he calls it step-wise selection. And so what are you doing? You're putting in these confounders and you're beating up that exposure variable. And at the end, maybe the confounders eat up all of the significance and the exposure slope is just not significant, like, you set alpha at .05. And you can set alpha for your confounders, for keeping them, at .10 or something more loosey-goosey. It doesn't really matter. You just wanna make sure you got everything in your model, because what's left for your exposure is what you're gonna interpret, right?
And if your exposure is not statistically significant in your final model, you were wrong, so you don't reject the null, you keep the null, keep it around. If it's statistically significant at the end of your modeling, then you keep it. But that all happens before we do causal inference, because if it's not statistically significant when you're done and you followed your modeling rules, you've got the null, so you have nothing to interpret. But, whatever modeling process you choose, whatever you do, at the end you're gonna have this regression model with a bunch of slopes in it. You're gonna have the exposure, your confounders, and that's it for your slopes. And they're all regressed against your outcome, however you did it. And so the slope for your exposure is statistically significant, or it's not, and remember, you have to set this all up at the beginning, what your alpha is and all that and your modeling process. You gotta say all that before you do it. I always write it down. I write it down and I show the other person, I'm gonna do this, I'm gonna do this, because otherwise people get mad. But anyway, it's good to write it down because sometimes you forget. But when you're done, if the slope for the exposure is not statistically significant, you're wrong about your hypothesis. Do not reject the null, keep the null, null's good. But you might notice, if you use step-wise selection, it's easy to tell what ate it up. Like if your crude model, meaning your unadjusted model, your model with just the exposure and outcome in it, has this huge slope, and then you start adding these confounders and they eat up the slope, you kinda know which confounder did it, right? If you're playing with it. And so that kinda tells you that your idea was confounded by that. It's that confounder's fault, probably. In fact, that confounder probably has a big slope. So that's why your modeling process is so important.
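One way to watch a confounder "eat up" a crude slope, without refitting a whole stepwise model, is the classic stratified Mantel-Haenszel calculation. She does this inside stepwise logistic regression; the sketch below is just the stratified analogue of the same comparison, with all counts invented:

```python
def mantel_haenszel_or(strata):
    """Mantel-Haenszel summary odds ratio across a list of
    (a, b, c, d) 2x2 tables, one table per confounder level."""
    num = sum(a * d / (a + b + c + d) for a, b, c, d in strata)
    den = sum(b * c / (a + b + c + d) for a, b, c, d in strata)
    return num / den

# Invented tables for exposure vs. heart attack, stratified by smoking.
# Within each stratum the exposure-outcome OR is 1.0.
smokers    = (30, 30, 20, 20)
nonsmokers = (10, 90, 30, 270)

a, b, c, d = (x + y for x, y in zip(smokers, nonsmokers))  # crude table
print(round(a * d / (b * c), 2))                  # crude OR ≈ 1.93
print(mantel_haenszel_or([smokers, nonsmokers]))  # adjusted OR = 1.0
```

The crude slope looks big, but the smoking-adjusted estimate collapses to the null, which is exactly the "which confounder did it" diagnostic she describes.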
I'm not into this Monte Carlo, set-it-and-forget-it modeling where it just tells you what the results are. I like to actually model, so I can see who's fighting for what, for the variance, as I'm putting stuff in and out of the model. But if the slope for your exposure is statistically significant, you definitely have to interpret it. Then you interpret the slope, okay? Even if your slope was big, if it was not statistically significant for your exposure, you throw it away, throw it away. Don't interpret it, you're not allowed to. It's illegal. But if it is statistically significant, it's illegal not to interpret it. You have to do it right, right? And so causal inference, that's what we're doing. And you could always interpret the other slopes for fun. I say for fun. If you're writing a paper where you're trying to answer your hypothesis, you really have no obligation to show the other slopes. However, I've found when I review papers and I can't see the other slopes, I'm like, okay, I don't know the context of the slope. And I've also had situations where I don't show the other slopes and the reviewers, you know, 'cause I'm writing, they say, where are the other slopes? I'm like, all right, here. You know, it's just a huge table, but it's a useful table. So I don't know, it just kind of depends on who your audience is. Let me see if there's any questions here. In engineering, we refer to goofy data as noise. That's what Joe says. Well, we've got noise too, but our data is your noise, Joe. You're right, goofy data in engineering would be noise, because it's an artifact of something that happened. What noise is for us is more like, if you're analyzing healthcare encounters, which are records of somebody coming to healthcare and getting some sort of, you know, health thing.
If you're looking at those and you find a bunch of goofy encounters, they're usually important. They're not noise. They're doing something. But sometimes they are noise, like they're just something that comes out of some automated process that you're supposed to remove from the data. So one person's noise is the other person's analytic table. That's how messed up everything is in healthcare. So you guys, for all your foibles as engineers, you have a lot of really organized protocols that we would love to have in healthcare. But the problem is, you know, healthcare, people are biological and they're just so hard to predict. I mean, just trends in surgeries, trends in cancers, trends in drug addiction, you know, humans trend in things, and they aren't like machines that just get out of a standard or whatever and you have to adjust them. You cannot adjust humans. Trust me, I've been in public health a long time. All right, so now what I'm gonna do is talk to you about the Bradford Hill criteria, okay? Now, Bradford Hill was this guy who was kind of a philosopher. So what does that mean? It means he came up with these ideas, but he himself did not do, like, evidence-based anything about them. He didn't go and gather data and do a bunch of analytics. But his ideas were so good, you know, he published them, I don't know, I never read his original stuff. But his ideas were so good, people started using them and just using them. And as a result, different kinds of people, doing different kinds of things, found different ways of using them that were useful. And in fact, if you go on the web, like, Daniel's here, I went on the web and I gave him this. Oh, I'm so happy to see you, my good friend. I could always count on you to make my day. Oops, I'm not good at this.
All right, well, I went on the web, let me actually show you what I found, because I wanted to give Daniel a bigger view of what this is. So in epidemiology, you'll see the following, you'll see these in a totally different order than what I'm gonna show you. And some of them are different, they're even stated differently. And this is Elwood's criteria, so Elwood riffed on this, right? So other people, I'd never even seen this before, other people riffed on it. What I'm trying to impart to you, and I'll give you that link too, is that there's no real official version. Like, there's the official Bradford Hill criteria that he came up with and said a bunch of things about, but if you go read it now, you'll be like, this is weird. I mean, some of the stuff he says sounds great, and some of it just doesn't even sound right. So that's why people don't read it anymore, because after he said it, everybody tried it and worked with it, and he was in epidemiology. So this is why it was applied in epidemiology first. So we've been beating up the Bradford Hill criteria for a long time, and then all of us have our favorites in our own way. Just like in statistics, like, I'm wary of survival analysis in human studies. Of course, in Joe's studies, his engineering studies, survival analysis is perfect, it's what it's for, right? In my studies, I don't know, because our outcomes, like if you get diabetes, how do you know? When do you go get diagnosed? You probably got diabetes earlier than that. So what does time to event even mean? I could go on and on. But anyway, so we're in public health, we are taught these Bradford Hill criteria in our MPHs, but to be honest with you, I don't really see people in the literature applying them, because you get these regression equations all the time in the literature.
I don't see people talking about it a lot. Like, when I'm talking with my colleagues and I'm fitting regression models, I bring it up all the time, but in a way, I kind of threw out a lot of the criteria that were just not really working out for people. And so you'll see that webpage sees it differently, I see it this way. So I'm gonna share with you how I do it, and I think the way I do it is really practical, because what it does is it cues you up for the discussion section, for results and discussion. You know how to make an argument about your exposure, because remember, you're only applying this if your exposure is significant, so you're trying to make people agree that your exposure is important. Now, how important? That's what this is about. So these are the five criteria I'm gonna talk about, and if you're like, oh, you're calling that the wrong name, or you gave that the wrong definition, different names and different definitions are floating around. I'm just saying this so you know what it is, okay? And I put them in order, because one of the things they don't really teach you is that if you apply them in order, it's easier to make a decision about how you feel about the exposure. It's just like if you're rating something on multiple scales, right? It's easier if the first thing you rate is the most important thing to you, because then it sort of tells you how you're gonna deal with the other things. Like if I'm grading papers, that's what I'll do: I'll grade the most important stuff first and then worry about the other stuff. So the first one I always apply is temporality. Now remember, this is about the exposure-to-outcome relationship. So did the smoking precede the heart attack? Did the lack of exercise precede the heart attack? Or could they have happened at the same time? Or could the heart attack have happened first, and that led to the smoking or the lack of exercise?
You're probably wondering, how could you make such stupid mistakes in epidemiology? Well, certain study designs don't allow you to do a really good measurement of what order the exposure showed up in and the outcome showed up in. Some study designs just don't do that. So I'll tell you, because I'm gonna go through a few examples of how I handle that, okay? So temporality is always your assessment of how likely it was that the exposure preceded the outcome. If you think it's pretty likely, then I'd be like, okay, I'm feeling really good about this significant exposure. I'm feeling like this is really a cause, okay? If I'm iffy about it, I'll be like, okay, well, it's on probation, I don't know. Obviously, if I knew it didn't, I wouldn't have done the study. So I'm either iffy about it or I think it did. The next one is called dose-response gradient. And I'm gonna explain. Dose-response gradient is, if you have a gradation of exposure, like I talked about smoking, like how much you smoke per day, how much you drink, or, for some reason, it's not just yes/no, it's how many fruits and vegetables you eat. If you measure your exposure that way, and if you have a continuous exposure, it's good to make a few categories, or even I'll take a measurement like blood pressure and cut it up into a few categories, then you can look at the dose-response gradient. Because that's the idea that the more exposure you have, the higher your risk of the outcome. So the more you smoke, the higher your risk of heart attack. Usually smoking is sort of a threshold thing when it comes to tobacco, but something like alcohol consumption is on a gradation. You know, some people just socially drink more than others. And then you can really see associations. Like, what I always see is people who drink alcohol, they have less periodontitis.
And I have so many associates from Saudi Arabia in the oral health field who ask, why is that, Monica? I'm like, they're drinking mouthwash, I think. Nobody's really teased it out, but you'll always see it. So dose-response gradient is: the more you have the exposure, the higher your risk of the outcome. Or if it's inverse, the more exposure, the lower your risk of the outcome. The more training you have, the lower your risk of an accident, something like that. One of the problems is that when I get to this criterion, a lot of times I can't assess it, because the exposure isn't measured in a gradient. So sometimes you just can't use the second one. Okay, then the third one I get to is strength of association. And this is super important. All this is, is how big the slope is. I'm usually doing logistic regression, so it's usually odds ratios, and I'm really into odds ratios over 1.5; to me, that's a big slope. But I temper this, because it's important to look at the other slopes in the regression too. Case-control is one of the study designs we do this with. It's the design you use when you have rare outcomes, when you don't have a lot of outcomes, and it's also what you use in outbreak investigations. And you're actually not modeling the odds of the outcome. You still use the odds ratio, but you're modeling the odds that the people with the outcome had whatever exposure you're testing. So you just throw a bunch of models down, because you're trying to figure out: was it the frosting? Was it the cake? Whoever got sick at the wedding, what was the food? What was it contaminated with? Who ate what and got sick? You just keep putting the different exposures in and seeing which ones have a goofy high odds ratio, like 88, you know?
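The quantity she keeps recomputing in that outbreak scenario is the exposure odds ratio from a 2x2 table. A minimal sketch, with invented wedding counts (nothing here comes from a real investigation):

```python
# Hedged sketch: the odds ratio from a 2x2 table, OR = (a*d) / (b*c).
# All counts below are made up for illustration.

def odds_ratio(exposed_cases, unexposed_cases, exposed_controls, unexposed_controls):
    """Classic 2x2 layout: cases vs. controls, exposed vs. unexposed."""
    return (exposed_cases * unexposed_controls) / (unexposed_cases * exposed_controls)

# Did the frosting make people sick? (hypothetical numbers)
# 40 sick people ate frosting, 5 sick didn't; 10 well ate it, 45 well didn't.
print(odds_ratio(40, 5, 10, 45))  # → 36.0, a "goofy high" odds ratio
```

In a real analysis you would get this from a logistic regression with adjusters rather than a raw table, but the raw version shows why one food jumps out at you.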
And so in a case-control study, those of you who actually study this stuff know that your odds ratio estimates are super biased and super dramatic. But what can you do? You need it. You've got to do an outbreak investigation. And I still say, even when you're not doing case-control, you have to look at that slope. If it's big, it's a big deal. It really probably is a cause if it's big. I hate to be so simplistic, but that's it. Now, especially if you get a big odds ratio or a big estimate, number four is really important, because number four is consistency. You just want to see: you're getting a big odds ratio, but do other people get big ones too? I keep saying odds ratio, but whatever your estimate is, does it look like what other people are getting? And that's kind of a weird question, because, like I was saying about Saudi Arabia a minute ago, you're going to find different things in different countries. People just smoke like chimneys in some countries, and people aren't really smoking in the US now; they're vaping, when it comes to tobacco. So you're not really sure what you're going to get, but you really want to convince yourself about this big estimate by looking in the literature and even talking with other people who analyze data like that. Finally, the fifth one, I call it coherence. I think that if it sounds stupid, it probably is, okay? So I'll give you an example that sounded stupid; I couldn't find the article. People had taken dietary data from a surveillance study, I think it was in Europe, and they had also gotten outcome data on deaths. So the dependent variable was death, and for the predictor variables, the independent variables, they picked potato consumption for some reason as the exposure.
And they found that if you eat potatoes that are baked, you had an increased risk of mortality, or maybe it was the other way: baked it was and fried it wasn't, or vice versa, I can't remember. But there was some huge estimated risk if the potatoes were cooked one way but not the other. And it was in a European country where they're not doing anything weird with potatoes; it was Britain or something. And I was just like, I wouldn't have reported that out. That just doesn't even make sense. And I've seen a lot of this. I used to work for the Army and ran a data warehouse there, and I looked at a study that had been done by the team that ran it before me. To their credit, that team was very smart, and whoever wrote this wasn't really that connected with them. This was a really dumb article, but it said that smoking among Army people causes suicide. And what I think happened is it was confounded by depression. I think people who were depressed in the Army were smoking more, and that showed up as this high odds ratio for suicide. It was a coherence problem. It just didn't make sense. And people say to me, Monica, it got published. And I'm like, yeah, there's a lot of crap in the literature. You've gotta be careful. You might get published. So anyway, I was just looking back to see if you guys have any questions. Okay, so now we're gonna go on and I'm gonna show you two examples. So this is an article, and I actually pulled these out of my library, because I do a lot of writing and a lot of analysis lately. This is from an article, "The association of marijuana use with oral HPV infection and periodontitis among Hispanic adults." So we can already see something, right? We can see that there are probably two models in this.
There's gonna be one where marijuana use is the exposure and oral HPV infection is one of the outcomes, and periodontitis is gonna be another outcome, among Hispanic adults. That's our group. So I read the article and I'll link you to it. I have nothing to do with it; I'm just interpreting. These are community-dwelling Hispanics in Puerto Rico. Okay, so that's our subpopulation. The exposure was marijuana use. Now, let's just talk about that measurement, right? This was in 2018, so it might not have the problems of the old measurements from the nineties with just this kind of classification. Who's gonna say that they use marijuana if they're gonna be a victim of the drug war? No one. So you can't study drug use while a drug war is going on. All of our old data, who knows what it means. But here in 2018, maybe they got good answers. I forget, I should have read the article more closely on marijuana restrictions there. But anyway, they got this measure of marijuana use: no use, occasional, frequent. And no use is gonna be the reference. They did logistic regression because it was a cross-sectional study, right? They did some sort of survey. Here were their outcomes. In their first model, they just said what they put in as the adjusters. I can't remember how many people were in this; it wasn't a very big study, so it doesn't surprise me there aren't a lot of variables in here. When you have a big dataset, you really wanna adjust out everything you can. So they put in sex, age, healthcare coverage, oral sex partners, periodontal disease. I guess these are the confounders you use. I've never really studied this topic before; I don't know why I have this in my library. But anyway, it seemed like a good study. Now, I'm gonna go over here.
This is the odds ratio and the 95% confidence interval. See how it says crude and adjusted? See how these are bold? They shouldn't be bold. This shouldn't be interpreted, okay? This is the unadjusted odds ratio. It's okay to show it, it's nice to show it, because see how it's huge here, and after you put all the stuff in, it's not even significant anymore, right? And that's what's kind of nice about putting in the confidence intervals for your outcome's odds ratio: if the confidence interval contains one, you know it's not significant, right? So here's the estimate. It looks like if you have occasional use compared to no use, you have 61% higher odds of getting oral HPV infection, right? I'm interpreting this, but that's illegal, I should go to jail, because this is not statistically significant. So it really didn't matter whether they smoked. And here, this one is even a little weird: see how huge it is, the lower bound is down near zero and the top can go way up toward infinity. So this is not news. But let's move on to periodontitis. Now, here's the story with periodontitis. If you smoke tobacco, it rots your gums. I could go into the detail about what "rot" means, but that's basically what happens. So people who smoke lose their teeth because of that. If you happen to smoke tobacco, quit. But if you can't quit, constantly brush your teeth. I actually know some people who do that in Massachusetts, where we try to be very healthy, but it's better to just quit. However, we do not know what happens when you smoke marijuana or vape marijuana or vape tobacco. We don't know what happens periodontitis-risk-wise. So I think that's maybe why this was in my literature library, because I was looking for evidence of what that does. Because I have this hypothesis that there's a sort of antagonistic interaction, because there's anti-inflammatory stuff in weed smoke.
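The "does the confidence interval contain one" rule she applies can be written down as a tiny check. This is a sketch in her spirit, not her code; the numbers echo the kind of results she describes (an OR of 1.61 that crosses one, and a 2.90 whose lower bound sits just above one), not the paper's exact table.

```python
# Hedged sketch of the significance check described above: for an odds
# ratio, if the 95% CI contains 1, don't interpret the estimate at all.

def interpret_or(or_est, ci_low, ci_high):
    if ci_low <= 1.0 <= ci_high:
        return "CI contains 1: not significant, do not interpret"
    pct = (or_est - 1.0) * 100
    direction = "higher" if pct > 0 else "lower"
    return f"{abs(pct):.0f}% {direction} odds (significant)"

print(interpret_or(1.61, 0.80, 3.20))  # looks big, but the CI crosses 1
print(interpret_or(2.90, 1.10, 7.60))  # lower bound above 1, wide interval
```

The second call is the situation she interprets: a wide interval (few frequent users) but a lower bound above one, so the estimate is still worth talking about.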
So the gum rotting won't be as dramatic, that's my hypothesis, but anyway. So when we look at this, again, the crude odds ratio, I just reject it, I wouldn't even interpret it; I think that's a mistake. The adjusted one has all this in it, and this is a lot. So I feel good. I feel like occasional use doesn't do anything, but it looks like frequent marijuana use, and I'm assuming this is smoking because this was in 2018, gives you a really big estimate. This is strength of association, okay? So now I'm gonna interpret this 2.90, okay? The bottom of this confidence interval is above one, a little bit above one. The top is really far up. And what that means is they probably didn't have that many frequent smokers in here compared to the reference group, so you've got a wide confidence interval. The estimate is almost three. It's in there somewhere, I believe it: if you did 100 of these studies, only about five of them would have a 95% confidence interval that misses the true odds ratio. So I really believe that there's something going on here, and because of that, I'm gonna interpret this estimate. So let's go back, whoops. So temporality, that's the first thing. Do I think they were doing frequent marijuana smoking before they got severe periodontitis? Well, I know that severe periodontitis takes a while to get. You don't get it in one day. It's a chronic disease, and it takes a while before you get it. And I also know that people who smoke marijuana frequently tend to continue frequently. They don't drop to occasional; they quit or they keep it up. Also, people don't get periodontitis, unless they have a genetic disorder, until they've beaten up their teeth over their 20s. So everything I know from the literature suggests that the marijuana use would have preceded the periodontitis. So I really believe that.
I believe their marijuana use preceded the development of their periodontitis. So, sorry, I keep flipping this around, I'm gonna say, yeah, I believe that, all right? Our next one is dose-response gradient. The more marijuana, the more periodontitis? We don't know, right? Right here, it doesn't look like it. What it looks like here is a bent-stick model, a threshold where over a certain amount, like if you do it every day, the risk kicks in. And that would be kind of consistent with what I know. I know I'm skipping ahead, but it seems like the estimates for tobacco use and periodontitis are bigger than for marijuana, even for the same amount of use. That's why I have this hypothesis of the antagonistic interaction. Okay, now strength of association, but I already talked about how big that association was: 2.9, close to three. Even with wide confidence intervals, I believe it. And then coherence: if it sounds stupid, it probably is. Well, this does not sound stupid to me. This is actually really coherent with my hypothesis and with what I've seen in the literature. So I would interpret this and I would believe it, that this is a causal thing: frequent marijuana smoking among community-dwelling Hispanics in Puerto Rico probably increases their risk of severe periodontitis. I would tell them that. I'd say cut down; if you're occasional, you're probably okay. That's what this is, all right? Any questions before I move on? All right, let me go back. So now we're gonna do the next study. The next study is chronic kidney disease: "Chronic kidney disease and use of dental services in a United States public healthcare system: a retrospective cohort study," okay? So I'll just remind you, there are three kinds of studies that I usually have a regression equation for.
The first is cross-sectional, and that's when all of the variables are measured at the same time, or within, like, a week of each other. You might give people a questionnaire and the next day they go to the doctor and you measure something; it's all within the same period. So you're getting point estimates, and that way you don't know what came first, the exposure or the disease. You can ask them questions, like, did you start smoking marijuana when you were young? You can do that, but then there's recall bias and all these issues. Okay, that's cross-sectional. The one I was talking about, which I didn't have an example for, is the case-control. How you do that one is you get your cases, a list of cases like in an outbreak investigation, people who got sick, and you get a set of controls, which are people who, in a parallel universe, would have gotten sick and been identified if they had done what the cases did. I know that's sort of counterfactual, philosophically brain-numbing. But that's actually what you're trying to do: you're trying to get controls that meet this criterion, that if they had just had the exposure, they would have been a case and you would have found them. Okay, that's hard to explain. But anyway, in the marijuana study, the odds ratio was the odds of the outcome, severe periodontitis. Severe periodontitis is actually not that common, so I could do a case-control study asking if marijuana use causes severe periodontitis, but then I would start with people with severe periodontitis, go get controls that didn't have it, measure everyone's marijuana use, and I'd have an exposure odds ratio. But if this were an exposure odds ratio, I'd still interpret it the same way, because, it's huge, whatever. I'd just say that it's an exposure odds ratio, not an outcome odds ratio. Now, this one here is the third type, which is a cohort study.
It's a longitudinal cohort study. We love these, okay? We absolutely love these in public health, because, well, first of all, they're big and they're expensive and you get a lot of papers out of them and they're fun to work on. You have a job for years. They're just the awesomeness. But another reason to love them is that temporality is no problem. You always measure the exposure before the outcome. In fact, once participants get the outcome, they're kicked out of the study, only technically, when we do big cohort studies, we have multiple outcomes. So we're like, oh, you got a heart attack, that's too bad, but stick around for your Alzheimer's disease. Because then we can do nested case-control studies with whoever gets Alzheimer's disease or whatever. So in this cohort study, the exposure was actually a disease. Sometimes the exposure is a disease; I have a confusing field. The exposure was having chronic kidney disease, or CKD. They got a bunch of people from the subpopulation over on the left side there, the San Francisco Department of Public Health Community Health Network clinical databases. So these are people who are using public health services, community health services, in San Francisco and are in the database, right? This was a database study. Look it up, you can see how they did it, the data science of it, all right? So they got this database of people and they pulled out people who had chronic kidney disease and people who did not have it: people with the exposure and without the exposure. And then their outcome was having at least one dental visit during the study period. So the outcome wasn't a disease; they were trying to ask, is having chronic kidney disease a risk factor for not going to the dentist, or are you more likely to go to the dentist if you have chronic kidney disease? That was their question.
And then you'll see their confounders over there on the right, but I wanna show you how they did their model. Oh, I'm sorry, they used hazard ratios, okay? They did survival analysis, because they had this whole database; they knew all the transactions, all the encounters these people had, right? So their unadjusted model, that's the first one up there, just had chronic kidney disease in it. And their hazard ratio, which you can interpret like an odds ratio, was 0.6. So that meant chronic kidney disease looked protective against the outcome, having a dental visit. People who had chronic kidney disease were "protected" against having a dental visit, which means they didn't have one. See, whenever I analyze data like this, I always make the outcome the bad thing. I just make it bad. So for me the outcome would have been "had no dental visit," and then there would have been a high ratio. But here, you could flip this: I could take one divided by 0.6, the reciprocal, and get the ratio for not having a dental visit. But this one is for having a dental visit. So imagine having a dental visit were bad; you'd like to have chronic kidney disease, because you'd only have 60% of the hazard of having a dental visit. But the problem is, if you have kidney disease, you want to have dental visits, and having kidney disease apparently reduced that. But that's the unadjusted estimate. I would never interpret that, right? Then you see how they presented it. They added age, then gender and race and ethnicity and language, insurance and monthly income. My understanding from reading the article is they just kept entering them and showing us the exposure hazard ratio, which you can do; that's another way of doing it. If it were me, I'd only interpret the last one, the full model.
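The "flip the reciprocal" move she describes is mechanical enough to show in two lines. A minimal sketch: the 0.6 is the stream's unadjusted hazard ratio, but the confidence interval here is invented for illustration, and note that taking reciprocals swaps which bound is which.

```python
# Sketch of "flipping" a protective estimate: the reciprocal of a hazard
# (or odds) ratio restates it for the opposite outcome. The CI bounds
# swap places when you flip. The CI values below are made up.

def flip_ratio(estimate, ci_low, ci_high):
    """Reciprocal of a ratio and its CI; bounds trade places."""
    return 1 / estimate, 1 / ci_high, 1 / ci_low

hr, lo, hi = flip_ratio(0.6, 0.45, 0.80)
print(f"HR for NOT having a dental visit: {hr:.2f} (95% CI {lo:.2f}-{hi:.2f})")
# prints "HR for NOT having a dental visit: 1.67 (95% CI 1.25-2.22)"
```

So a "protective" 0.6 for having a visit becomes roughly 1.67 for the bad outcome, not having one, which matches her habit of always making the outcome the bad thing.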
But as you can see, because they were so nice and showed us all the behavior of the hazard ratio, once they put age in, it really didn't change much. And I believe it. I would have flipped the ratio, but at the end, my interpretation is: among patients in the San Francisco Department of Public Health Community Health Network, after adjusting for everything, if you have chronic kidney disease, you're only 75% as likely, compared to those who don't have it, to have a dental visit during the study period, which I don't remember how long it was. Okay, and I believe it, right? I believe having chronic kidney disease delays you from getting dental care, because if you think about it, imagine you do have kidney disease. You just feel like crap; it would be hard to do anything, much less go to the dentist. Okay, so temporality, remember what I told you? This is a cohort study. So what they did, and they used a database, was pick a year a while back. If I were doing it today, in 2022, I'd pick like 2015 or 2016 or 2017: long enough ago to get the data, but close enough in time to be relevant. And I'd just sample people who existed then, exposed or unexposed, not cases and controls. And then I'd follow the exposed and unexposed, and you'd have to clear out anybody who had already had a dental visit, only I don't think you would in this case. Usually you can only have an outcome once, like getting diagnosed with diabetes; I guess you could be diagnosed twice, but the first diagnosis can only happen once. So you have to clear those people out, but I guess here they probably didn't clear out anybody. And then they just followed them for some period of time through the data.
And what you do is you just get the IDs from the list from the year you start, like 2015. Then you take those IDs and go get the transactions, the encounters, from all the other data sets, just for those people. You've got the exposed and unexposed in there, and that's your analytic data set. You don't need anything else. You just need to make sure you've got the values for all the confounders, because if you don't, you're gonna have to talk about that in your paper: maybe there's some bias because you had to get rid of some records, or you imputed data, which I hate doing; I don't do that. All right, so for this one, the chronic kidney disease did precede the not going to the dentist. Dose-response gradient: can't really look at this. With the marijuana thing you could, but with this you can't; you either have chronic kidney disease or you don't. I will sometimes use something like, how many dental visits have you had in the last few years? Because you're supposed to go once a year. But they didn't do that, so we don't really have it. Strength of association: it's hard for me when it's on the protective side, but I still think being in the 0.7 range is pretty meaningful. It makes me wanna take the reciprocal, but I didn't do it. Anything in the 0.9 range, when it's a protective, lower-than-one ratio, I always say is kind of close to one. But this one's 0.7, and the protective side only runs from zero to one, so it doesn't have a very big space. Then consistency: there's this association between people having diagnosed diseases and people attending appointments. Most of the time, when I see a cohort of people who have diabetes or hypertension or something, they're more likely to go to their dental visits, because they go to other visits. I just think with chronic kidney disease, these people are maybe less likely, because they're already relying on the public system here.
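The data pull Monica describes, take the baseline-year IDs and then keep only those people's encounters from the other tables, can be sketched in plain Python. Everything here is hypothetical: the table shapes, the person IDs, and the visit types are invented stand-ins for the database she's talking about.

```python
# Minimal sketch (invented data) of the cohort pull described above:
# take the IDs from your baseline year, then keep only those people's
# rows from the encounter tables.

baseline = [  # made-up baseline roster: (person_id, exposed_flag)
    ("p1", True), ("p2", False), ("p3", True),
]
encounters = [  # made-up encounter table: (person_id, visit_type)
    ("p1", "dental"), ("p2", "medical"), ("p4", "dental"), ("p3", "dental"),
]

cohort_ids = {pid for pid, _ in baseline}
analytic = [row for row in encounters if row[0] in cohort_ids]
print(analytic)  # p4 was never in the baseline roster, so it drops out
```

In practice this is a join keyed on the person ID against every encounter table you need, plus a check that the confounder columns are populated for everyone who made the cut.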
This is already a stressed population. And so I think that with this group, yeah, having chronic kidney disease just keeps them from going to the dentist, enough that we could do something about this ratio. We could maybe give them transportation; we would have to do some kind of root cause analysis. Maybe call some of the people back, or just call them, because if they're being served by these clinics, find out why: what can we do as the dental clinic to get you in here? All right, so that was my little demonstration for today. Oh, Eugene's here. Thank you for showing up. Oh, you thought it was a good explanation, I'm glad. Here, let me unshare. I'm probably gonna post these slides; I figured people would want them, but I didn't get a chance to. So just check back on YouTube. I always end up bringing up topics that I didn't think I would talk about, so I always end up putting links to things, and also the timestamps, so if somebody wants some information, they don't have to listen to the whole video; they can find where I'm talking about it. All right, I really appreciate everybody who showed up. And we still have a few minutes if anybody has any questions or comments, if you're still here, Joe, or if you're still here, Daniel, or a link to Toxification or anybody. If you have any questions or comments, I'd love to entertain them. But like I said, the Bradford Hill criteria, the way I apply them, is sort of quick and dirty, and it means I know what I'm gonna say in my discussion. I'll be like, well, this is a really big odds ratio, and I really believe it's true because the exposure preceded the outcome, or, I'm not really sure the exposure preceded the outcome, you know what I mean? I just talk about that stuff to try and convince people that my significant slope is a big deal in public health.
And what I think we should do about it. That's always the whole point in public health: what should we do about it? If you live in the US, when it comes to oral health, it would be nice to give people access. That's pretty obvious, but that ends up being the finding in most of my analyses in oral health: people need access to oral healthcare. In other countries, like Saudi Arabia, not that that's the perfect country in the world, but as an example, they have community dental clinics. Everything's not perfect there, but there's this idea of regularly attending to your teeth. Now, they don't have fluoridation in their water, so they have a lot of caries, which is what happens when your teeth rot because you don't have fluoridation. So they've got their own problems, but at least they don't have this access problem. That's our problem in the US. All right, well, thank you very much for showing up and having me interpret these models for you. Hopefully now you know how to do it yourself, and you can try out the Bradford Hill criteria for yourself. I hope you have a good weekend, this is Saturday, and have a lot of fun doing your analyses and making your regressions and doing your data science.